Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods

Abstract

In this article, we explore the implementation of complex matrix multiplication. We begin by briefly identifying various challenges associated with the conventional approach, which calls for a carefully written kernel that implements complex arithmetic at the lowest possible level (i.e., assembly language). We then set out to develop a method of complex matrix multiplication that avoids the need for complex kernels altogether. This constraint promotes code reuse and portability within libraries such as Basic Linear Algebra Subprograms and BLAS-Like Library Instantiation Software (BLIS) and allows kernel developers to focus their efforts on fewer and simpler kernels. We develop two alternative approaches—one based on the 3 m method and one that reflects the classic 4 m formulation—each with multiple variants, all of which rely only on real matrix multiplication kernels. We discuss the performance characteristics of these “induced” methods and observe that the assembly-level method actually resides along the 4 m spectrum of algorithmic variants. Implementations are developed within the BLIS framework, and testing on modern hardware confirms that while the less numerically stable 3 m method yields the fastest runtimes, the more stable (and thus widely applicable) 4 m method’s performance is somewhat limited due to implementation challenges that appear inherent in nature.

References

Page 1

	Year	Citations

Page 1