|Name:||Algorithms & Analysis (3): Optimizing the Scalable Direct Eigenvalue Solver ELPA on State-of-the-Art x86 Architectures|
|Time:||Monday, June 17, 2013, 1:35 PM - 1:40 PM|
|Room:||Multi-Purpose Area 4 (MPA 4), CCL - Congress Center Leipzig|
|Speakers:||Alexander Heinecke, TU München|
|Abstract:||ELPA is a new and efficient distributed parallel direct eigenvalue solver for symmetric matrices. It contains both an improved one-step ScaLAPACK-style solver (ELPA1) and a two-step solver called ELPA2. In both cases ELPA employs a ScaLAPACK memory layout. ELPA1 uses essentially the same algorithmic structure as ScaLAPACK (tridiagonalization through Householder transformations, solution of the tridiagonal eigenvalue problem, back-transformation of the eigenvectors), whereas in ELPA2 the tridiagonalization is replaced by a two-step variant: a banded matrix is first derived from the original dense matrix and is afterwards transformed to tridiagonal form. This yields higher scalability and thus a shorter overall time to solution.
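For orientation, the classical one-step reduction that ELPA2's two-step variant replaces can be sketched as follows. This is a minimal NumPy illustration, not ELPA code; a real two-step implementation would first reduce the dense matrix to band form and only then to tridiagonal form.

```python
import numpy as np

def tridiagonalize(A):
    """One-step Householder reduction of a real symmetric matrix to
    tridiagonal form (the classical ScaLAPACK/ELPA1-style approach;
    ELPA2 instead goes dense -> banded -> tridiagonal in two steps)."""
    T = np.array(A, dtype=float)  # work on a copy
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k + 1:, k].copy()
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue
        # Householder vector v with H = I - 2 v v^T and H x = alpha * e1
        alpha = -np.copysign(norm_x, x[0])
        v = x
        v[0] -= alpha
        v /= np.linalg.norm(v)
        # Symmetric similarity update of the trailing block: B <- H B H
        B = T[k + 1:, k + 1:]
        p = B @ v
        K = v @ p
        B -= 2.0 * np.outer(v, p) + 2.0 * np.outer(p, v) \
             - 4.0 * K * np.outer(v, v)
        # Column/row k collapse to a single off-diagonal entry
        T[k + 2:, k] = 0.0
        T[k, k + 2:] = 0.0
        T[k + 1, k] = alpha
        T[k, k + 1] = alpha
    return T
```

Since each step is a similarity transform, the tridiagonal result has the same spectrum as the input; the point of ELPA2's two-step variant is that the band-reduction stage maps far better onto Level 3 BLAS and scales better in a distributed setting.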
However, the two-step tridiagonalization also requires a two-step back-transformation of the eigenvectors. Both steps are compute-bound but cannot be formulated using Level 3 BLAS routines. Therefore, custom hardware-aware kernels were implemented that support Intel's Sandy Bridge architecture as well as AMD's Bulldozer architecture. These kernels make use of the new AVX vector extensions and, in the case of Bulldozer, of fused multiply-adds.
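The numerical payoff of a fused multiply-add (computing a*b + c with a single rounding instead of two) can be sketched without intrinsics by emulating a float32 FMA in float64 arithmetic. This is a conceptual illustration only, not ELPA's kernel code:

```python
import numpy as np

# Conceptual illustration (not ELPA kernel code): a fused multiply-add
# computes a*b + c with one rounding instead of two. Here a float32 FMA
# is emulated by doing the arithmetic exactly in float64 and rounding
# the final result back to float32 once.
a = np.float32(1.0 + 2.0 ** -20)   # exactly representable in float32
b = np.float32(1.0 - 2.0 ** -20)
c = np.float32(-1.0)

# Separate multiply and add: the exact product 1 - 2**-40 rounds to 1.0
# in float32, so the small term is lost entirely.
separate = a * b + c               # -> 0.0

# Emulated FMA: product and sum are exact in float64, and only the
# final result is rounded back to float32.
fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))  # -> -2**-40
```

Beyond accuracy, the practical benefit on Bulldozer is throughput: one FMA instruction retires a multiply and an add together, doubling the peak floating-point rate of the kernels relative to issuing the two operations separately.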
The poster presents a systematic performance analysis of ELPA on SuperMUC (a 3 PFLOPS Sandy Bridge cluster) and on a 1024-core AMD Bulldozer cluster. We test both real and complex data types, and compare ELPA's performance to the corresponding PDSYEVD and PZHEEVD routines from ScaLAPACK.