Skip to main content
SHARE
Publication

Dense LU Factorization on Multicore Supercomputer Nodes...

Publication Type
Conference Paper
Publication Date
Conference Name
26th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2012)
Conference Location
Shanghai, China
Conference Sponsor
IEEE
Conference Date

Dense LU factorization is a prominent benchmark used to rank the performance of supercomputers. Many implementations, including the reference code HPL, use block-cyclic distributions of matrix blocks onto a two-dimensional process grid. The process grid dimensions drive a trade-off between communication and computation and are architecture- and implementation-sensitive. We show how the critical panel factorization steps can be made less communication-bound by overlapping asynchronous collectives for pivot identification and exchange with the computation of rank-k updates. By shifting this trade-off, a modified block-cyclic distribution can beneficially exploit more available parallelism on the critical path, and reduce panel factorization's memory hierarchy contention on now-ubiquitous multi-core architectures. The missed parallelism in traditional block-cyclic distributions arises because active panel factorization, triangular solves, and subsequent broadcasts are spread over single process columns or rows (respectively) of the process grid. Increasing one dimension of the process grid decreases the number of distinct processes in the other dimension. To increase parallelism in both dimensions, periodic 'rotation' is applied to the process grid to recover the row-parallelism lost by a tall process grid. During active panel factorization, rank-1 updates stream through memory with minimal reuse. In a column-major process grid, the performance of this access pattern degrades as too many streaming processors contend for access to memory. A block-cyclic mapping in the more popular row-major order does not encounter this problem, but consequently sacrifices node and network locality in the critical pivoting steps. We introduce 'striding' to vary between the two extremes of row- and column-major process grids. As a test-bed for further mapping experiments, we describe a dense LU implementation that allows a block distribution to be defined as a general function of block to processor. Other mappings can be tested with only small, local changes to the code.