High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with partial pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.
2D Accelerators Algorithms Architectures Arrays Big Data Bootstrapping C++ Cache Partitioning Cancer Careers Chisel Communication Computer Architecture CTF DIABLO Efficiency Energy FPGA GAP Gaussian Elimination Genomics GPU Hardware HLS Lower Bounds LU Matrix Multiplication Memory Multicore Oblivious Open Space OS Parallelism Parallel Reduction Performance PHANTOM Processors Python Research Centers RISC-V SEJITS Tall-Skinny QR Technical Report Test generation