This paper derives tradeoffs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. Our theoretical model counts the amount of work and data movement as a maximum of any execution path during the parallel computation. By considering this metric, rather than the total communication volume over the whole machine, we obtain new insight into the characteristics of parallel schedules for algorithms with non-trivial dependency structures. The tradeoffs we derive are lower bounds on the execution time of the algorithm which are independent of the number of processors, but dependent on the problem size. Therefore, these tradeoffs provide lower bounds on the parallel execution time of any algorithm computed by a system composed of any number of homogeneous components each with associated computational, communication, and synchronization payloads. We first state our results for general graphs, based on expansion parameters, then we apply the theorem to a number of specific algorithms in numerical linear algebra, namely triangular substitution, Gaussian elimination, and Krylov subspace methods. Our lower bound for LU factorization demonstrates the optimality of Tiskin’s LU algorithm  answering an open question posed in his paper, as well as of the 2.5D LU algorithm which has analogous costs. We treat the computations in a general manner by noting that the computations share a similar dependency hypergraph structure and analyzing the communication requirements of lattice hypergraph structures.
2D Accelerators Algorithms Architectures Arrays Big Data Bootstrapping C++ Cache Partitioning Cancer Careers Chisel Communication Computer Architecture CTF DIABLO Efficiency Energy FPGA GAP Gaussian Elimination Genomics GPU Hardware HLS Lower Bounds LU Matrix Multiplication Memory Multicore Oblivious Open Space OS Parallelism Parallel Reduction Performance PHANTOM Processors Python Research Centers RISC-V SEJITS Tall-Skinny QR Technical Report Test generation