We consider distributed memory algorithms for the all-pairs shortest paths (APSP) problem. Scaling
the APSP problem to high concurrencies requires both minimizing inter-processor communication as
well as maximizing temporal data locality. Our 2.5D APSP algorithm, which is based on the divideand-
conquer paradigm, satisfies both of these requirements: it can utilize the extra available memory to
perform asymptotically less communication, and it is rich in semiring matrix multiplications, which have
high temporal locality. We start by introducing a block-cyclic 2D (minimal memory) APSP algorithm.
With a careful choice of block-size, this algorithm achieves known communication lower-bounds on
latency and bandwidth. We extend this 2D block-cyclic algorithm to a 2.5D algorithm, which can use
c extra copies of data to reduce the bandwidth cost by a factor of c1=2, compared to its 2D counterpart.
However, the 2.5D algorithm increases the latency cost by c1=2. We provide a tighter lower bound on
latency, which dictates that the latency overhead is necessary to reduce bandwidth along the critical
path of execution. Our implementation achieves impressive performance and scaling to 24,576 cores
of a Cray XE6 supercomputer by utilizing well-tuned intra-node kernels within the distributed memory