Write-Avoiding Algorithms

Communication, i.e., moving data, either between levels of a memory hierarchy or between processors over a network, is much more expensive (in time or energy) than arithmetic. So there has been much recent work on designing algorithms that minimize communication, when possible attaining lower bounds on the total number of reads and writes.

However, most of this previous work does not distinguish between the costs of reads and writes. Writes can be much more expensive than reads in some current and emerging technologies. The first example is nonvolatile memory, such as Flash and Phase Change Memory. Second, in cloud computing frameworks like MapReduce, Hadoop, and Spark, intermediate results may be written to disk for reliability purposes, whereas read-only data may be kept in DRAM. Third, in a shared-memory environment, writes may result in more coherency traffic over a shared bus than reads.

This motivates us to first ask whether there are lower bounds on the number of writes that certain algorithms must perform, and when these bounds are asymptotically smaller than bounds on the sum of reads and writes together. When these smaller lower bounds exist, we then ask when they are attainable; we call such algorithms "write-avoiding" (WA), to distinguish them from "communication-avoiding" (CA) algorithms, which only minimize the sum of reads and writes. We identify a number of cases in linear algebra and direct N-body methods where known CA algorithms are also WA (some are and some aren't). We also identify classes of algorithms, including Strassen's matrix multiplication, the Cooley-Tukey FFT, and cache-oblivious algorithms for classical linear algebra, where a WA algorithm cannot exist: the number of writes is unavoidably high, within a constant factor of the total number of reads and writes.
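For a concrete sense of when the write lower bound can be asymptotically smaller, consider classical O(n^3) matrix multiplication on a two-level hierarchy with fast memory of size M; the following are the standard bounds from the CA literature, stated here for illustration:

\[
\#\text{reads} + \#\text{writes} \;=\; \Omega\!\left(\frac{n^3}{\sqrt{M}}\right),
\qquad
\#\text{writes} \;=\; \Omega\!\left(n^2\right).
\]

The first is the classical Hong-Kung bound on total data movement, while the second merely reflects that the n^2 entries of the output must be written at least once. Whenever n^2 > M, so the matrices do not fit in fast memory, the write bound is asymptotically smaller, and a WA algorithm in this setting is one attaining O(n^2) writes while still moving O(n^3/\sqrt{M}) words in total.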

We explore the interaction of WA algorithms with cache replacement policies and argue that the Least Recently Used (LRU) policy works well with the WA algorithms in this paper. We provide empirical hardware counter measurements from Intel's Nehalem-EX microarchitecture to validate our theory.

In the parallel case, for classical linear algebra, we show that it is impossible to attain lower bounds both on interprocessor communication and on writes to local memory, but either one is attainable by itself. Finally, we discuss WA algorithms for sparse iterative linear algebra. We show that, for sequential communication-avoiding Krylov subspace methods, which can perform s iterations of the conventional algorithm for the communication cost of one classical iteration, it is possible to reduce the number of writes by a factor of Θ(s) by interleaving a matrix powers computation with orthogonalization operations in a blockwise fashion.
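As a rough sketch of this interleaving (illustrative only, and not the algorithm of this report: a real communication-avoiding matrix powers kernel exploits the sparsity structure of A, whereas this toy version is dense, and all names and parameters, such as the block size b, are assumptions made for the example):

# Minimal sketch (illustrative, not the report's algorithm): interleave a
# matrix powers computation with blockwise orthogonalization, so that each
# basis vector is finalized soon after it is produced instead of writing the
# whole s-step Krylov basis and then rewriting it in a separate pass.
import numpy as np

def interleaved_krylov_basis(A, v, s, b):
    """Orthonormal basis of span{v, Av, ..., A^s v}, built b vectors at a time."""
    n = A.shape[0]
    Q = np.empty((n, s + 1))
    Q[:, 0] = v / np.linalg.norm(v)
    k = 1
    while k <= s:
        hi = min(k + b, s + 1)
        # Matrix powers step: generate the next block of Krylov vectors
        # (normalized as we go, purely for numerical scaling).
        for j in range(k, hi):
            w = A @ Q[:, j - 1]
            Q[:, j] = w / np.linalg.norm(w)
        # Blockwise orthogonalization (modified Gram-Schmidt) of the new
        # block against all previously finalized vectors and itself.
        for j in range(k, hi):
            for i in range(j):
                Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]
            Q[:, j] /= np.linalg.norm(Q[:, j])
        k = hi
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
Q = interleaved_krylov_basis(A, rng.standard_normal(50), s=8, b=3)
print(np.max(np.abs(Q.T @ Q - np.eye(9))))  # orthogonality error, near machine precision

The schedule is what matters for writes: each basis vector is finalized, and hence written, shortly after it is produced, rather than the full s-step basis being written once by the powers computation and then rewritten by a separate orthogonalization pass.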
