Communication, i.e., moving data, either between levels of a memory hierarchy or between
processors over a network, is much more expensive (in time or energy) than arithmetic. So there
has been much recent work on designing algorithms that minimize communication, when possible
attaining lower bounds on the total number of reads and writes.
However, most of this previous work does not distinguish between the costs of reads and
writes. Writes can be much more expensive than reads in some current and emerging technologies.
The first example is nonvolatile memory, such as Flash and Phase Change Memory.
Second, in cloud computing frameworks like MapReduce, Hadoop, and Spark, intermediate re-
sults may be written to disk for reliability purposes, whereas read-only data may be kept in
DRAM. Third, in a shared memory environment, writes may result in more coherency traffic
over a shared bus than reads.
This motivates us to first ask whether there are lower bounds on the number of writes
that certain algorithms must perform, and when these bounds are asymptotically smaller than
bounds on the sum of reads and writes together. When these smaller lower bounds exist, we
then ask when they are attainable; we call such algorithms "write-avoiding (WA)", to distinguish
them from "communication-avoiding (CA)" algorithms, which only minimize the sum of reads
and writes. We identify a number of cases in linear algebra and direct N-body methods where
known CA algorithms are also WA (some are, and some are not). We also identify classes of
algorithms, including Strassen's matrix multiplication, the Cooley-Tukey FFT, and cache-oblivious
algorithms for classical linear algebra, for which no WA algorithm can exist: the number of
writes is unavoidably high, within a constant factor of the total number of reads and writes.
We explore the interaction of WA algorithms with cache replacement policies and argue that
the Least Recently Used (LRU) policy works well with the WA algorithms in this paper. We
provide empirical hardware counter measurements from Intel’s Nehalem-EX microarchitecture
to validate our theory. In the parallel case, for classical linear algebra, we show that it is
impossible to attain lower bounds both on interprocessor communication and on writes to local
memory, but either one is attainable by itself. Finally, we discuss WA algorithms for sparse
iterative linear algebra. We show that, for sequential communication-avoiding Krylov subspace
methods, which can perform s iterations of the conventional algorithm for the communication
cost of 1 classical iteration, it is possible to reduce the number of writes by a factor of Θ(s)
by interleaving a matrix powers computation with orthogonalization operations in a blockwise
fashion.
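To give intuition for the distinction between CA and WA algorithms, the following sketch (our own illustration, not taken from the paper; the function name and block-counting model are assumptions) tallies the slow-memory traffic of a classical blocked matrix multiplication. Because each block of C is held in fast memory while all its updates are accumulated, it is written back to slow memory exactly once, so writes total n^2 while reads grow as n^3/b:

```python
def blocked_matmul_traffic(n, b):
    # Model slow-memory traffic for classical blocked C = A @ B with
    # n x n matrices and b x b blocks. Each C block stays in fast
    # memory while its nb partial products are accumulated, so it is
    # written back to slow memory exactly once.
    assert n % b == 0, "block size must divide matrix dimension"
    nb = n // b
    reads = writes = 0
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                reads += 2 * b * b  # read A[i][k] and B[k][j] blocks
            writes += b * b         # write C[i][j] block back once
    return reads, writes

reads, writes = blocked_matmul_traffic(n=64, b=8)
# writes = n^2 = 4096, while reads = 2*n^3/b = 65536: writes are
# asymptotically fewer than reads, the hallmark of a WA algorithm.
```

Under this simple model, shrinking the write count relative to the read count is exactly what a WA algorithm achieves; a CA algorithm only bounds the sum of the two.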