We consider the problem of how to design and
implement communication-efficient versions of parallel support
vector machines, a widely used classifier in statistical machine
learning, for distributed memory clusters and supercomputers.
The main computational bottleneck is the training phase, in
which a statistical model is built from an input data set. Prior
to our study, the parallel isoefficiency of a state-of-the-art
implementation scaled as W = Ω(P³), where W is the problem
size and P the number of processors; this scaling is worse
than even that of a one-dimensional block-row dense matrix-vector
multiplication, which has W = Ω(P²).
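
For readers unfamiliar with the metric, the isoefficiency function relates problem size to processor count at fixed parallel efficiency; a sketch of the standard definition (general background, not this paper's specific derivation) is:

```latex
% Parallel efficiency E with serial work W and total parallel
% overhead T_o(W, P) across P processors:
E = \frac{W}{W + T_o(W, P)}
% Holding E fixed and solving for W yields the isoefficiency relation:
W = \frac{E}{1 - E}\, T_o(W, P)
% Thus W = \Omega(P^3) means the problem size must grow at least
% cubically in P to keep efficiency constant; a smaller exponent
% indicates better scalability.
```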
This study considers a series of algorithmic refinements,
leading ultimately to a Communication-Avoiding SVM (CASVM)
method that improves the isoefficiency to nearly W =
Ω(P). We evaluate these methods on 96 to 1536 processors,
and show speedups of 3× to 16× (7× on average)
over Dis-SMO, and a 95% weak-scaling efficiency on six real-world
datasets, with only modest losses in overall classification
accuracy. The source code can be downloaded at [1].