Reproducible Tall-Skinny QR

Abstract
—Reproducibility is the ability to obtain bitwise
identical results from different runs of the same program on
the same input data, regardless of the available computing
resources, or how they are scheduled. Recently, techniques have
been proposed to attain reproducibility for BLAS operations
[1], [2], [5], all of which rely on reproducibly computing
the floating-point sum and dot product. Nonetheless, a repro-
ducible BLAS library does not automatically translate into
a reproducible higher-level linear algebra library, especially
when communication is optimized. For instance, for the QR
factorization, conventional algorithms such as Householder
transformation or Gram-Schmidt process can be used to re-
producibly factorize a floating-point matrix by fixing the high-
level order of computation, for example column-by-column
from left to right, and by using reproducible versions of level-
1 BLAS operations such as dot product and 2-norm. In a
massively parallel environment, those algorithms have high
communication cost due to the need for synchronization after
each step. The Tall-Skinny QR algorithm obtains much better
performance in massively parallel environments by reducing
the number of messages by a factor of
n to O (log( P)) where P
is the processor count, by reducing the number of reduction
operations to O (1) . Those reduction operations however are
highly dependent on the network topology, in particular the
number of computing nodes, and therefore are difficult to
implement reproducibly and with reasonable performance. In
this paper we present a new technique to reproducibly compute
a QR factorization for a tall skinny matrix, which is based on
the Cholesky QR algorithm to attain reproducibility as well as
to improve communication cost, and the iterative refinement
technique to guarantee the accuracy of the computed results.
Our technique exhibits strong scalability in massively parallel
environments, and at the same time can provide results of
almost the same accuracy as the conventional Householder QR
algorithm unless the matrix is extremely badly conditioned,
in which case a warning can be given. Initial experimental
results in Matlab show that for not too ill-conditioned matrices
whose condition number is smaller than sqrt(1/e) where e is
the machine epsilon, our technique runs less than 4 times
slower than the built-in Matlab qr()
function, and always
computes numerically stable results in terms of column-wise
relative error.

Paper

Publication Date: June 2015
Conference: 22nd IEEE Symp. on Computer Arithmetic (ARITH'22), IEEE Computer Society, Jun 22-24, 2015