Abstract

Reproducibility is the ability to obtain bitwise identical results from different runs of the same program on the same input data, regardless of the available computing resources or how they are scheduled. Recently, techniques have been proposed to attain reproducibility for BLAS operations [1], [2], [5], all of which rely on reproducibly computing the floating-point sum and dot product. Nonetheless, a reproducible BLAS library does not automatically translate into a reproducible higher-level linear algebra library, especially when communication is optimized. For instance, for the QR factorization, conventional algorithms such as Householder transformations or the Gram-Schmidt process can reproducibly factorize a floating-point matrix by fixing the high-level order of computation, for example column by column from left to right, and by using reproducible versions of level-1 BLAS operations such as the dot product and the 2-norm. In a massively parallel environment, however, those algorithms have a high communication cost due to the need for synchronization after each step. The Tall-Skinny QR (TSQR) algorithm obtains much better performance in massively parallel environments by reducing the number of messages by a factor of n, to O(log(P)) where P is the processor count, and by reducing the number of reduction operations to O(1). Those reduction operations, however, depend heavily on the network topology, in particular on the number of computing nodes, and are therefore difficult to implement both reproducibly and with reasonable performance. In this paper we present a new technique to reproducibly compute the QR factorization of a tall-skinny matrix. It is based on the Cholesky QR algorithm, to attain reproducibility as well as to reduce communication cost, combined with iterative refinement to guarantee the accuracy of the computed results. Our technique exhibits strong scalability in massively parallel environments and at the same time provides results of almost the same accuracy as the conventional Householder QR algorithm, unless the matrix is extremely badly conditioned, in which case a warning can be given. Initial experimental results in Matlab show that, for matrices that are not too ill-conditioned, i.e., whose condition number is smaller than sqrt(1/ε) where ε is the machine epsilon, our technique runs less than 4 times slower than the built-in Matlab qr() function and always computes numerically stable results in terms of column-wise relative error.
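To make the approach concrete, here is a minimal Matlab sketch of the Cholesky QR step with one round of refinement (in the spirit of CholeskyQR2, which the paper's combination of Cholesky QR and iterative refinement resembles). This is an illustration under stated assumptions, not the paper's implementation: the function names cholqr2 and cholqr are hypothetical, and the reproducible summation required when forming the Gram matrix A'*A is not shown, so in plain arithmetic the sketch is deterministic only for a fixed reduction order.

    % Hypothetical sketch: Cholesky QR with one refinement pass (CholeskyQR2-style).
    % Assumes A is tall and skinny (m >> n) with cond(A) < sqrt(1/eps); the
    % reproducible summation needed inside A'*A is omitted for brevity.
    function [Q, R] = cholqr2(A)
        [Q1, R1] = cholqr(A);   % first pass: Q1 may lose orthogonality for ill-conditioned A
        [Q,  R2] = cholqr(Q1);  % refinement pass: re-orthogonalizes Q1
        R = R2 * R1;            % combine the two triangular factors so that A = Q*R
    end

    function [Q, R] = cholqr(A)
        G = A' * A;   % n-by-n Gram matrix: the only reduction over the tall dimension
        R = chol(G);  % upper-triangular factor with G = R'*R; errors if G is not SPD
        Q = A / R;    % triangular solve Q = A*inv(R); row-parallel, no communication
    end

Note that chol() fails when A'*A is not numerically positive definite, which occurs roughly when cond(A) exceeds sqrt(1/ε), so a failed factorization can itself serve as the warning for extremely badly conditioned matrices. In a distributed setting, each processor would form its local Gram matrix and the pieces would be combined with a single fixed-order reduction; it is this single reduction, rather than TSQR's topology-dependent reduction tree, that makes the result independent of the processor count.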