Abstract

â€”Reproducibility is the ability to obtain bitwise

identical results from different runs of the same program on

the same input data, regardless of the available computing

resources, or how they are scheduled. Recently, techniques have

been proposed to attain reproducibility for BLAS operations

[1], [2], [5], all of which rely on reproducibly computing

the floating-point sum and dot product. Nonetheless, a repro-

ducible BLAS library does not automatically translate into

a reproducible higher-level linear algebra library, especially

when communication is optimized. For instance, for the QR

factorization, conventional algorithms such as Householder

transformation or Gram-Schmidt process can be used to re-

producibly factorize a floating-point matrix by fixing the high-

level order of computation, for example column-by-column

from left to right, and by using reproducible versions of level-

1 BLAS operations such as dot product and 2-norm. In a

massively parallel environment, those algorithms have high

communication cost due to the need for synchronization after

each step. The Tall-Skinny QR algorithm obtains much better

performance in massively parallel environments by reducing

the number of messages by a factor of

n to O (log( P)) where P

is the processor count, by reducing the number of reduction

operations to O (1) . Those reduction operations however are

highly dependent on the network topology, in particular the

number of computing nodes, and therefore are difficult to

implement reproducibly and with reasonable performance. In

this paper we present a new technique to reproducibly compute

a QR factorization for a tall skinny matrix, which is based on

the Cholesky QR algorithm to attain reproducibility as well as

to improve communication cost, and the iterative refinement

technique to guarantee the accuracy of the computed results.

Our technique exhibits strong scalability in massively parallel

environments, and at the same time can provide results of

almost the same accuracy as the conventional Householder QR

algorithm unless the matrix is extremely badly conditioned,

in which case a warning can be given. Initial experimental

results in Matlab show that for not too ill-conditioned matrices

whose condition number is smaller than sqrt(1/e) where e is

the machine epsilon, our technique runs less than 4 times

slower than the built-in Matlab qr()

function, and always

computes numerically stable results in terms of column-wise

relative error.

## Publications

## Tags

2D
Accelerators
Algorithms
Architectures
Arrays
Big Data
Bootstrapping
C++
Cache Partitioning
Cancer
Careers
Chisel
Communication
Computer Architecture
CTF
DIABLO
Efficiency
Energy
FPGA
GAP
Gaussian Elimination
Genomics
GPU
Hardware
HLS
Lower Bounds
LU
Matrix Multiplication
Memory
Multicore
Oblivious
Open Space
OS
Parallelism
Parallel Reduction
Performance
PHANTOM
Processors
Python
Research Centers
RISC-V
SEJITS
Tall-Skinny QR
Technical Report
Test generation