Einladung zum Gastvortrag am
Freitag, 16. Juni 2016,
16:00 Uhr, T01
Institutsgebäude Jakob-Haringer-Str. 2, Itzling
von Dr. Marc Gates
(Innovative Computing Laboratory, Electrical Engineering and Computer Science Department University of
Tennessee Knoxville, Tennessee, USA)
Evolution of the Singular Value
Decomposition Towards Exascale
The computation of the Singular Value Decomposition (SVD) has a long history, with many improvements over
the years, both in implementations and algorithmically. In this talk, we survey the evolution of SVD algorithms for
dense matrices, discussing the motivation and performance impact of changes. As we move towards exascale
computing, improving the algorithm to remove bottlenecks---particularly associated with memory bandwidth
limitations---becomes critically important.
One of the first efficient and stable SVD algorithms was QR iteration, by Golub and Reinsch, written in Algol60.
This was subsequently translated to Fortran in the EISPACK library, and later more efficiently implemented in the
LINPACK library using Level~1 BLAS for the vector machines of that day.
To address the emergence of cache-based memory hierarchies, new algorithms were developed to block
operations together using Level~3 BLAS, resulting in the LAPACK library. Special cases for tall and wide matrices
also significantly improved performance. Use of multi-threaded BLAS enabled parallelism within a shared-memory
machine. To take advantage of distributed memory machines, ScaLAPACK was introduced, based on the
LAPACK design but using the parallel PBLAS, built on top of message passing. Algorithmically, the divide
and conquer method was introduced to replace the traditional QR iteration method, significantly speeding up
computation of singular vectors. With the emergence of accelerators, such as GPUs and the Intel Xeon Phi,
implementations sought to offload the computationally intensive workload to the accelerator.
Still, methods remained memory bound, limited by matrix-vector multiplies in the initial reduction to bidiagonal
form. As the gap between memory bandwidth and computation speed increased, this became more of a
bottleneck. In recent years, two stage algorithms have sought to alleviate this limitation by first reducing the matrix
to a band form using Level~3 BLAS, then reducing to bidiagonal form using a bulge-chasing technique.
As we see, SVD algorithms have continually evolved to address changes in computer hardware, as well as
advancements in mathematics. This talk will look at the improvements of each implementation, comparing
performance on a common, modern, shared-memory machine and a distributed memory platform. We
show that algorithmic and implementation improvements have increased the speed of the SVD from 100--1000 times, while using 25 times less energy.