Invitation to a guest lecture on


Friday, June 16, 2016,

4:00 p.m., Room T01

Institute building, Jakob-Haringer-Str. 2, Itzling


by Dr. Marc Gates

(Innovative Computing Laboratory, Electrical Engineering and Computer Science Department, University of Tennessee, Knoxville, Tennessee, USA)


on the topic:

Evolution of the Singular Value Decomposition Towards Exascale



The computation of the Singular Value Decomposition (SVD) has a long history, with many improvements over the years, both in implementation and in algorithms. In this talk, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation for and performance impact of each change. As we move towards exascale computing, improving the algorithms to remove bottlenecks, particularly those associated with memory bandwidth limitations, becomes critically important.
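As a concrete reference point for the factorization under discussion, the following minimal sketch computes a thin SVD of a small dense matrix with NumPy (which calls LAPACK under the hood) and checks the reconstruction A = U diag(s) Vᵀ; the matrix itself is an arbitrary illustrative example, not from the talk:

```python
import numpy as np

# Arbitrary small dense matrix for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: U is 5x3, s holds the 3 singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values are non-negative and returned in descending order.
assert np.all(s >= 0) and np.all(s[:-1] >= s[1:])

# Reconstruction error should be at machine-precision level.
err = np.linalg.norm(A - U @ (s[:, None] * Vt))
print(err < 1e-12)
```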


One of the first efficient and stable SVD algorithms was QR iteration, by Golub and Reinsch, written in Algol 60. This was subsequently translated to Fortran in the EISPACK library, and later implemented more efficiently in the LINPACK library using Level 1 BLAS for the vector machines of that day.


To address the emergence of cache-based memory hierarchies, new algorithms were developed that block operations together using Level 3 BLAS, resulting in the LAPACK library. Special cases for tall and wide matrices also significantly improved performance. Use of multi-threaded BLAS enabled parallelism within a shared-memory machine. To take advantage of distributed-memory machines, ScaLAPACK was introduced, based on the LAPACK design but using the parallel PBLAS, built on top of message passing. Algorithmically, the divide-and-conquer method was introduced to replace the traditional QR iteration method, significantly speeding up the computation of singular vectors. With the emergence of accelerators, such as GPUs and the Intel Xeon Phi, implementations sought to offload the computationally intensive workload to the accelerator.
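Both of the LAPACK drivers mentioned above are exposed through SciPy, so the algorithmic difference can be exercised directly: `lapack_driver='gesvd'` selects the QR-iteration routine and `'gesdd'` the divide-and-conquer routine. A small sketch, assuming a SciPy build with both drivers available; the matrix size is an arbitrary choice for illustration:

```python
import numpy as np
from scipy.linalg import svd

# Arbitrary dense test matrix.
A = np.random.default_rng(1).standard_normal((200, 150))

# 'gesvd' = QR iteration; 'gesdd' = divide and conquer.
U1, s1, Vt1 = svd(A, lapack_driver='gesvd')
U2, s2, Vt2 = svd(A, lapack_driver='gesdd')

# Both drivers agree on the singular values to machine precision;
# divide and conquer is typically much faster when singular
# vectors are requested.
assert np.allclose(s1, s2)
```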


Still, these methods remained memory bound, limited by the matrix-vector multiplies in the initial reduction to bidiagonal form. As the gap between memory bandwidth and computation speed widened, this became more of a bottleneck. In recent years, two-stage algorithms have sought to alleviate this limitation by first reducing the matrix to band form using Level 3 BLAS, then reducing it to bidiagonal form using a bulge-chasing technique.


As we see, SVD algorithms have continually evolved to address changes in computer hardware, as well as advances in mathematics. This talk will examine the improvements in each implementation, comparing performance on a common, modern shared-memory machine and on a distributed-memory platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by 100 to 1000 times, while using 25 times less energy.