
Page 1

Blake Foster, Sridhar Mahadevan, Rui Wang

A GPU-based Approximate SVD Algorithm

University of Massachusetts Amherst

Page 2

Introduction

Singular Value Decomposition (SVD): 𝑨 = 𝑼𝜮𝑽ᵀ

• 𝑨: 𝑚 × 𝑛 matrix (𝑚 ≥ 𝑛)

• 𝑼, 𝑽: orthogonal matrices

• 𝜮: diagonal matrix

The exact SVD can be computed in 𝑂(𝑚𝑛²) time.

Page 3

Introduction

Approximate SVD

• 𝑨: 𝑚 × 𝑛 matrix, whose intrinsic dimensionality is far less than 𝑛

It is usually sufficient to compute only the 𝑘 (≪ 𝑛) largest singular values and the corresponding singular vectors.

An approximate SVD can be computed in 𝑂(𝑚𝑛𝑘) time (compared to 𝑂(𝑚𝑛²) for the full SVD).
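For intuition, here is a minimal NumPy sketch (ours, not from the talk) showing that a matrix of intrinsic rank 𝑘 is recovered almost exactly by its rank-𝑘 truncation:

```python
# Illustration: a matrix with low intrinsic rank is captured almost
# exactly by keeping only its k largest singular triplets.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 300, 50
A = rng.random((m, k)) @ rng.random((k, n))        # intrinsic rank is k

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                  # best rank-k approximation
print(np.linalg.norm(A - A_k) / np.linalg.norm(A)) # ~ machine epsilon
```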


Page 5

Introduction

Sample-based Approximate SVD

• Construct a basis 𝑽 by directly sampling rows from 𝑨, or by taking linear combinations of its rows.

• An approximate SVD can then be extracted from the projection of 𝑨 onto 𝑽.

Sampling Methods

• Length-squared [Frieze et al. 2004, Deshpande et al. 2006]

• Random projection [Friedland et al. 2006, Sarlos et al. 2006, Rokhlin et al. 2009]

Page 6

Introduction

QUIC-SVD [Holmes et al. NIPS 2008]

• Sample-based approximate SVD.

• Progressively constructs a basis 𝑽 using a cosine tree.

• Results have controllable L2 error; the basis construction adapts to the matrix and the desired error.

Page 7

Our Goal

Map QUIC-SVD onto the GPU (Graphics Processing Unit)

• Powerful data-parallel computation platform.

• High computation density, high memory bandwidth.

• Relatively low-cost.

NVIDIA GTX 580: 512 cores, 1.6 TFLOPS, 3 GB memory, 200 GB/s bandwidth, $549.

Page 8

Related Work

OpenGL-based: rasterization, framebuffers, shaders

• [Galoppo et al. 2005]

• [Bondhugula et al. 2006]

CUDA-based:

• Matrix factorizations [Volkov et al. 2008]

• QR decomposition [Kerr et al. 2009]

• SVD [Lahabar et al. 2009]

• Commercial GPU linear algebra package [CULA 2010]

Page 9

Our Work

A CUDA implementation of QUIC-SVD.

Up to 6-7× speedup over an optimized CPU version.

A partitioned version that can process large, out-of-core matrices on a single GPU.

• The largest we tested were 22,000 × 22,000 matrices (double-precision floating point).

Page 10

Overview of QUIC-SVD

[Holmes et al. 2008]

(Figure: the basis 𝑽 is built from rows of 𝑨.)

Page 11

Overview of QUIC-SVD

Start from a root node that owns the entire matrix (all rows).

[Holmes et al. 2008]


Page 12

Overview of QUIC-SVD

Select a pivot row by importance-sampling rows according to their squared magnitudes.

[Holmes et al. 2008]

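A minimal sketch of this selection step, with NumPy standing in for the GPU kernel (the matrix and row set here are made up for illustration):

```python
# Length-squared sampling: draw a row index with probability
# proportional to the row's squared magnitude.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 32))             # stand-in matrix
rows = np.arange(100)                 # rows owned by the current node
w = np.sum(A[rows] ** 2, axis=1)      # squared row magnitudes
pivot = rng.choice(rows, p=w / w.sum())
```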

Page 13

Overview of QUIC-SVD

Compute the inner product of every row with the pivot row (this is why it is called a cosine tree).

[Holmes et al. 2008]


Page 14

Overview of QUIC-SVD

Based on the inner products, the rows are partitioned into two subsets: those whose inner products are closer to zero, and the rest. Each subset is represented by a child node.

[Holmes et al. 2008]


Page 15

Overview of QUIC-SVD

Compute the row mean (average) of each subset and add it to the current basis set, keeping the basis orthogonalized.

[Holmes et al. 2008]


Page 16

Overview of QUIC-SVD

Repeat until the estimated L2 error falls below a threshold. The basis set then provides a good low-rank approximation.

[Holmes et al. 2008]


Page 17

Overview of QUIC-SVD

Monte Carlo Error Estimation:

Input: any subset of rows 𝑨𝑠, the basis 𝑽, and a confidence parameter 𝛿 ∈ [0, 1].

Output: an estimated error bound such that, with probability at least 1 − 𝛿, the L2 error ‖𝑨 − 𝑨𝑽𝑽ᵀ‖² is at most the estimate.

Termination criterion: stop when the estimated error satisfies ‖𝑨 − 𝑨𝑽𝑽ᵀ‖² ≤ 𝜖‖𝑨‖².
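A hedged sketch of the estimator's shape (the helper name, sample count, and the omitted concentration term are our assumptions; the concentration term is what gives the paper's 1 − 𝛿 guarantee):

```python
# Sample rows, measure their squared residual after projection onto
# the basis V, and scale up to estimate the full residual. The real
# estimator adds a confidence margin so the bound holds w.p. 1 - delta.
import numpy as np

def mc_error_estimate(A, V, n_samples=64, rng=np.random.default_rng(0)):
    """A: (m, n) matrix; V: (k, n) array of orthonormal basis rows."""
    idx = rng.integers(0, A.shape[0], n_samples)
    S = A[idx]
    R = S - (S @ V.T) @ V            # residual of the sampled rows
    return np.sum(R * R) * A.shape[0] / n_samples
```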

Page 18

Overview of QUIC-SVD

5 Steps:

1. Select a leaf node with maximum estimated L2 error;

2. Split the node and create two child nodes;

3. Calculate the mean vector of each child and insert it into the basis set 𝑽 (replacing the parent's vector), keeping the basis orthogonalized;

4. Estimate the error of newly added child nodes;

5. Repeat steps 1-4 until the termination criterion is satisfied (sketched in code below).
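A simplified NumPy sketch of this loop (our reconstruction: the median-based split, exact residuals in place of Monte Carlo estimates, and append-only basis updates are simplifying assumptions):

```python
import heapq
import numpy as np

def orthonormal_append(V, v):
    for u in V:                          # modified Gram-Schmidt
        v = v - (v @ u) * u
    nrm = np.linalg.norm(v)
    return V if nrm < 1e-12 else np.vstack([V, v / nrm])

def node_error(A, rows, V):
    S = A[rows]
    R = S - (S @ V.T) @ V if len(V) else S
    return np.sum(R * R)                 # exact residual (stand-in for MC)

def quic_svd_basis(A, eps=1e-6, rng=np.random.default_rng(0)):
    frob2 = np.sum(A * A)
    V = np.zeros((0, A.shape[1]))
    heap, uid = [(-frob2, 0, np.arange(A.shape[0]))], 1
    while heap and -sum(e for e, _, _ in heap) > eps * frob2:
        _, _, rows = heapq.heappop(heap)        # 1. leaf with max error
        if len(rows) < 2:
            continue
        w = np.sum(A[rows] ** 2, axis=1)
        pivot = rng.choice(rows, p=w / w.sum()) # importance-sample pivot
        dots = np.abs(A[rows] @ A[pivot])
        mask = dots <= np.median(dots)          # 2. split the rows
        for child in (rows[mask], rows[~mask]):
            if len(child):                      # 3. add child mean to V
                V = orthonormal_append(V, A[child].mean(axis=0))
        for child in (rows[mask], rows[~mask]):
            if len(child):                      # 4. estimate child error
                heapq.heappush(heap, (-node_error(A, child, V), uid, child))
                uid += 1
    return V                                    # 5. loop until converged
```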

Page 19

GPU Implementation of QUIC-SVD

The computation cost is dominated by constructing the tree.

Most expensive steps:

• Computing vector inner products and row means

• Basis orthogonalization

Page 20

Parallelize Vector Inner Products and Row Means

Compute all inner products and row means in parallel.

For a large matrix, there are enough rows to keep all GPU threads busy.

An index array is used to point to the physical rows.
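A minimal sketch of the indexing idea, with NumPy standing in for the CUDA kernels:

```python
# The index array addresses the rows a node owns, so inner products and
# the row mean are computed without physically reordering A.
import numpy as np

A = np.random.rand(8, 4)
index = np.array([5, 1, 6])          # rows owned by the current node
pivot_row = A[index[0]]
dots = A[index] @ pivot_row          # all inner products at once
row_mean = A[index].mean(axis=0)     # mean of the node's rows
```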

Page 21

Parallelize Gram-Schmidt Orthogonalization

Classical Gram-Schmidt: assume 𝑣₁, 𝑣₂, ..., 𝑣ₙ are in the current basis set (i.e., they are orthonormal vectors); to add a new vector 𝑟:

𝑟′ = 𝑟 − proj(𝑟, 𝑣₁) − proj(𝑟, 𝑣₂) − ... − proj(𝑟, 𝑣ₙ), where proj(𝑟, 𝑣) = (𝑟 ⋅ 𝑣)𝑣

𝑣ₙ₊₁ = 𝑟′ / ‖𝑟′‖

This is easy to parallelize (each projection is computed independently), but it has poor numerical stability.

Page 22

Parallelize Gram-Schmidt Orthogonalization

Modified Gram-Schmidt:

𝑟₁ = 𝑟 − proj(𝑟, 𝑣₁)
𝑟₂ = 𝑟₁ − proj(𝑟₁, 𝑣₂)
...
𝑟ₙ = 𝑟ₙ₋₁ − proj(𝑟ₙ₋₁, 𝑣ₙ)
𝑣ₙ₊₁ = 𝑟ₙ / ‖𝑟ₙ‖

This has good numerical stability, but is hard to parallelize: each step depends on the result of the previous one.
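The two variants side by side, as a NumPy sketch (our code; V holds the current orthonormal basis as rows, r is the vector to add):

```python
import numpy as np

def classical_gs_step(V, r):
    # Every projection uses the ORIGINAL r, so all projections can be
    # computed in parallel, but rounding errors accumulate.
    r = r - (V @ r) @ V
    return r / np.linalg.norm(r)

def modified_gs_step(V, r):
    # Every projection uses the PARTIALLY REDUCED r, which is stable
    # but forces the loop to run sequentially.
    for u in V:
        r = r - (r @ u) * u
    return r / np.linalg.norm(r)
```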

Page 23

Parallelize Gram-Schmidt Orthogonalization

Our approach: Blocked Gram-Schmidt. Partition the basis 𝑢₁, 𝑢₂, ..., 𝑢ₙ into blocks of 𝑚 vectors, and orthogonalize the new vector against one block at a time. Within a block, all projections are computed independently (as in classical Gram-Schmidt); across blocks, the partially reduced vector is carried forward (as in modified Gram-Schmidt):

𝑣^(1) = 𝑣 − proj(𝑣, 𝑢₁) − proj(𝑣, 𝑢₂) − ... − proj(𝑣, 𝑢ₘ)
𝑣^(2) = 𝑣^(1) − proj(𝑣^(1), 𝑢ₘ₊₁) − ... − proj(𝑣^(1), 𝑢₂ₘ)
...
𝑢ₙ₊₁ = 𝑣^(𝑛/𝑚) / ‖𝑣^(𝑛/𝑚)‖

This hybrid approach combines the advantages of the previous two: it is numerically stable and GPU-friendly.
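A sketch of the hybrid step (the block size is a tunable; NumPy stands in for the GPU implementation):

```python
import numpy as np

def blocked_gs_step(V, r, block=8):
    # Modified Gram-Schmidt across blocks, classical within a block.
    for i in range(0, len(V), block):
        B = V[i:i + block]           # one block of basis vectors
        r = r - (B @ r) @ B          # independent projections (parallel)
    return r / np.linalg.norm(r)     # sequential only across blocks
```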


Page 25

Extract the Final Result

Input: matrix 𝑨, subspace basis 𝑽

Output: 𝑼, 𝜮, 𝑽: the best approximate SVD of 𝑨 within the subspace 𝑽.

Algorithm:

• Compute 𝑨𝑽, then (𝑨𝑽)ᵀ(𝑨𝑽), and its SVD 𝑼′𝜮′𝑽′ᵀ = (𝑨𝑽)ᵀ(𝑨𝑽);

• Let 𝑽 = 𝑽𝑽′, 𝜮 = (𝜮′)^(1/2), 𝑼 = (𝑨𝑽)𝑽′𝜮⁻¹;

• Return 𝑼, 𝜮, 𝑽.

Note that this is the SVD of a small 𝒌 × 𝒌 matrix.
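A NumPy sketch of this extraction (assuming 𝑽 is stored as an 𝑛 × 𝑘 matrix with orthonormal columns, and that 𝑨𝑽 has full column rank so 𝜮⁻¹ exists):

```python
import numpy as np

def extract_svd(A, V):
    AV = A @ V                                  # m x k
    Up, S2, VpT = np.linalg.svd(AV.T @ AV)      # SVD of a k x k matrix
    S = np.sqrt(S2)                             # Sigma = (Sigma')^(1/2)
    U = (AV @ VpT.T) / S                        # U = (A V) V' Sigma^(-1)
    return U, S, V @ VpT.T                      # V <- V V'
```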

Page 26

Partitioned QUIC-SVD

The advantage of the GPU is only apparent for large-scale problems, e.g., matrices larger than 10,000 × 10,000.

For dense matrices, problems at this scale quickly become out-of-core.

We modified our algorithm to use a matrix partitioning scheme, so that each partition can fit in GPU memory.
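A hypothetical sketch of the streaming pattern (a Python iterable of row blocks stands in for host-to-GPU transfers):

```python
# Row blocks are processed one at a time; per-block partial results
# are accumulated so A never needs to be resident all at once.
import numpy as np

def streamed_dots_and_mean(blocks, pivot_row):
    dots, row_sum, count = [], 0.0, 0
    for B in blocks:                  # each block fits in GPU memory
        dots.append(B @ pivot_row)    # inner products for this block
        row_sum = row_sum + B.sum(axis=0)
        count += len(B)
    return np.concatenate(dots), row_sum / count
```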

Page 27

Partitioned QUIC-SVD

Partition the matrix into fixed-size blocks

(Figure: 𝑨 partitioned row-wise into blocks 𝑨₁, 𝑨₂, ..., 𝑨ₛ.)



Page 30

Partitioned QUIC-SVD

The termination criterion is still guaranteed to hold.

Page 31

Implementation Details

Main implementation done in CUDA.

The CULA library is used to extract the final SVD (this involves a 𝑘 × 𝑘 matrix, where 𝑘 ≪ 𝑛).

Selection of the node to split is done with a priority queue maintained on the CPU side.

Source code will be made available on our site: http://graphics.cs.umass.edu

Page 32

Performance

Test environment:

• CPU: Intel Core i7, 2.66 GHz (8 hardware threads), 6 GB memory

• GPU: NVIDIA GTX 480

Test data:

• Given a rank 𝑘 and matrix size 𝑛, we generate an 𝑛 × 𝑘 left matrix and a 𝑘 × 𝑛 right matrix, both filled with random numbers; their product gives a rank-𝑘 test matrix (sketched below).
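The construction, sketched in NumPy:

```python
# Build an n x n test matrix of exact rank k from two random factors.
import numpy as np

rng = np.random.default_rng(0)
n, k = 4096, 256
A = rng.random((n, k)) @ rng.random((k, n))   # rank-k n x n test matrix
```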

Result accuracy:

• Both CPU and GPU versions use double-precision floats; termination error 𝜖 = 10⁻¹², Monte Carlo estimation 𝛿 = 10⁻¹².

Page 33

Performance

Comparisons:

1. Matlab’s svds;

2. CPU-based QUIC-SVD (optimized, multi-threaded, implemented using the Intel Math Kernel Library);

3. GPU-based QUIC-SVD;

4. Tygert SVD (an approximate SVD algorithm based on random projection, implemented in Matlab).

Page 34

Performance – Running Time

Page 35

Performance – Speedup Factors

Page 36

Performance – Speedup Factors

Page 37

Performance – CPU vs. GPU QUIC-SVD

(Charts: running time and speedup.)

Page 38

Performance – GPU QUIC-SVD vs. Tygert SVD

(Charts: running time and speedup.)

Page 39

Integration with Manifold Alignment Framework

Page 40

Conclusion

A fast GPU-based implementation of an approximate SVD algorithm.

A reasonable speedup (up to 6-7×) over an optimized CPU implementation, as well as over Tygert SVD.

Our partitioned version allows processing out-of-core matrices.

Page 41

Future Work

Support for sparse matrices.

Applications in solving large graph Laplacians.

Components from this project can become building blocks for other algorithms: random projection trees, diffusion wavelets, etc.

Page 42

Acknowledgements

Alexander Gray

PPAM reviewers

NSF Grant FODAVA-1025120