
Page 1

Blake Foster, Sridhar Mahadevan, Rui Wang

A GPU-based Approximate SVD Algorithm

University of Massachusetts Amherst

Page 2

Introduction

Singular Value Decomposition (SVD): 𝑨 = 𝑼𝜮𝑽ᵀ

• 𝑨: 𝑚 × 𝑛 matrix (𝑚 ≥ 𝑛)

• 𝑼, 𝑽: orthogonal matrices

• 𝜮: diagonal matrix

The exact SVD can be computed in 𝑂(𝑚𝑛²) time.

Page 3

Introduction

Approximate SVD

• 𝑨: 𝑚 × 𝑛 matrix, whose intrinsic dimensionality is far less than 𝑛

It is usually sufficient to compute only the 𝑘 (≪ 𝑛) largest singular values and the corresponding singular vectors.

An approximate SVD can be computed in 𝑂(𝑚𝑛𝑘) time (compared to 𝑂(𝑚𝑛²) for the full SVD).
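For intuition, here is a minimal NumPy sketch (ours, not from the talk) showing that a matrix of intrinsic rank 𝑘 is recovered almost exactly by its rank-𝑘 truncation:

```python
# Illustration: a matrix with low intrinsic rank is captured almost
# exactly by keeping only its k largest singular triplets.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 300, 50
A = rng.random((m, k)) @ rng.random((k, n))        # intrinsic rank is k

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                  # best rank-k approximation
print(np.linalg.norm(A - A_k) / np.linalg.norm(A)) # ~ machine epsilon
```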


Page 5

Introduction

Sample-based Approximate SVD

• Construct a basis 𝑽 by directly sampling rows from 𝑨, or by taking linear combinations of its rows.

• An approximate SVD can then be extracted from the projection of 𝑨 onto 𝑽.

Sampling Methods

• Length-squared [Frieze et al. 2004, Deshpande et al. 2006]

• Random projection [Friedland et al. 2006, Sarlos et al. 2006, Rokhlin et al. 2009]

Page 6

Introduction

QUIC-SVD [Holmes et al. NIPS 2008]

• Sample-based approximate SVD.

• Progressively constructs a basis 𝑽 using a cosine tree.

• Results have controllable L2 error; the basis construction adapts to the matrix and the desired error.

Page 7

Our Goal

Map QUIC-SVD onto the GPU (Graphics Processing Unit)

• Powerful data-parallel computation platform.

• High computation density, high memory bandwidth.

• Relatively low-cost.

NVIDIA GTX 580: 512 cores, 1.6 TFLOPS, 3 GB memory, 200 GB/s bandwidth, $549.

Page 8

Related Work

OpenGL-based: rasterization, framebuffers, shaders

• [Galoppo et al. 2005]

• [Bondhugula et al. 2006]

CUDA-based:

• Matrix factorizations [Volkov et al. 2008]

• QR decomposition [Kerr et al. 2009]

• SVD [Lahabar et al. 2009]

• Commercial GPU linear algebra package [CULA 2010]

Page 9

Our Work

A CUDA implementation of QUIC-SVD.

Up to 6-7× speedup over an optimized CPU version.

A partitioned version that can process large, out-of-core matrices on a single GPU.

• The largest we tested were 22,000 × 22,000 matrices (double-precision floating point).

Page 10

Overview of QUIC-SVD

[Holmes et al. 2008]

(Figure: the basis 𝑽 is built from rows of 𝑨.)

Page 11

Overview of QUIC-SVD

Start from a root node that owns the entire matrix (all rows).

[Holmes et al. 2008]


Page 12

Overview of QUIC-SVD

Select a pivot row by importance-sampling rows according to their squared magnitudes.

[Holmes et al. 2008]

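A minimal sketch of this selection step, with NumPy standing in for the GPU kernel (the matrix and row set here are made up for illustration):

```python
# Length-squared sampling: draw a row index with probability
# proportional to the row's squared magnitude.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 32))             # stand-in matrix
rows = np.arange(100)                 # rows owned by the current node
w = np.sum(A[rows] ** 2, axis=1)      # squared row magnitudes
pivot = rng.choice(rows, p=w / w.sum())
```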

Page 13

Overview of QUIC-SVD

Compute the inner product of every row with the pivot row (this is why it is called a cosine tree).

[Holmes et al. 2008]


Page 14

Overview of QUIC-SVD

Based on the inner products, the rows are partitioned into two subsets: those whose inner products are closer to zero, and the rest. Each subset is represented by a child node.

[Holmes et al. 2008]


Page 15

Overview of QUIC-SVD

Compute the row mean (average) of each subset and add it to the current basis set, keeping the basis orthogonalized.

[Holmes et al. 2008]


Page 16

Overview of QUIC-SVD

Repeat until the estimated L2 error falls below a threshold. The basis set then provides a good low-rank approximation.

[Holmes et al. 2008]


Page 17

Overview of QUIC-SVD

Monte Carlo Error Estimation:

Input: any subset of rows 𝑨𝑠, the basis 𝑽, and a confidence parameter 𝛿 ∈ [0, 1].

Output: an estimated error bound such that, with probability at least 1 − 𝛿, the L2 error ‖𝑨 − 𝑨𝑽𝑽ᵀ‖² is at most the estimate.

Termination criterion: stop when the estimated error satisfies ‖𝑨 − 𝑨𝑽𝑽ᵀ‖² ≤ 𝜖‖𝑨‖².
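A hedged sketch of the estimator's shape (the helper name, sample count, and the omitted concentration term are our assumptions; the concentration term is what gives the paper's 1 − 𝛿 guarantee):

```python
# Sample rows, measure their squared residual after projection onto
# the basis V, and scale up to estimate the full residual. The real
# estimator adds a confidence margin so the bound holds w.p. 1 - delta.
import numpy as np

def mc_error_estimate(A, V, n_samples=64, rng=np.random.default_rng(0)):
    """A: (m, n) matrix; V: (k, n) array of orthonormal basis rows."""
    idx = rng.integers(0, A.shape[0], n_samples)
    S = A[idx]
    R = S - (S @ V.T) @ V            # residual of the sampled rows
    return np.sum(R * R) * A.shape[0] / n_samples
```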

Page 18

Overview of QUIC-SVD

5 Steps:

1. Select a leaf node with maximum estimated L2 error;

2. Split the node and create two child nodes;

3. Calculate the mean vector of each child and insert it into the basis set 𝑽 (replacing the parent's vector), keeping the basis orthogonalized;

4. Estimate the error of newly added child nodes;

5. Repeat steps 1-4 until the termination criterion is satisfied (sketched in code below).
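A simplified NumPy sketch of this loop (our reconstruction: the median-based split, exact residuals in place of Monte Carlo estimates, and append-only basis updates are simplifying assumptions):

```python
import heapq
import numpy as np

def orthonormal_append(V, v):
    for u in V:                          # modified Gram-Schmidt
        v = v - (v @ u) * u
    nrm = np.linalg.norm(v)
    return V if nrm < 1e-12 else np.vstack([V, v / nrm])

def node_error(A, rows, V):
    S = A[rows]
    R = S - (S @ V.T) @ V if len(V) else S
    return np.sum(R * R)                 # exact residual (stand-in for MC)

def quic_svd_basis(A, eps=1e-6, rng=np.random.default_rng(0)):
    frob2 = np.sum(A * A)
    V = np.zeros((0, A.shape[1]))
    heap, uid = [(-frob2, 0, np.arange(A.shape[0]))], 1
    while heap and -sum(e for e, _, _ in heap) > eps * frob2:
        _, _, rows = heapq.heappop(heap)        # 1. leaf with max error
        if len(rows) < 2:
            continue
        w = np.sum(A[rows] ** 2, axis=1)
        pivot = rng.choice(rows, p=w / w.sum()) # importance-sample pivot
        dots = np.abs(A[rows] @ A[pivot])
        mask = dots <= np.median(dots)          # 2. split the rows
        for child in (rows[mask], rows[~mask]):
            if len(child):                      # 3. add child mean to V
                V = orthonormal_append(V, A[child].mean(axis=0))
        for child in (rows[mask], rows[~mask]):
            if len(child):                      # 4. estimate child error
                heapq.heappush(heap, (-node_error(A, child, V), uid, child))
                uid += 1
    return V                                    # 5. loop until converged
```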

Page 19

GPU Implementation of QUIC-SVD

The computation cost is dominated by constructing the tree.

Most expensive steps:

• Computing vector inner products and row means

• Basis orthogonalization

Page 20

Parallelize Vector Inner Products and Row Means

Compute all inner products and row means in parallel.

For a large matrix, there are enough rows to keep all GPU threads busy.

An index array is used to point to the physical rows.
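A minimal sketch of the indexing idea, with NumPy standing in for the CUDA kernels:

```python
# The index array addresses the rows a node owns, so inner products and
# the row mean are computed without physically reordering A.
import numpy as np

A = np.random.rand(8, 4)
index = np.array([5, 1, 6])          # rows owned by the current node
pivot_row = A[index[0]]
dots = A[index] @ pivot_row          # all inner products at once
row_mean = A[index].mean(axis=0)     # mean of the node's rows
```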

Page 21

Parallelize Gram-Schmidt Orthogonalization

Classical Gram-Schmidt: assume 𝑣₁, 𝑣₂, ..., 𝑣ₙ are in the current basis set (i.e., they are orthonormal vectors); to add a new vector 𝑟:

𝑟′ = 𝑟 − proj(𝑟, 𝑣₁) − proj(𝑟, 𝑣₂) − ... − proj(𝑟, 𝑣ₙ), where proj(𝑟, 𝑣) = (𝑟 ⋅ 𝑣)𝑣

𝑣ₙ₊₁ = 𝑟′ / ‖𝑟′‖

This is easy to parallelize (each projection is computed independently), but it has poor numerical stability.

Page 22

Parallelize Gram-Schmidt Orthogonalization

Modified Gram-Schmidt:

𝑟₁ = 𝑟 − proj(𝑟, 𝑣₁)
𝑟₂ = 𝑟₁ − proj(𝑟₁, 𝑣₂)
...
𝑟ₙ = 𝑟ₙ₋₁ − proj(𝑟ₙ₋₁, 𝑣ₙ)
𝑣ₙ₊₁ = 𝑟ₙ / ‖𝑟ₙ‖

This has good numerical stability, but is hard to parallelize: each step depends on the result of the previous one.
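The two variants side by side, as a NumPy sketch (our code; V holds the current orthonormal basis as rows, r is the vector to add):

```python
import numpy as np

def classical_gs_step(V, r):
    # Every projection uses the ORIGINAL r, so all projections can be
    # computed in parallel, but rounding errors accumulate.
    r = r - (V @ r) @ V
    return r / np.linalg.norm(r)

def modified_gs_step(V, r):
    # Every projection uses the PARTIALLY REDUCED r, which is stable
    # but forces the loop to run sequentially.
    for u in V:
        r = r - (r @ u) * u
    return r / np.linalg.norm(r)
```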

Page 23

Parallelize Gram-Schmidt Orthogonalization

Our approach: Blocked Gram-Schmidt. Partition the basis 𝑢₁, 𝑢₂, ..., 𝑢ₙ into blocks of 𝑚 vectors, and orthogonalize the new vector against one block at a time. Within a block, all projections are computed independently (as in classical Gram-Schmidt); across blocks, the partially reduced vector is carried forward (as in modified Gram-Schmidt):

𝑣^(1) = 𝑣 − proj(𝑣, 𝑢₁) − proj(𝑣, 𝑢₂) − ... − proj(𝑣, 𝑢ₘ)
𝑣^(2) = 𝑣^(1) − proj(𝑣^(1), 𝑢ₘ₊₁) − ... − proj(𝑣^(1), 𝑢₂ₘ)
...
𝑢ₙ₊₁ = 𝑣^(𝑛/𝑚) / ‖𝑣^(𝑛/𝑚)‖

This hybrid approach combines the advantages of the previous two: it is numerically stable and GPU-friendly.
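A sketch of the hybrid step (the block size is a tunable; NumPy stands in for the GPU implementation):

```python
import numpy as np

def blocked_gs_step(V, r, block=8):
    # Modified Gram-Schmidt across blocks, classical within a block.
    for i in range(0, len(V), block):
        B = V[i:i + block]           # one block of basis vectors
        r = r - (B @ r) @ B          # independent projections (parallel)
    return r / np.linalg.norm(r)     # sequential only across blocks
```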


Page 25

Extract the Final Result

Input: matrix 𝑨, subspace basis 𝑽

Output: 𝑼, 𝜮, 𝑽: the best approximate SVD of 𝑨 within the subspace 𝑽.

Algorithm:

• Compute 𝑨𝑽, then (𝑨𝑽)ᵀ(𝑨𝑽), and its SVD 𝑼′𝜮′𝑽′ᵀ = (𝑨𝑽)ᵀ(𝑨𝑽);

• Let 𝑽 = 𝑽𝑽′, 𝜮 = (𝜮′)^(1/2), 𝑼 = (𝑨𝑽)𝑽′𝜮⁻¹;

• Return 𝑼, 𝜮, 𝑽.

Note that this is the SVD of a small 𝒌 × 𝒌 matrix.
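A NumPy sketch of this extraction (assuming 𝑽 is stored as an 𝑛 × 𝑘 matrix with orthonormal columns, and that 𝑨𝑽 has full column rank so 𝜮⁻¹ exists):

```python
import numpy as np

def extract_svd(A, V):
    AV = A @ V                                  # m x k
    Up, S2, VpT = np.linalg.svd(AV.T @ AV)      # SVD of a k x k matrix
    S = np.sqrt(S2)                             # Sigma = (Sigma')^(1/2)
    U = (AV @ VpT.T) / S                        # U = (A V) V' Sigma^(-1)
    return U, S, V @ VpT.T                      # V <- V V'
```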

Page 26

Partitioned QUIC-SVD

The advantage of the GPU is only apparent for large-scale problems, e.g., matrices larger than 10,000 × 10,000.

For dense matrices, problems at this scale quickly become out-of-core.

We modified our algorithm to use a matrix partitioning scheme, so that each partition can fit in GPU memory.
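A hypothetical sketch of the streaming pattern (a Python iterable of row blocks stands in for host-to-GPU transfers):

```python
# Row blocks are processed one at a time; per-block partial results
# are accumulated so A never needs to be resident all at once.
import numpy as np

def streamed_dots_and_mean(blocks, pivot_row):
    dots, row_sum, count = [], 0.0, 0
    for B in blocks:                  # each block fits in GPU memory
        dots.append(B @ pivot_row)    # inner products for this block
        row_sum = row_sum + B.sum(axis=0)
        count += len(B)
    return np.concatenate(dots), row_sum / count
```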

Page 27

Partitioned QUIC-SVD

Partition the matrix into fixed-size blocks

(Figure: 𝑨 partitioned row-wise into blocks 𝑨₁, 𝑨₂, ..., 𝑨ₛ.)



Page 30

Partitioned QUIC-SVD

The termination criterion is still guaranteed to hold.

Page 31

Implementation Details

Main implementation done in CUDA.

The CULA library is used to extract the final SVD (this involves a 𝑘 × 𝑘 matrix, where 𝑘 ≪ 𝑛).

Selection of the node to split is done with a priority queue maintained on the CPU side.

Source code will be made available on our site: http://graphics.cs.umass.edu

Page 32

Performance

Test environment:

• CPU: Intel Core i7, 2.66 GHz (8 hardware threads), 6 GB memory

• GPU: NVIDIA GTX 480

Test data:

• Given a rank 𝑘 and matrix size 𝑛, we generate an 𝑛 × 𝑘 left matrix and a 𝑘 × 𝑛 right matrix, both filled with random numbers; their product gives a rank-𝑘 test matrix (sketched below).
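The construction, sketched in NumPy:

```python
# Build an n x n test matrix of exact rank k from two random factors.
import numpy as np

rng = np.random.default_rng(0)
n, k = 4096, 256
A = rng.random((n, k)) @ rng.random((k, n))   # rank-k n x n test matrix
```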

Result accuracy:

• Both CPU and GPU versions use double-precision floats; termination error 𝜖 = 10⁻¹², Monte Carlo estimation 𝛿 = 10⁻¹².

Page 33

Performance

Comparisons:

1. Matlab’s svds;

2. CPU-based QUIC-SVD (optimized, multi-threaded, implemented using the Intel Math Kernel Library);

3. GPU-based QUIC-SVD;

4. Tygert SVD (an approximate SVD algorithm based on random projection, implemented in Matlab).

Page 34

Performance – Running Time

Page 35

Performance – Speedup Factors

Page 36

Performance – Speedup Factors

Page 37

Performance – CPU vs. GPU QUIC-SVD

(Charts: running time and speedup.)

Page 38

Performance – GPU QUIC-SVD vs. Tygert SVD

(Charts: running time and speedup.)

Page 39

Integration with Manifold Alignment Framework

Page 40

Conclusion

A fast GPU-based implementation of an approximate SVD algorithm.

A reasonable speedup (up to 6-7×) over an optimized CPU implementation, as well as over Tygert SVD.

Our partitioned version allows processing out-of-core matrices.

Page 41

Future Work

Support for sparse matrices.

Applications in solving large graph Laplacians.

Components from this project can become building blocks for other algorithms: random projection trees, diffusion wavelets, etc.

Page 42

Acknowledgements

Alexander Gray

PPAM reviewers

NSF Grant FODAVA-1025120