

Blake Foster, Sridhar Mahadevan, Rui Wang

A GPU-based Approximate SVD Algorithm

University of Massachusetts Amherst

2

Introduction

Singular Value Decomposition (SVD)

• 𝑨: 𝑚 × 𝑛 matrix (𝑚 ≥ 𝑛)

• 𝑼, 𝑽: orthogonal matrices

• 𝜮: diagonal matrix of singular values

𝑨 = 𝑼𝜮𝑽ᵀ

Exact SVD can be computed in 𝑶(𝒎𝒏²) time.
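
As a quick check of these definitions (an illustration added here, not from the talk), the thin SVD in NumPy, assuming 𝑚 ≥ 𝑛:

import numpy as np

m, n = 6, 4
A = np.random.rand(m, n)                          # m x n input matrix (m >= n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD
# U: m x n with orthonormal columns, s: the n singular values, Vt: n x n orthogonal
assert np.allclose(A, U @ np.diag(s) @ Vt)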

3

Introduction

Approximate SVD

• 𝑨: 𝑚 × 𝑛 matrix, whose intrinsic dimensionality is far less than 𝑛

Usually sufficient to compute the 𝑘 (≪ 𝑛) largest singular values and corresponding vectors.

An approximate SVD can be computed in 𝑶(𝒎𝒏𝒌) time (compared to 𝑶(𝒎𝒏²) for the full SVD).

4

Introduction

Sample-based Approximate SVD

• Construct a basis 𝑽 by directly sampling or taking linear combinations of rows from 𝑨.

• An approximate SVD can be extracted from the projection of 𝑨 onto 𝑽 .


Sampling Methods

• Length-squared [Frieze et al. 2004, Deshpande et al. 2006] (a sketch follows below)

• Random projection [Friedland et al. 2006, Sarlos et al. 2006, Rokhlin et al. 2009]
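
For intuition, a minimal NumPy sketch (mine, not the talk's code) of the length-squared variant: sample rows with probability proportional to their squared norms, orthonormalize them into 𝑽, and use 𝑨𝑽𝑽ᵀ as the low-rank approximation.

import numpy as np

def length_squared_basis(A, k, rng=np.random.default_rng(0)):
    # Sample k rows with probability proportional to their squared magnitudes.
    p = (A * A).sum(axis=1)
    idx = rng.choice(A.shape[0], size=k, replace=False, p=p / p.sum())
    # Orthonormalize the sampled rows; the columns of V span part of the row space of A.
    V, _ = np.linalg.qr(A[idx].T)      # V is n x k with orthonormal columns
    return V

A = np.random.rand(500, 20) @ np.random.rand(20, 300)        # low intrinsic dimensionality (rank 20)
V = length_squared_basis(A, 25)
print(np.linalg.norm(A - A @ V @ V.T) / np.linalg.norm(A))   # small relative error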

6

Introduction

QUIC-SVD [Holmes et al. NIPS 2008]

• Sample-based approximate SVD.

• Progressively constructs a basis 𝑽 using a cosine tree.

• Results have controllable L2 error.

• Basis construction adapts to the matrix and the desired error.

7

Our Goal

Map QUIC-SVD onto the GPU (Graphics Processing Unit)

• Powerful data-parallel computation platform.

• High computation density, high memory bandwidth.

• Relatively low-cost.

NVIDIA GTX 580: 512 cores, 1.6 TFLOPS, 3 GB memory, 200 GB/s bandwidth, $549

8

Related Work

OpenGL-based: rasterization, framebuffers, shaders

• [Galoppo et al. 2005]

• [Bondhugula et al. 2006]

CUDA-based:

• Matrix factorizations [Volkov et al. 2008]

• QR decomposition [Kerr et al. 2009]

• SVD [Lahabar et al. 2009]

• Commercial GPU linear algebra package [CULA 2010]

9

Our Work

A CUDA implementation of QUIC-SVD.

Up to 𝟔~𝟕 times speedup over an optimized CPU version.

A partitioned version that can process large, out-of-core matrices on a single GPU.

• The largest we tested were 22,000 × 22,000 matrices (double-precision floating point).

10

Overview of QUIC-SVD

[Holmes et al. 2008]

[Figure: the input matrix 𝑨 and the basis 𝑽 to be constructed]

11

Overview of QUIC-SVD

Start from a root node that owns the entire matrix (all rows).


12

Overview of QUIC-SVD

Select a pivot row by importance sampling the squared magnitudes of the rows.


13

Overview of QUIC-SVD

Compute the inner product of every row with the pivot row (this is why it is called a cosine tree).


14

Overview of QUIC-SVD

Based on the inner products, the rows are partitioned into two subsets, each represented by a child node.

[Figure: rows whose inner products are closer to zero go to one child; the rest go to the other]

15

Overview of QUIC-SVD

Compute the row mean (average) of each subset and add it to the current basis set, keeping the basis orthogonalized.


16

Overview of QUIC-SVD

Repeat, until the estimated L2 error is below a threshold. The basis set now provides a good low-rank approximation.


17

Overview of QUIC-SVD

Monte Carlo Error Estimation:

Input: any subset of rows 𝑨𝑠, the basis 𝑽 , 𝛿 ∈ [0,1]

Output: an estimated error 𝜖̂, s.t. with probability at least 1 − 𝛿, the L2 error ‖𝑨𝑠 − 𝑨𝑠𝑽𝑽ᵀ‖² ≤ 𝜖̂.

Termination criterion: stop once the estimated whole-matrix error satisfies ‖𝑨 − 𝑨𝑽𝑽ᵀ‖² ≤ 𝜖‖𝑨‖².
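
The precise estimator and its 𝛿-dependent confidence term are in Holmes et al. 2008; as a rough illustration only, the basic Monte Carlo part, which scales the average squared residual of a few sampled rows up to the whole subset, looks like this:

import numpy as np

def mc_squared_error(A_s, V, n_samples=32, rng=np.random.default_rng(0)):
    # Unbiased estimate of ||A_s - A_s V V^T||^2 (squared L2 / Frobenius error)
    # from a handful of uniformly sampled rows of A_s.
    rows = rng.integers(0, A_s.shape[0], size=n_samples)
    R = A_s[rows] - (A_s[rows] @ V) @ V.T          # residuals of the sampled rows
    return A_s.shape[0] * (R * R).sum(axis=1).mean()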

18

Overview of QUIC-SVD

5 Steps:

1. Select a leaf node with maximum estimated L2 error;

2. Split the node and create two child nodes;

3. Calculate the mean vector of each child and insert it into the basis set 𝑽 (replacing the parent's vector), keeping the basis orthogonalized;

4. Estimate the error of newly added child nodes;

5. Repeat steps 1–4 until the termination criterion is satisfied (a rough sketch of this loop follows below).
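
A compact NumPy sketch of the whole loop (a simplification of mine, not the authors' code: it uses exact per-node errors rather than the Monte Carlo estimate, appends the children's mean vectors instead of replacing the parent's, and splits each node into equal halves by |inner product|):

import numpy as np

def orthogonalize(v, V):
    # Remove the components of v that lie in the span of the current basis columns.
    if V.shape[1]:
        v = v - V @ (V.T @ v)
    nrm = np.linalg.norm(v)
    return v / nrm if nrm > 1e-12 else None

def quic_svd_basis(A, rel_err=1e-6, rng=np.random.default_rng(0)):
    m, n = A.shape
    V = np.zeros((n, 0))                            # current orthonormal basis (columns)
    leaves = [np.arange(m)]                         # each leaf node owns a subset of row indices
    target = rel_err * np.linalg.norm(A) ** 2

    def leaf_error(idx):                            # exact squared L2 error of a leaf
        R = A[idx] - (A[idx] @ V) @ V.T
        return (R * R).sum()

    for _ in range(2 * min(m, n)):                  # hard cap, just to keep the sketch safe
        errs = [leaf_error(idx) for idx in leaves]
        if sum(errs) <= target:                     # termination criterion
            break
        idx = leaves.pop(int(np.argmax(errs)))      # 1. leaf with maximum error
        if len(idx) < 2:
            leaves.append(idx)
            break
        sq = (A[idx] ** 2).sum(axis=1)              # 2. split: importance-sample a pivot row,
        pivot = A[rng.choice(idx, p=sq / sq.sum())] #    then partition by inner product with it
        order = np.argsort(np.abs(A[idx] @ pivot))  #    ("closer to zero" rows come first)
        half = len(idx) // 2
        for child in (idx[order[:half]], idx[order[half:]]):
            leaves.append(child)                    # 3. keep the child; add its mean row to the
            u = orthogonalize(A[child].mean(axis=0), V)   # basis, kept orthonormal
            if u is not None:
                V = np.column_stack([V, u])
    return V                                        # columns span the low-rank approximation A V V^T

A = np.random.rand(400, 30) @ np.random.rand(30, 250)        # low-rank test matrix
V = quic_svd_basis(A)
print(V.shape[1], np.linalg.norm(A - A @ V @ V.T) / np.linalg.norm(A))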

19

GPU Implementation of QUIC-SVD

The computation cost is mostly in constructing the tree.

Most expensive steps:

• Computing vector inner products and row means

• Basis orthogonalization

20

Parallelize Vector Inner Products and Row Means

Compute all inner products and row means in parallel.

For a large matrix, there are enough rows to engage all GPU threads.

An index array is used to point to the physical rows.
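
A CPU-side mock of what this kernel does (a sketch; the real code is a CUDA kernel, with one thread handling one row): the index array gathers the node's physical rows, and the inner products and row mean are computed for all of them at once.

import numpy as np

A = np.random.rand(8192, 512)                 # the matrix rows, resident on the GPU in practice
row_idx = np.arange(0, 8192, 3)               # index array: physical rows owned by one tree node
pivot = A[row_idx[0]]                         # the sampled pivot row

dots = A[row_idx] @ pivot                     # one inner product per owned row (one thread per row)
row_mean = A[row_idx].mean(axis=0)            # row mean of the node's subset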

21

Parallelize Gram Schmidt Orthogonalization

Classical Gram Schmidt: Assume 𝑣₁, 𝑣₂, 𝑣₃, …, 𝑣ₙ are in the current basis set (i.e. they are orthonormal vectors); to add a new vector 𝑟:

𝑟 = 𝑟 − proj(𝑟, 𝑣₁) − proj(𝑟, 𝑣₂) − … − proj(𝑟, 𝑣ₙ), where proj(𝑟, 𝑣) = (𝑟 ⋅ 𝑣) 𝑣

𝑣ₙ₊₁ = 𝑟 / ‖𝑟‖

This is easy to parallelize (as each projection is independently calculated), but it has poor numerical stability.
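
Stacking the basis vectors as the columns of a matrix 𝑽 makes the data-parallelism explicit (a sketch): all 𝑛 projections reduce to one matrix-vector product each way.

import numpy as np

def classical_gs_step(r, V):
    # V: orthonormal basis vectors as columns; subtract all projections of r at once.
    r = r - V @ (V.T @ r)
    return r / np.linalg.norm(r)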

22

Parallelize Gram Schmidt Orthogonalization

Modified Gram Schmidt:

𝑟₁ = 𝑟 − proj(𝑟, 𝑣₁),
𝑟₂ = 𝑟₁ − proj(𝑟₁, 𝑣₂),
… …
𝑟ₙ = 𝑟ₙ₋₁ − proj(𝑟ₙ₋₁, 𝑣ₙ),
𝑣ₙ₊₁ = 𝑟ₙ / ‖𝑟ₙ‖

This has good numerical stability, but is hard to parallelize, as the computation is sequential!

23

Parallelize Gram Schmidt Orthogonalization

Our approach: Blocked Gram Schmidt

This hybrid approach combines the advantages of the previous two – numerically stable, and GPU-friendly.

𝑣(1) = 𝑣 − proj(𝑣, 𝑢₁) − proj(𝑣, 𝑢₂) − … − proj(𝑣, 𝑢ₘ)

𝑣(2) = 𝑣(1) − proj(𝑣(1), 𝑢ₘ₊₁) − proj(𝑣(1), 𝑢ₘ₊₂) − … − proj(𝑣(1), 𝑢₂ₘ)

… …

𝑣(𝑛/𝑚) = 𝑣(𝑛/𝑚−1) − proj(𝑣(𝑛/𝑚−1), 𝑢ₙ₋ₘ₊₁) − … − proj(𝑣(𝑛/𝑚−1), 𝑢ₙ)

where 𝑢₁, …, 𝑢ₙ are the current basis vectors, processed in blocks of 𝑚: the projections within each block are subtracted in parallel (as in classical GS), and the blocks are processed one after another (as in modified GS).
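
A sketch of blocked Gram-Schmidt as I read the equations above (the block size b is a tunable parameter, and the names are mine): classical Gram-Schmidt inside each block of basis vectors, modified Gram-Schmidt from block to block. Setting b = 1 recovers modified GS; b equal to the basis size recovers classical GS.

import numpy as np

def blocked_gs(v, U, b=32):
    # U: n x k matrix whose columns are the current orthonormal basis vectors.
    # Orthogonalize the new vector v against them, one block of b columns at a time.
    for start in range(0, U.shape[1], b):
        B = U[:, start:start + b]        # one block of basis vectors
        v = v - B @ (B.T @ v)            # subtract all projections onto this block at once
    return v / np.linalg.norm(v)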

24

Extract the Final Result

Input: matrix 𝑨, basis set 𝑽

Output: 𝑼, 𝜮, 𝑽, the best approximate SVD of 𝑨 within the subspace spanned by 𝑽.

Algorithm:

• Compute 𝑨𝑽, then (𝑨𝑽)ᵀ𝑨𝑽, and its SVD 𝑼′𝜮′𝑽′ᵀ = (𝑨𝑽)ᵀ𝑨𝑽

• Let 𝑽 = 𝑽𝑽′, 𝜮 = (𝜮′)^(1/2), 𝑼 = (𝑨𝑽)𝑽′𝜮⁻¹

• Return 𝑼, 𝜮, 𝑽.


Note: (𝑨𝑽)ᵀ𝑨𝑽 is only a 𝒌 × 𝒌 matrix, so the inner SVD is cheap (a NumPy transcription follows below).
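
A direct NumPy transcription of the extraction above (a sketch; it assumes 𝑽 has orthonormal columns and that 𝑨𝑽 has full column rank, so 𝜮⁻¹ exists):

import numpy as np

def extract_svd(A, V):
    AV = A @ V                                    # m x k
    _, sp, Vpt = np.linalg.svd(AV.T @ AV)         # SVD of the small k x k matrix
    Sigma = np.sqrt(sp)                           # singular values of the approximation
    U = (AV @ Vpt.T) / Sigma                      # U = (A V) V' Sigma^{-1} (column-wise scaling)
    return U, Sigma, V @ Vpt.T                    # final V = V V'

A = np.random.rand(300, 20) @ np.random.rand(20, 200)   # rank-20 test matrix
Vb = np.linalg.qr(A[:20].T)[0]                          # orthonormal basis from 20 sampled rows
U, S, Vf = extract_svd(A, Vb)
print(np.linalg.norm(A - (U * S) @ Vf.T) / np.linalg.norm(A))   # tiny: rank-20 A is recovered up to round-off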

26

Partitioned QUIC-SVD

The advantage of the GPU is only obvious for large-scale problems, e.g. matrices larger than 10,000 × 10,000.

For dense matrices, this will soon become an out-of-core problem.

We modified our algorithm to use a matrix partitioning scheme, so that each partition can fit in GPU memory.

27

Partitioned QUIC-SVD

Partition the matrix into fixed-size blocks 𝑨₁, 𝑨₂, …, 𝑨ₛ, each small enough to fit in GPU memory.
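
One way to picture the partitioned computation (a sketch under my assumption that the blocks are row blocks): the products against 𝑨 are accumulated block by block, so only one 𝑨ᵢ has to be resident in GPU memory at a time.

import numpy as np

def blocked_AV_and_gram(A_blocks, V):
    # Compute A V and (A V)^T (A V) while holding only one row block A_i at a time.
    k = V.shape[1]
    AV_parts, G = [], np.zeros((k, k))
    for Ai in A_blocks:                 # in the GPU version, A_i would be copied to the device here
        AVi = Ai @ V
        AV_parts.append(AVi)
        G += AVi.T @ AVi
    return np.vstack(AV_parts), G

A = np.random.rand(1200, 300)
V = np.linalg.qr(np.random.rand(300, 40))[0]
AV, G = blocked_AV_and_gram(np.array_split(A, 6), V)    # 6 row blocks
assert np.allclose(AV, A @ V) and np.allclose(G, AV.T @ AV)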


29

Partitioned QUIC-SVD

30

Partitioned QUIC-SVD

The termination criterion is still guaranteed in the partitioned version.

31

Implementation Details

Main implementation done in CUDA.

Use the CULA library to extract the final SVD (this involves processing a 𝑘 × 𝑘 matrix, where 𝑘 ≪ 𝑛).

Selection of the splitting node is done using a priority queue, maintained on the CPU side.
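
A minimal CPU-side sketch of that priority queue (the names are mine; Python's heapq is a min-heap, so errors are negated to pop the largest first):

import heapq

pq = []                                        # entries: (-estimated_error, node_id)

def push_leaf(node_id, est_error):
    heapq.heappush(pq, (-est_error, node_id))

def pop_max_error_leaf():
    # Returns the leaf node with the maximum estimated L2 error.
    neg_err, node_id = heapq.heappop(pq)
    return node_id, -neg_err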

Source code will be made available on our site http://graphics.cs.umass.edu

32

Performance

Test environment:

• CPU: Intel Core i7 2.66 GHz (8 hardware threads), 6 GB memory

• GPU: NVIDIA GTX 480

Test data:

• Given a rank 𝑘 and matrix size 𝑛, we generate an 𝑛 × 𝑘 left matrix and a 𝑘 × 𝑛 right matrix, both filled with random numbers; their product gives the test matrix (see the snippet below).

Result accuracy:

• Both CPU and GPU versions use double-precision floats; termination error 𝜖 = 10⁻¹², Monte Carlo estimation 𝛿 = 10⁻¹².
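
The rank-𝑘 test matrices described above are straightforward to reproduce (a sketch):

import numpy as np

def make_test_matrix(n, k, rng=np.random.default_rng(0)):
    # Product of a random n x k matrix and a random k x n matrix: an n x n matrix of rank at most k.
    return rng.random((n, k)) @ rng.random((k, n))

A = make_test_matrix(4000, 100)      # e.g., a 4000 x 4000 test matrix of rank 100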

33

Performance

Comparisons:

1. Matlab's svds;

2. CPU-based QUIC-SVD (optimized, multi-threaded, implemented using the Intel Math Kernel Library);

3. GPU-based QUIC-SVD;

4. Tygert SVD (an approximate SVD algorithm based on random projection, implemented in Matlab).

34

Performance – Running Time

35

Performance – Speedup Factors

36

Performance – Speedup Factors

37

Performance – CPU vs. GPU QUIC-SVD

[Charts: running time and speedup]

38

Performance – GPU QUIC-SVD vs. Tygert SVD

[Charts: running time and speedup]

39

Integration with Manifold Alignment Framework

40

Conclusion

A fast GPU-based implementation of an approximate SVD algorithm.

Reasonable (up to 6~7×) speedup over an optimized CPU implementation and over Tygert SVD.

Our partitioned version allows processing out-of-core matrices.

41

Future Work

Support processing of sparse matrices.

Applications in solving large graph Laplacians.

Components from this project can become building blocks for other algorithms: random projection trees, diffusion wavelets, etc.

42

Acknowledgements

Alexander Gray

PPAM reviewers

NSF Grant FODAVA-1025120
