
Page 1: CSE 690: GPGPU Lecture 7: Matrix Multiplications

CSE 690: GPGPU

Lecture 7: Matrix Multiplications

Klaus Mueller

Computer Science, Stony Brook University

Page 2: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Basic Concept

• Triple loop over rows, columns, and the inner dimension (sketched in C below)
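
A minimal sketch of the triple loop in C, assuming row-major float arrays (the names and layout are illustrative, not from the slides):

    /* Naive triple-loop multiply C = A * B for N x N matrices, row-major. */
    void matmul_naive(int N, const float *A, const float *B, float *C)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float sum = 0.0f;
                for (int k = 0; k < N; k++)
                    sum += A[i * N + k] * B[k * N + j];
                C[i * N + j] = sum;
            }
    }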

Page 3: CSE 690: GPGPU Lecture 7: Matrix Multiplications

GPU Algorithms

• First algorithm: render a rectangle of size NxN
  represent the matrices as NxN textures
  each (i,j) is then a fragment (see the sketch below)
  each fragment program is a loop or an unrolled loop -> may get too long
  must pull in the same data many times -> poor data reuse, needs bandwidth
  makes no use of 4-way RGBA parallelism -> wastes speedup
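
As an added illustration (not from the slides), the work of one fragment can be sketched in plain C, modeling each texture fetch as an array read:

    /* Per-fragment work for output element (i, j) in the first algorithm.
       On the GPU this body is the fragment program, run once per fragment
       of the NxN rectangle. */
    float fragment_ij(int N, int i, int j, const float *texA, const float *texB)
    {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)                    /* loop (or unrolled loop) */
            sum += texA[i * N + k] * texB[k * N + j];  /* 2 fetches per multiply-add */
        return sum;                                    /* becomes C(i, j) */
    }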

Page 4: CSE 690: GPGPU Lecture 7: Matrix Multiplications

GPU Algorithms

• Better algorithm: use the RGBA channels to pack a 2x2 submatrix per texel
  use swizzling to facilitate data reuse (see the sketch below)
  swizzling shortens the fragment code by a factor of 2
  may need multiple passes for larger matrices
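
A C sketch of the 2x2 block product, under an assumed packing convention (top-left in .x, top-right in .y, bottom-left in .z, bottom-right in .w); in a fragment program the component selections below would be swizzles:

    /* One RGBA texel holds a 2x2 submatrix (layout assumed as above). */
    typedef struct { float x, y, z, w; } float4;

    /* Accumulate the product of two 2x2 blocks into *c.
       Each fetched texel feeds several multiply-adds, which is the
       data reuse the swizzles provide on the GPU. */
    void mad_2x2(float4 *c, float4 a, float4 b)
    {
        c->x += a.x * b.x + a.y * b.z;
        c->y += a.x * b.y + a.y * b.w;
        c->z += a.z * b.x + a.w * b.z;
        c->w += a.z * b.y + a.w * b.w;
    }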

Page 5: CSE 690: GPGPU Lecture 7: Matrix Multiplications

GPU Algorithms

• Using multi-texturing requires l passes

Page 6: CSE 690: GPGPU Lecture 7: Matrix Multiplications

GPU Algorithms

• Can use RGBA parallelism as well
  each texel represents a 2x2 submatrix
  use swizzling as usual
  needs l/2 passes (see the multipass sketch below)
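
A rough C sketch of the multipass accumulation, reusing float4 and mad_2x2 from the previous sketch; the assumption here (not stated on the slide) is that each rendering pass accumulates one packed term of the dot product, so packing 2x2 blocks halves the pass count:

    /* Matrices packed into (N/2) x (N/2) grids of 2x2 texels; N assumed even. */
    void matmul_multipass(int N, const float4 *A, const float4 *B, float4 *C)
    {
        int n2 = N / 2;                          /* packed dimension */
        for (int pass = 0; pass < n2; pass++)    /* one full-screen pass per packed term */
            for (int i = 0; i < n2; i++)
                for (int j = 0; j < n2; j++)
                    mad_2x2(&C[i * n2 + j], A[i * n2 + pass], B[pass * n2 + j]);
    }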

Page 7: CSE 690: GPGPU Lecture 7: Matrix Multiplications

GPU Algorithms

• Instead of a 2x2 submatrix, pack 4x1 column vectors
  reuses each texel read from B four times, but uses texels from A only once

Page 8: CSE 690: GPGPU Lecture 7: Matrix Multiplications

GPU Algorithms

• Instead of a 2x2 submatrix, pack 4x1 column vectors
  6 fetches are needed for 4 MADs (multiply-adds) -> 1.5 times more than before
  but fewer rows and columns are accessed per pass -> improves cache hit frequency
  (see the sketch of one inner step below)
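
An added C sketch of one inner step of the 4x1 scheme, under assumed conventions: a0..a3 are four adjacent 4x1 columns of a 4x4 block of A, b is one 4x1 column of B, and acc is the running 4x1 column of C from the previous pass, which together give the 6 fetches and 4 four-component MADs counted above:

    typedef struct { float x, y, z, w; } float4;   /* one 4x1 column per texel */

    /* acc += [a0 a1 a2 a3] * b, written as four four-component MADs.
       Each component of b is reused across a whole MAD (4x reuse of B),
       while every A texel is used exactly once. */
    float4 mad4_step(float4 acc, float4 a0, float4 a1, float4 a2, float4 a3, float4 b)
    {
        acc.x += a0.x * b.x;  acc.y += a0.y * b.x;  acc.z += a0.z * b.x;  acc.w += a0.w * b.x;  /* MAD 1 */
        acc.x += a1.x * b.y;  acc.y += a1.y * b.y;  acc.z += a1.z * b.y;  acc.w += a1.w * b.y;  /* MAD 2 */
        acc.x += a2.x * b.z;  acc.y += a2.y * b.z;  acc.z += a2.z * b.z;  acc.w += a2.w * b.z;  /* MAD 3 */
        acc.x += a3.x * b.w;  acc.y += a3.y * b.w;  acc.z += a3.z * b.w;  acc.w += a3.w * b.w;  /* MAD 4 */
        return acc;
    }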

Page 9: CSE 690: GPGPU Lecture 7: Matrix Multiplications

GPU Algorithms

• Originally only one product is computed per shader
  in practice the loop can be unrolled 3-6 times (computing 3-6 products)
  the maximum fragment program length is the limit
  unrolling reduces the number of passes required (sketched below)
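
A sketch of the unrolling idea on top of the earlier multipass code; UNROLL stands in for the 3-6 products the slide mentions, and float4 / mad_2x2 are reused from the 2x2 sketch:

    #define UNROLL 4   /* illustrative; the slide quotes 3-6 products per shader */

    /* Each pass now accumulates UNROLL packed terms per fragment instead of one,
       so the pass count drops from n2 to roughly n2 / UNROLL, limited in
       practice by the maximum fragment program length. */
    void matmul_unrolled(int n2, const float4 *A, const float4 *B, float4 *C)
    {
        for (int pass = 0; pass < n2; pass += UNROLL)
            for (int i = 0; i < n2; i++)
                for (int j = 0; j < n2; j++)
                    for (int u = 0; u < UNROLL && pass + u < n2; u++)
                        mad_2x2(&C[i * n2 + j], A[i * n2 + pass + u], B[(pass + u) * n2 + j]);
    }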

Page 10: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Reality Check

• Would like to compare CPU and GPU efficiencies for GPGPU tasks

• The task of matrix multiplication is insightful here
  it features much data reuse
  graphics programs are generally more stream-like and have less data reuse
  this may lead to some limitations

Page 11: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Platforms

• Pentium 4, 3 GHz CPU, 512 KB L2 cache
  12 GFLOPS peak compute
  44.1 GB/sec cache bandwidth
  using the sgemm routine from the ATLAS package

• NVIDIA
  GeForce 5900 Ultra
  GeForce 6800 Ultra

• ATI
  Radeon 9800 XT
  Radeon X800 XT PE

Page 12: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Performance

Page 13: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Bandwidth vs. Peak FLOPS

Page 14: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Analysis

• Currently: GPUs can fetch 16 floats and perform 16 4-component MADs per clock
  our app fetches 8 floats to perform one 4-component MAD -> not enough computation
  need more math ops per float fetched (> 8)
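
Reading those numbers at face value (an added note, not on the slide): 16 four-component MADs are 16 x 4 x 2 = 128 floating-point operations per clock against 16 fetched floats, so the hardware balances at 8 operations per fetched float; the shader above performs one four-component MAD (8 operations) per 8 fetched floats, i.e. about 1 operation per float.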


Page 16: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Analysis

• Pentium processors have large L1 caches to boost memory bandwidth (bw)
  their bw / compute ratio is better
  main reason for the only small performance gain achieved with GPUs for matrix multiplications

Page 17: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Analysis

• Expectations
  make sure that there is enough arithmetic per data item fetched
  lots of data reuse in the algorithm / task will make the CPU look better
  streaming data are OK -> they don’t “suffer” from reuse
  matrix multiplication is an excellent reality-check example

Page 18: CSE 690: GPGPU Lecture 7: Matrix Multiplications

Analysis

• What do GPUs need?
  bigger caches to enable larger blocks
  currently there are enough registers to store a 6x6 submatrix, but shaders can only produce a small number of outputs -> limits the amount of blocking
  full floating-point accumulator registers
  a wider path between texture and register files

Page 19: CSE 690: GPGPU Lecture 7: Matrix Multiplications

References

E. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” Supercomputing 2001.

J. Hall, N. Carr, and J. Hart, “Cache and bandwidth aware matrix multiplication on the GPU,” Tech. Report UIUCDCS-R-2003-2328-1.

K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” Graphics Hardware Workshop 2004.