Lecture 5: HW1 Discussion, Intro to GPUs
G63.2011.002/G22.2945.001 · October 5, 2010
Outline

Discuss HW1

Intro to GPU Computing
Dense Matrix Multiply: Blocking vs Scalar
We provided a blocked example matrix multiplication code. Why is blocked matmul faster than un-blocked?

Key: Computational Intensity

Definition: flops per FPN (floating-point number) moved up the memory hierarchy. For example, a dot product performs 2N flops while moving 2N FPNs: intensity about 1.

Large intensity: good for deep memory hierarchies.
Computational Intensity for Scalar Matmul

Floating point operations: 2N^3

Assume: Size(L1) << N^2 FPNs

  N^2    read each row of A once
+ N^3    read each column of B, N times
+ 2N^2   read/write C

= N^3 + 3N^2 FPN-size cache misses
(neglecting cache lines, etc.)

Computational intensity: 2N^3 / (N^3 + 3N^2) ≈ 2
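For concreteness, a minimal sketch of the un-blocked triple loop analyzed above (row-major storage and the function shape are illustrative assumptions, not the provided HW1 code):

/* Scalar matmul sketch: C += A*B for N x N row-major matrices.
 * 2*N^3 flops; the inner loop walks B with stride N, which is what
 * drives the N^3 cache-miss term above. */
void matmul_scalar(int N, const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
        {
            double sum = 0;
            for (int k = 0; k < N; ++k)
                sum += A[i*N + k] * B[k*N + j];
            C[i*N + j] += sum;
        }
}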
Computational Intensity for Blocked Matmul

Floating point operations: still 2N^3

b: block size, n = ⌈N/b⌉

  b^2 n^3   load a b × b block of A on each of the n^3 block iterations
+ b^2 n^3   same for B
+ 2N^2      read/write C

= 2 b^2 n^3 + 2N^2 FPN-size cache misses

Rewrite: b^2 n^3 ≈ b^2 · N^3/b^3 = N^3/b

Computational intensity:
2N^3 / (2N^3/b + 2N^2) ≈ 2N^3 / (2N^3/b) = b

→ incentive to choose b as large as possible.

The power of assumptions: can we choose b = N?
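A minimal sketch of the blocked variant (assuming, for brevity, that the illustrative block size BLK divides N; fringe handling omitted):

/* Blocked matmul sketch: each (i,j,k) block iteration touches one
 * BLK x BLK block of A and one of B, giving the 2*b^2*n^3 miss count. */
#define BLK 32 /* block size b; illustrative value */

void matmul_blocked(int N, const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; i += BLK)
        for (int j = 0; j < N; j += BLK)
            for (int k = 0; k < N; k += BLK)
                /* multiply the current pair of blocks */
                for (int ii = i; ii < i + BLK; ++ii)
                    for (int jj = j; jj < j + BLK; ++jj)
                    {
                        double sum = C[ii*N + jj];
                        for (int kk = k; kk < k + BLK; ++kk)
                            sum += A[ii*N + kk] * B[kk*N + jj];
                        C[ii*N + jj] = sum;
                    }
}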
Hatching a Plan
Consider each level of the memory hierarchy.
How do we exploit. . .
• . . . L2: ignore; we're nearly L2-local at most sizes.
• . . . L1: 32 KiB = 4096 doubles (8-byte FPNs). Key: memory layout.
• . . . registers: 16 FP registers. Key: loop/operation ordering.
Optimizing for L1: Memory Layout

Memory layout of A: column-major.

Only use one entry of each cache line per fetch.

Better to store A in row-major order.

The input is row-major. If memory is available (not swap!), storing a transposed copy of A can be a good idea. (The copy takes O(N^2) time.)
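Such a transposed copy might look like the following sketch (O(N^2) work, done once before the O(N^3) multiply; the helper name is hypothetical):

/* Make a transposed copy of A so the kernel can walk it with unit
 * stride instead of touching one entry per cache line. */
void transpose_copy(int N, const double *A, double *At)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            At[j*N + i] = A[i*N + j];
}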
Optimizing for L1: Reuse Pattern, Block Size

Question: blocking is a good idea, but what is the optimal b_L1?

Follow-up question: how much needs to fit in L1?

One block of each of A, B, C? Refined answer: all of the current block of A, plus one column each of the B and C blocks.

32 KiB: 8 b_L1^2 + 2 · 8 b_L1 ≤ 32768 → b_L1 ≤ 60
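Spelling out the arithmetic behind that bound (8-byte doubles; this derivation is added here, not on the original slide):

8 b_{L1}^2 + 2 \cdot 8\, b_{L1} \le 32768
\;\Longleftrightarrow\; b_{L1}^2 + 2 b_{L1} \le 4096
\;\Longrightarrow\; b_{L1} \le \sqrt{4097} - 1 \approx 63

so the slide's b_L1 ≤ 60 is a safe, round choice.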
L1 Block Copy

Further concerns:

• Cache line boundaries
• SIMD
• Cache set conflicts

All solved by the small-block copy optimization:
copy b_L1-sized blocks of A, B, and C into contiguous buffers, operate on those, then copy the output back.
L1 Block Copy: The Plan

Basic plan (a sketch of one block copy follows below):

for each i:
    for each j:
        load block C[i,j]
        for each k:
            load block A[i,k]
            load block B[k,j]
            run ⌈b_L1/b_r⌉^3 register kernels: C += A·B
        store block C[i,j]

(Can be improved: many A, B loads.)

Aside: this also neatly deals with fringes.

So: how does this solve the problems above? Can you define "alignment"?
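A sketch of one such block copy (the helper name, signature, and column-major destination layout are illustrative assumptions, not the HW1 solution code):

/* Gather a b x b block of row-major A (leading dimension N) into a
 * contiguous buffer: no cache-line straddling, no set conflicts, and
 * the buffer can be allocated SIMD-aligned. */
void load_block(int N, int b, const double *A, int i0, int k0,
                double *Ablk)
{
    for (int k = 0; k < b; ++k)
        for (int i = 0; i < b; ++i)
            Ablk[i + k*b] = A[(i0 + i)*N + (k0 + k)];
}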
Alignment

A memory address a is n-byte aligned when n is a power of two and a is a multiple of n bytes. (See also the IBM devWorks article.)

#include <stdlib.h>

/* dynamic allocation */
double *var;
int error = posix_memalign((void **) &var, 64, array_size);
if (error)
    abort();

/* static allocation */
double __attribute__((aligned(64))) ary2[500];

Examples: cache-line-aligned, SIMD-aligned.

Code generation in the non-aligned case?
Register Kernel

Choose block size b_r = 2^k, with b_L1 mod b_r = 0.

for (int j = 0; j < b_r; ++j)
    for (int k = 0; k < b_r; ++k)
        for (int i = 0; i < b_r; ++i)
            C[i + j*b_l1] += A[i + k*b_l1] * B[k + j*b_l1];

For each A·b matvec: perform b_r scalar-times-vector updates.

• Vectorizable
• Pipeline-friendly (minimal data dependencies)
• Access to A, C is unit-stride
• Access to B is inner-loop invariant
• Unrolling, software pipelining: leave to the compiler
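To illustrate the "vectorizable" claim, here is one way the kernel could look with SSE2 intrinsics. This is a hedged sketch assuming b_r and b_l1 are even and all three buffers are 16-byte aligned; it is not the HW1 solution code:

#include <emmintrin.h>

/* b_r scalar-times-vector updates, two doubles at a time. */
void register_kernel_simd(int b_l1, int b_r, const double *A,
                          const double *B, double *C)
{
    for (int j = 0; j < b_r; ++j)
        for (int k = 0; k < b_r; ++k)
        {
            /* B entry is inner-loop invariant: broadcast it once */
            __m128d b = _mm_load1_pd(&B[k + j*b_l1]);
            for (int i = 0; i < b_r; i += 2)
            {
                __m128d a = _mm_load_pd(&A[i + k*b_l1]); /* unit stride */
                __m128d c = _mm_load_pd(&C[i + j*b_l1]);
                _mm_store_pd(&C[i + j*b_l1],
                             _mm_add_pd(c, _mm_mul_pd(a, b)));
            }
        }
}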
Psychoanalyzing the Compiler

Flags for Intel:
-O3 -fno-alias -funroll-loops
-std=c99 -D_XOPEN_SOURCE=500
-opt-streaming-stores auto -static
-fast -xHost

Flags for GCC:
-O3 -funroll-loops -march=native
-std=c99 -D_XOPEN_SOURCE=500
-ftree-vectorizer-verbose=2
-ffast-math

GCC 4.3 is sometimes better than GCC 4.4.

Self-study material:

• Compiler references: Intel, GNU
• C99 restrict keyword, aliasing
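A small example of what restrict buys: without it, the compiler must assume x and y may alias and cannot vectorize freely. (The function itself is illustrative; -fno-alias makes the same promise globally on icc.)

/* C99 restrict: promise that x and y point to disjoint storage. */
void axpy(int n, double alpha, const double *restrict x,
          double *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i]; /* now safely vectorizable/unrollable */
}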
Profiling

OProfile: a sampling profiler. Uses performance counters. Linux only, needs root.

Many event types countable:

CPU_CLK_UNHALTED : clock cycles when not halted
L2_RQSTS : number of L2 cache requests
LLC_MISSES : L2 cache demand requests from this core that missed the L2
FLOPS : number of FP computational micro-ops executed
IDLE_DURING_DIV : cycles the divider is busy and all other execution units are idle
L1D_ALL_REF : all references to the L1 data cache
L1D_PEND_MISS : total number of outstanding L1 data cache misses at any cycle
IFU_MEM_STALL : cycles the instruction fetch pipe is stalled
INST_RETIRED : number of instructions retired
UOPS_RETIRED : number of uops retired
MACHINE_NUKES_SMC : number of pipeline flushing events
RAT_STALLS : partial register stall cycles
BR_INST_DECODED : number of branch instructions decoded

Annotated assembly (per-instruction sample counts and percentages for FLOPS and L1D_PEND_MISS):

FLOPS              L1D_PEND_MISS
     8 2.6e-04     18 0.7037   movsd 0x50(%rax),%xmm7
   187 0.0062       8 0.3127   movsd 0x58(%rax),%xmm5
     7 2.3e-04     24 0.9382   movsd 0x60(%rax),%xmm3
   470 0.0155      18 0.7037   movsd 0x68(%rax),%xmm4
    49 0.0016       9 0.3518   movsd 0x70(%rax),%xmm2
  2873 0.0950       7 0.2737   movsd 0x78(%rax),%xmm1
   434 0.0144       8 0.3127   xchg %ax,%ax
184312 6.0959      26 1.0164   movsd (%rdx),%xmm0
  2022 0.0669      14 0.5473   inc %esi
    19 6.3e-04      3 0.1173   mulsd (%rcx),%xmm0
  5294 0.1751     189 7.3886   addsd 0x30(%rsp),%xmm0
 31888 1.0547      68 2.6583   movsd %xmm0,(%rax)
 66032 2.1839      37 1.4464   movsd %xmm0,0x30(%rsp)
114001 3.7704      43 1.6810   movsd (%rcx),%xmm0
  1131 0.0374       3 0.1173   mulsd 0x8(%rdx),%xmm0
 11913 0.3940       2 0.0782   addsd %xmm0,%xmm14
 94565 3.1276      20 0.7819   movsd %xmm14,0x8(%rax)
108501 3.5885      25 0.9773   movsd (%rcx),%xmm0
     4 1.3e-04      1 0.0391   mulsd 0x10(%rdx),%xmm0
 76622 2.5342      81 3.1665   addsd %xmm0,%xmm15
 82075 2.7145      42 1.6419   movsd %xmm15,0x10(%rax)
119036 3.9370      36 1.4073   movsd (%rcx),%xmm0
     5 1.7e-04      0 0        mulsd 0x18(%rdx),%xmm0
  2700 0.0893       0 0        addsd %xmm0,%xmm12
 14861 0.4915      11 0.4300   movsd %xmm12,0x18(%rax)
Solution Performance

[Plot: MFlops/s (0 to 9000) vs. matrix dimension N (0 to 800) for the basic, tuned, and BLAS versions.]

git clone ssh://[email protected]:2234/hw1-solution.git
(Private; works if you signed up for an account.)

Great, but: most BLAS implementations lose out to triple loops for special-case matrices.

Want to see the code of a "real" BLAS? GotoBLAS2
Key Messages of HW1

In HPC:

• Very simple things quickly become rather complex.
• Need: ideas, careful analysis.
• Flexibility ↔ performance
• Run-time code generation can be useful.

This class helps by introducing

• known tricks
• helpful tools.

Matmul is a "microcosm" of single-processor optimization.

Do not worry if you did not figure out the tricks here on your own.
Questions?
Outline
Discuss HW1
Intro to GPU Computing
GPUs: System Context

[Figure: a mainboard with processor, memory, and expansion slots: PCI Express (x4, x16, x1, x16) and regular PCI.]

PCIe v2, x16 bandwidth: ~6 GB/s.

The GPU goes into one of the PCIe x16 slots.
GPU Computing?

• Design target for CPUs:
  • Make a single thread very fast
  • Take control away from the programmer

• GPU computing takes a different approach:
  • Throughput matters; single threads do not
  • Give explicit control to the programmer
"CPU-style" Cores

[Figure: a "CPU-style" core: Fetch/Decode, ALU (Execute), and Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a (big) data cache.]

Credit: Kayvon Fatahalian (Stanford), SIGGRAPH 2009: Beyond Programmable Shading, http://s09.idav.ucdavis.edu/
Slimming down

[Figure: the same core with only Fetch/Decode, ALU (Execute), and Execution Context remaining.]

Idea #1: Remove components that help a single instruction stream run fast.

Credit: Kayvon Fatahalian (Stanford)
More Space: Double the Number of Cores

[Figure: two cores processing two fragments in parallel, each running its own copy of the same shader:]

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

Credit: Kayvon Fatahalian (Stanford)
. . . again

[Figure: four cores, four fragments in parallel.]

Credit: Kayvon Fatahalian (Stanford)
. . . and again

[Figure: sixteen cores, sixteen fragments in parallel. 16 cores = 16 simultaneous instruction streams.]

→ 16 independent instruction streams

Reality: the instruction streams are not actually very different/independent.

Credit: Kayvon Fatahalian (Stanford)
Saving Yet More Space

[Figure: the simple core again; then one Fetch/Decode unit feeding eight ALUs (ALU 1-8), eight contexts (Ctx), and shared context data: SIMD processing.]

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs.

→ SIMD

Credit: Kayvon Fatahalian (Stanford)
Gratuitous Amounts of Parallelism!

[Figure: 128 fragments in parallel; 16 cores = 128 ALUs = 16 simultaneous instruction streams.]

Example: 128 instruction streams in parallel, as 16 independent groups of 8 synchronized streams.

Great if everybody in a group does the same thing.

But what if not?

What leads to divergent instruction streams?

Credit: Kayvon Fatahalian (Stanford)
Branches

[Figure: ALUs 1-8 over time. All lanes run the unconditional code together; at the branch, each lane evaluates the condition (mask T T T F F F F F), the "true" lanes execute the if-side while the others idle, then the roles flip for the else-side, and all lanes resume together:]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Not all ALUs do useful work! Worst case: 1/8 performance.

Credit: Kayvon Fatahalian (Stanford)
Remaining Problem: Slow Memory

Problem: memory still has very high latency...
...but we've removed most of the hardware that helps us deal with that.

We've removed

• caches
• branch prediction
• out-of-order execution

So what now?

Idea #3: Even more parallelism + some extra memory = a solution!

[Figure: the SIMD core's context storage partitioned into four groups of fragments (1-8, 9-16, 17-24, 25-32) whose execution can be interleaved.]

Credit: Kayvon Fatahalian (Stanford)
Hiding Memory Latency

[Figure sequence: four groups of fragments (1-8, 9-16, 17-24, 25-32) over time. Each group runs until it stalls on a memory access; the core then switches to the next runnable group. By the time the last group stalls, the first group's data has arrived and it is runnable again. Each individual group takes longer to finish, but the ALUs stay busy: the run time of one group is traded for the maximum throughput of many groups.]

Credit: Kayvon Fatahalian (Stanford)
GPU Architecture Summary

Core ideas:

1. Many slimmed-down cores → lots of parallelism

2. More ALUs, fewer control units

3. Avoid memory stalls by interleaving the execution of SIMD groups

Credit: Kayvon Fatahalian (Stanford)
GPU-CPU Bird’s Eye Comparison
Floorplan: VIA Isaiah (2008). 65 nm, 4 SP ops at a time, 1 MiB L2.

Floorplan: AMD RV770 (2008). 55 nm, 800 SP ops at a time.
Nvidia GTX200

[Figure: 30 identical SIMD cores, each containing one Fetch/Decode unit, 8 ALUs ("4×"), one DP ALU, 32 KiB of private context, and 16 KiB of shared context; all connected to off-chip memory at 150 GB/s.]
GPU Architecture (e.g. Nvidia GT200)
• 1 GPU = 30 SIMD cores

• 1 SIMD core: 32 × 32 PCs, HW scheduler + 1 instruction decoder (1/4 clock) + 8 SP ALUs + 1 DP ALU + 16 KiB shared + 32 KiB registers

• Device ↔ RAM: 140 GB/s

• Device ↔ Host: 6 GB/s

• User manages the memory hierarchy
What is OpenCL?

"OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors." [OpenCL 1.1 spec]

• Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
• Vendor-neutral
• Comes with RTCG (run-time code generation)

Defines:

• Host-side programming interface (library)
• Device-side programming language (!)
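A minimal host-plus-kernel sketch, assuming an OpenCL 1.1 implementation is installed; error checking is mostly omitted and the kernel is deliberately trivial. Note the RTCG aspect: the device code is compiled from a string at run time.

#include <stdio.h>
#include <CL/cl.h>

/* device-side code, held as a string and built at run time */
const char *src =
    "__kernel void twice(__global float *a)      \n"
    "{ int gid = get_global_id(0); a[gid] *= 2; }\n";

int main(void)
{
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* run-time code generation happens here */
    cl_program prg = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prg, 1, &dev, NULL, NULL, NULL);
    cl_kernel knl = clCreateKernel(prg, "twice", NULL);

    float host[4] = {1, 2, 3, 4};
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_COPY_HOST_PTR,
                                sizeof(host), host, NULL);
    clSetKernelArg(knl, 0, sizeof(buf), &buf);

    size_t gsize = 4; /* one work item per array entry */
    clEnqueueNDRangeKernel(q, knl, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host), host,
                        0, NULL, NULL);
    printf("%g %g %g %g\n", host[0], host[1], host[2], host[3]);
    return 0;
}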
Questions?
Image Credits
• Blocks: sxc.hu/Avolore
• Flag: sxc.hu/Ambrozjo
• Mainboard: Wikimedia Commons
• PCI Express slots: Wikimedia Commons
• Fighting chips: flickr.com/oskay
• Isaiah die shot: VIA Technologies
• RV770 die shot: AMD Corp.
• Nvidia Tesla architecture: Nvidia Corp.