Graphics Processing Unit (GPU) Architecture and Programming
TU/e 5kk73
Zhenyu Ye, Bart Mesman, Henk Corporaal
2010-11-08


Page 1:

Graphics Processing Unit (GPU) Architecture and Programming

TU/e 5kk73
Zhenyu Ye, Bart Mesman, Henk Corporaal

2010-11-08

Page 2:

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 3:

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 4:

System Architecture

Page 5:

GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)

Page 6:

What Can It Do?
Render triangles.

NVIDIA GTX480 can render 1.6 billion triangles per second!

Page 7:

General-Purpose Computing

ref: http://www.nvidia.com/object/tesla_computing_solutions.html

Page 8:

The Vision of NVIDIA

"Within the next few years, there will be single-chip graphics devices more powerful and versatile than any graphics system that has ever been built, at any price."

-- David Kirk, NVIDIA, 1998

Page 9:

Single-Chip GPU vs. Fastest Supercomputers

ref: http://www.llnl.gov/str/JanFeb05/Seager.html

Page 10:

Top500 Supercomputers in June 2010

Page 11:

GPU Will Top the List in Nov 2010

Page 12:

The Gap Between CPU and GPU

ref: Tesla GPU Computing Brochure

Page 13:

GPU Has 10x Comp Density

Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.

Page 14:

Evolution of Intel Pentium

Chip area breakdown: Pentium I, Pentium II, Pentium III, Pentium IV

Q: What can you observe? Why?

Page 15:

Extrapolation of Single-Core CPU
If we extrapolate the trend, in a few generations the Pentium would look like this:

Of course, we know this did not happen.

Q: What happened instead? Why?

Page 16:

Evolution of Multi-core CPUs

Chip area breakdown: Penryn, Bloomfield, Gulftown, Beckton

Q: What can you observe? Why?

Page 17:

Let's Take a Closer Look

Less than 10% of the total chip area is used for actual execution.

Q: Why?

Page 18:

The Memory Hierarchy

Notes on energy at 45 nm: a 64-bit integer ADD takes about 1 pJ; a 64-bit FP FMA takes about 200 pJ.

It seems we cannot increase the computational density much further: at 200 pJ per FMA, sustaining 10^12 FMAs per second would already dissipate 200 W.

Page 19:

The Brick Wall -- UC Berkeley's View

• Power Wall: power expensive, transistors free
• Memory Wall: memory slow, multiplies fast
• ILP Wall: diminishing returns on more ILP HW

David Patterson, "Computer Architecture is Back: The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007 (link)

Page 20:

The Brick Wall -- UC Berkeley's View

• Power Wall: power expensive, transistors free
• Memory Wall: memory slow, multiplies fast
• ILP Wall: diminishing returns on more ILP HW

Power Wall + Memory Wall + ILP Wall = Brick Wall

David Patterson, "Computer Architecture is Back: The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007 (link)

Page 21:

How to Break the Brick Wall?

Hint: how to exploit the parallelism inside the application?

Page 22:

Step 1: Trade Latency for Throughput

Hide the memory latency through fine-grained interleaved threading.

Page 23:

Interleaved Multi-threading

Page 24:

Interleaved Multi-threading

The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Page 25:

Interleaved Multi-threading

The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Fine-grained interleaved multi-threading:
Pros: ?
Cons: ?

Page 26:

Interleaved Multi-threading

The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Fine-grained interleaved multi-threading:
Pros: remove branch predictor, OOO scheduler, large cache
Cons: register pressure, etc.

Page 27:

Fine-Grained Interleaved Threading

Pros: reduced cache size, no branch predictor, no OOO scheduler

Cons: register pressure, thread scheduling overhead, requires massive parallelism

Without and with fine-grained interleaved threading

Page 28:

HW Support
The register file supports zero-overhead context switches between interleaved threads.

Page 29:

Can We Make Further Improvements?

Removing the large cache gives ~2x computational density.

Q: Can we make further improvements?

Hint: we have only exploited thread-level parallelism (TLP) so far.

Page 30:

Step 2: Single Instruction Multiple Data

SSE has 4 data lanes; a GPU has 8/16/24/... data lanes.

GPUs use wide SIMD: 8/16/24/... processing elements (PEs).
CPUs use short SIMD: usually a vector width of 4.

Page 31:

Hardware Support
Supporting interleaved threading + SIMD execution

Page 32:

Single Instruction Multiple Thread (SIMT)

Hide the vector width using scalar threads.

Page 33:

Example of SIMT Execution
Assume 32 threads are grouped into one warp.

Page 34:

Step 3: Simple Core

The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core.

Lightweight PE: Fused Multiply-Add (FMA)

SFU: Special Function Unit

Page 35:

NVIDIA's Motivation for Simple Cores

"This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train."

--Bill Dally, NVIDIA

Page 36:

Review: How Did We Get Here?
NVIDIA Fermi, 512 Processing Elements (PEs)

Page 37:

Throughput-Oriented Architectures

1. Fine-grained interleaved threading (~2x comp density)
2. SIMD/SIMT (>10x comp density)
3. Simple cores (~2x comp density)

These are the key architectural features of a throughput-oriented processor.

ref: Michael Garland and David B. Kirk, "Understanding Throughput-Oriented Architectures", CACM 2010 (link)

Page 38:

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 39:

CUDA Programming
A massive number (>10,000) of lightweight threads.

Page 40:

Express Data Parallelism in Threads

Compare a thread program with a vector program.

Page 41:

Vector Program

Scalar program:

float A[4][8];
do-all(i=0;i<4;i++){
    do-all(j=0;j<8;j++){
        A[i][j]++;
    }
}

Vector program (vector width of 8):

float A[4][8];
do-all(i=0;i<4;i++){
    movups xmm0, [ &A[i][0] ]
    incps xmm0
    movups [ &A[i][0] ], xmm0
}

Vector width is exposed to programmers.
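For comparison, a minimal C sketch of the same vector program using SSE intrinsics. Two caveats: incps is a pseudo-instruction on the slide (SSE has no packed increment, so a vector of ones is added instead), and an xmm register holds only 4 floats, so each 8-wide row takes two steps:

#include <xmmintrin.h>  // SSE intrinsics

void inc_rows(float A[4][8]) {
    const __m128 ones = _mm_set1_ps(1.0f);      // {1, 1, 1, 1}
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 8; j += 4) {        // two 4-lane chunks per row
            __m128 v = _mm_loadu_ps(&A[i][j]);  // movups: load 4 floats
            v = _mm_add_ps(v, ones);            // the slide's "incps"
            _mm_storeu_ps(&A[i][j], v);         // movups: store 4 floats
        }
    }
}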

Page 42:

CUDA Program

Scalar program:

float A[4][8];
do-all(i=0;i<4;i++){
    do-all(j=0;j<8;j++){
        A[i][j]++;
    }
}

CUDA program

float A[4][8];

kernelF<<<(4,1),(8,1)>>>(A);

__device__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
}

• CUDA program expresses data level parallelism (DLP) in terms of thread level parallelism (TLP).

• Hardware converts TLP into DLP at run time.
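The code above is pseudo-CUDA: types are omitted, and a kernel launched from the host must be declared __global__ rather than __device__. A minimal compilable sketch, including the device-memory management the slides leave out:

#include <cuda_runtime.h>

__global__ void kernelF(float *A) {
    int i = blockIdx.x;         // one thread block per row
    int j = threadIdx.x;        // one thread per element
    A[i * 8 + j] += 1.0f;       // row-major A[4][8]
}

int main() {
    float h_A[4][8] = {{0}};
    float *d_A;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    kernelF<<<4, 8>>>(d_A);     // 4 blocks of 8 threads
    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    return 0;
}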

Page 43:

Two Levels of Thread Hierarchy

kernelF<<<(4,1),(8,1)>>>(A);

__device__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
}

Page 44:

Multi-dimensional Thread and Block ID

kernelF<<<(2,2),(4,2)>>>(A);

__device__ kernelF(A){
    i = gridDim.x * blockIdx.y
      + blockIdx.x;     // linear block index
    j = blockDim.x * threadIdx.y
      + threadIdx.x;    // linear thread index within the block
    A[i][j]++;
}

Both the grid and the thread block can have a two-dimensional index.
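In real CUDA the (2,2) and (4,2) configurations are passed as dim3 values. A small fragment, reusing kernelF and the device pointer d_A from the compilable sketch earlier:

dim3 grid(2, 2);    // 2x2 = 4 thread blocks
dim3 block(4, 2);   // 4x2 = 8 threads per block
kernelF<<<grid, block>>>(d_A);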

Page 45:

Scheduling Thread Blocks on SMs
Example: scheduling 4 thread blocks on 3 SMs.

Page 46:

Executing a Thread Block on an SM

Executed on a machine with SIMD width 4:

Executed on a machine with SIMD width 8:

Note: the number of Processing Elements (PEs) is transparent to the programmer.

kernelF<<<(2,2),(4,2)>>>(A);

__device__ kernelF(A){
    i = gridDim.x * blockIdx.y
      + blockIdx.x;
    j = blockDim.x * threadIdx.y
      + threadIdx.x;
    A[i][j]++;
}

Page 47:

Multiple Levels of Memory Hierarchy

Name     | Cached? | Latency (cycles)        | Access
---------+---------+-------------------------+----------
Global   | L1/L2   | 200~400 (on cache miss) | R/W
Shared   | No      | 1~3                     | R/W
Constant | Yes     | 1~3                     | Read-only
Texture  | Yes     | ~100                    | Read-only
Local    | L1/L2   | 200~400 (on cache miss) | R/W
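As a rough illustration of the table, here is how four of these spaces are declared in CUDA C. The names (coeff, table, tile) are illustrative; texture memory, which is accessed through texture references/objects, is omitted:

__constant__ float coeff[64];    // constant memory: cached, read-only in kernels
__device__   float table[1024];  // global memory: off-chip DRAM

__global__ void kernel(float *in, float *out) {
    __shared__ float tile[256];  // shared memory: on-chip, per thread block
    float tmp = in[threadIdx.x]; // local variables: registers, or local
                                 // memory (DRAM) if spilled
    tile[threadIdx.x] = tmp * coeff[0] + table[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}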

Page 48:

Explicit Management of Shared Mem
Shared memory is frequently used to exploit locality.

Page 49:

Shared Memory and Synchronization

kernelF<<<(1,1),(16,16)>>>(A);

__device__ kernelF(A){
    __shared__ float smem[16][16]; // allocate smem
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1]
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

Example: average filter with 3x3 window

3x3 window on image

Image data in DRAM

Page 50:

Shared Memory and Synchronization

kernelF<<<(1,1),(16,16)>>>(A);

__device__ kernelF(A){
    __shared__ float smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j]; // load to smem
    __syncthreads(); // threads wait at the barrier
    A[i][j] = ( smem[i-1][j-1]
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

Example: average filter over 3x3 window

3x3 window on image

Stage data in shared mem

Page 51:

Shared Memory and Synchronization

kernelF<<<(1,1),(16,16)>>>(A);

__device__ kernelF(A){
    __shared__ float smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads(); // every thread is ready
    A[i][j] = ( smem[i-1][j-1]
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

Example: average filter over 3x3 window

3x3 window on image

all threads finish the load

Page 52:

Shared Memory and Synchronization

kernelF<<<(1,1),(16,16)>>>(A);

__device__ kernelF(A){
    __shared__ float smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1]
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

Example: average filter over 3x3 window

3x3 window on image

Start computation
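Putting the four steps together: a compilable sketch of the whole filter. It spells out details the walkthrough glosses over; in particular, border threads are skipped here so that smem[i-1][j-1] never indexes outside the 16x16 tile:

__global__ void kernelF(float *A) {
    __shared__ float smem[16][16];
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i * 16 + j];   // stage the tile in shared memory
    __syncthreads();              // wait until all 256 loads are done
    if (i > 0 && i < 15 && j > 0 && j < 15) {  // skip the tile border
        float sum = 0.0f;
        for (int di = -1; di <= 1; di++)
            for (int dj = -1; dj <= 1; dj++)
                sum += smem[i + di][j + dj];
        A[i * 16 + j] = sum / 9.0f;  // each thread overwrites only its own
                                     // element; all reads come from smem
    }
}
// launch: kernelF<<<dim3(1,1), dim3(16,16)>>>(d_A);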

Page 53:

Programmers Think in Threads

Q: Why go through all this hassle?

Page 54:

Why Use Threads instead of Vectors?

Thread pros:
• Portability. The machine width is transparent in the ISA.
• Productivity. Programmers do not need to care about the vector width of the machine.

Thread cons:
• Manual synchronization. Gives up lock-step execution within a vector.
• Thread scheduling can be inefficient.
• Debugging. "Threads considered harmful": thread programs are notoriously hard to debug.

Page 55:

Features of CUDA

• Programmers explicitly express DLP in terms of TLP.
• Programmers explicitly manage the memory hierarchy.
• etc.

Page 56:

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 57:

Micro-architecture
GF100 micro-architecture

Page 58:

HW Groups Threads Into Warps
Example: 32 threads per warp.
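A small sketch of how a thread can locate itself within this grouping. warpSize is a CUDA built-in variable (32 here); the output arrays are illustrative:

__global__ void warp_position(int *warp_id, int *lane_id) {
    int linear = threadIdx.y * blockDim.x + threadIdx.x;  // linear id in block
    warp_id[linear] = linear / warpSize;  // which warp of the block
    lane_id[linear] = linear % warpSize;  // lane within the warp
}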

Page 59:

Example of Implementation
Note: NVIDIA may use a more complicated implementation.

Page 60:

Example

Program:
Address  Inst
0x0004:  add r0, r1, r2
0x0008:  sub r3, r4, r5

Assume warp 0 and warp 1 are scheduled for execution.

Page 61:

Read Src Op

Program:
Address  Inst
0x0004:  add r0, r1, r2
0x0008:  sub r3, r4, r5

Read source operands:
r1 for warp 0
r4 for warp 1

Page 62:

Buffer Src Op

Program:
Address  Inst
0x0004:  add r0, r1, r2
0x0008:  sub r3, r4, r5

Push ops to op collector:
r1 for warp 0
r4 for warp 1

Page 63:

Read Src Op

Program:
Address  Inst
0x0004:  add r0, r1, r2
0x0008:  sub r3, r4, r5

Read source operands:
r2 for warp 0
r5 for warp 1

Page 64:

Buffer Src Op

Program:
Address  Inst
0x0004:  add r0, r1, r2
0x0008:  sub r3, r4, r5

Push ops to op collector:
r2 for warp 0
r5 for warp 1

Page 65:

Execute

Program:
Address  Inst
0x0004:  add r0, r1, r2
0x0008:  sub r3, r4, r5

Compute the first 16 threads in the warp.

Page 66:

Execute

Program:
Address  Inst
0x0004:  add r0, r1, r2
0x0008:  sub r3, r4, r5

Compute the last 16 threads in the warp.

Page 67:

Write Back

Program:
Address  Inst
0x0004:  add r0, r1, r2
0x0008:  sub r3, r4, r5

Write back:
r0 for warp 0
r3 for warp 1

Page 68:

Other High-Performance GPUs

• ATI Radeon 5000 series.

Page 69:

ATI Radeon 5000 Series Architecture

Page 70:

Radeon SIMD Engine

• 16 Stream Cores (SC)
• Local Data Share

Page 71:

VLIW Stream Core (SC)

Page 72:

Local Data Share (LDS)

Page 73:

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 74:

Performance Optimization

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase the floating-point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity

Page 75:

Performance Optimization

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase the floating-point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity

Page 76:

Shared Mem Contains Multiple Banks

Page 77:

Compute Capability
Architecture information is needed to perform these optimizations.

ref: NVIDIA, "CUDA C Programming Guide" (link)

Page 78:

Shared Memory (compute capability 2.x)

without bank conflict:

with bank conflict:
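A hedged illustration of both cases, assuming the compute-capability-2.x layout (32 banks of consecutive 4-byte words) and one 32-thread warp per block:

__global__ void bank_demo(float *out) {
    __shared__ float smem[32][32];
    int tid = threadIdx.x;               // tid = 0..31
    smem[tid][0] = (float)tid;
    smem[0][tid] = (float)tid;
    __syncthreads();

    float a = smem[0][tid];  // no conflict: thread t hits bank t
    float b = smem[tid][0];  // 32-way conflict: word index is tid*32,
                             // so every thread hits bank 0
    out[tid] = a + b;
}
// A common fix for the column access is row padding:
// __shared__ float smem[32][33]; puts column elements in different banks.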

Page 79:

Performance Optimization

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory alignment and coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase the floating-point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity

Page 80:

Global Memory in Off-Chip DRAM
The address space is interleaved among multiple channels.
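To make the coalescing point concrete, a minimal sketch contrasting the two access patterns (copy_strided and its stride parameter are illustrative):

__global__ void copy_coalesced(float *in, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];             // a warp touches one contiguous segment
}

__global__ void copy_strided(float *in, float *out, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * stride];    // with stride 32, each thread touches a
                                    // different segment: many more transactions
}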

Page 81:

Global Memory

Page 82:

Global Memory

Page 83:

Global Memory

Page 84:

Roofline Model
Identify the performance bottleneck: computation-bound vs. bandwidth-bound.
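The roofline bound itself is a one-line model; a minimal C sketch (the peak numbers are inputs, not measurements):

// Attainable GFLOP/s = min(peak compute,
//                          operational intensity * peak bandwidth)
float attainable_gflops(float peak_gflops,
                        float intensity_flop_per_byte,
                        float bandwidth_gb_per_s) {
    float memory_bound = intensity_flop_per_byte * bandwidth_gb_per_s;
    return peak_gflops < memory_bound ? peak_gflops : memory_bound;
}

For example, with roughly 1.3 TFLOP/s single-precision peak and ~177 GB/s memory bandwidth (approximate GTX480 figures), the ridge point sits near 7.5 FLOP/byte; kernels with lower operational intensity are bandwidth-bound.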

Page 85:

Optimization Is Key for Attainable GFLOP/s

Page 86:

Computation, Bandwidth, Latency
Illustrating the three bottlenecks in the Roofline model.

Page 87:

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 88:

Trends

Coming architectures:
• Intel's Larrabee successor: Many Integrated Core (MIC)
• CPU/GPU fusion: Intel Sandy Bridge, AMD Llano

Page 89:

Intel Many Integrated Core (MIC)
A 32-core version of MIC:

Page 90:

Intel Sandy Bridge

Highlights:
• Reconfigurable shared L3 for CPU and GPU
• Ring bus

Page 91:

Sandy Bridge's New CPU-GPU Interface

ref: "Intel's Sandy Bridge Architecture Exposed", AnandTech (link)

Page 92:

Sandy Bridge's New CPU-GPU Interface

ref: "Intel's Sandy Bridge Architecture Exposed", AnandTech (link)

Page 93:

AMD Llano Fusion APU (expected Q3 2011)

Notes:
• CPU and GPU do not share a cache?
• Unknown interface between CPU and GPU

Page 94:

GPU Research in ES Group

GPU research in the Electronic Systems group:
http://www.es.ele.tue.nl/~gpuattue/