Graphics Processing Unit (GPU) Architecture and Programming TU/e 5kk73 Zhenyu Ye Bart Mesman Henk Corporaal 2010-11-08


Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

System Architecture

GPU Architecture

NVIDIA Fermi, 512 Processing Elements (PEs)

What Can It Do?

Render triangles.

NVIDIA GTX480 can render 1.6 billion triangles per second!

General-Purpose Computing

ref: http://www.nvidia.com/object/tesla_computing_solutions.html

The Vision of NVIDIA

"Within the next few years, there will be single-chip graphics devices more powerful and versatile than any graphics system that has ever been built, at any price."

-- David Kirk, NVIDIA, 1998

ref: http://www.llnl.gov/str/JanFeb05/Seager.html

Single-Chip GPU vs. Fastest Supercomputers

Top500 Supercomputers in June 2010

GPU Will Top the List in Nov 2010

The Gap Between CPU and GPU

ref: Tesla GPU Computing Brochure

GPU Has 10x Computational Density

Given the same chip area, the achievable performance of a GPU is about 10x that of a CPU.

Evolution of Intel Pentium

[Figure: chip area breakdown of Pentium I, Pentium II, Pentium III, and Pentium IV]

Q: What can you observe? Why?

Extrapolation of Single Core CPU

If we extrapolate the trend, in a few generations, Pentium will look like:

Of course, we know it did not happen.

Q: What happened instead? Why?

Evolution of Multi-core CPUs

[Figure: chip area breakdown of Penryn, Bloomfield, Gulftown, and Beckton]

Q: What can you observe? Why?

Let's Take a Closer Look

Less than 10% of the total chip area is used for actual execution.

Q: Why?

The Memory Hierarchy

Notes on Energy at 45nm:
• 64-bit Int ADD takes about 1 pJ.
• 64-bit FP FMA takes about 200 pJ.

It seems we cannot further increase the computational density.

The Brick Wall -- UC Berkeley's View

Power Wall: power expensive, transistors free
Memory Wall: memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW

David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007


Power Wall + Memory Wall + ILP Wall = Brick Wall


How to Break the Brick Wall?

Hint: how to exploit the parallelism inside the application?

Step 1: Trade Latency for Throughput

Hide the memory latency through fine-grained interleaved threading.

Interleaved Multi-threading


The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency



Fine-grained interleaved multi-threading:
Pros: ?
Cons: ?

Fine-grained interleaved multi-threading:
Pros: removes the branch predictor, the OOO scheduler, and the large cache
Cons: register pressure, etc.

Fine-Grained Interleaved Threading

Pros: reduce cache size, no branch predictor, no OOO scheduler

Cons: register pressure, thread scheduler, requires huge parallelism

Without and with fine-grained interleaved threading

HW Support

The register file supports zero-overhead context switches between interleaved threads.

Can We Make Further Improvement?

Shrinking the large cache gives roughly 2x computational density.

Q: Can we make further improvements?

Hint: We have only utilized thread level parallelism (TLP) so far.

Step 2: Single Instruction Multiple Data

[Figure: SSE has 4 data lanes; GPU has 8/16/24/... data lanes]

GPU uses wide SIMD: 8/16/24/... processing elements (PEs).
CPU uses short SIMD: usually a vector width of 4.

Hardware Support

Supporting interleaved threading + SIMD execution

Single Instruction Multiple Thread (SIMT)

Hide vector width using scalar threads.

Example of SIMT Execution

Assume 32 threads are grouped into one warp.

Step 3: Simple Core

The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core.

Lightweight PE: Fused Multiply Add (FMA)

SFU: Special Function Unit

NVIDIA's Motivation of Simple Core

"This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train."

--Bill Dally, NVIDIA

Review: How Did We Get Here?

NVIDIA Fermi, 512 Processing Elements (PEs)

Throughput Oriented Architectures

1. Fine-grained interleaved threading (~2x comp density)
2. SIMD/SIMT (>10x comp density)
3. Simple core (~2x comp density)

Key architectural features of throughput oriented processor.

ref: Michael Garland, David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010.

CUDA Programming

Massive number (>10000) of lightweight threads.

Express Data Parallelism in Threads

Compare thread program with vector program.

Vector Program

Scalar program

    float A[4][8];
    do-all (i = 0; i < 4; i++) {
        do-all (j = 0; j < 8; j++) {
            A[i][j]++;
        }
    }

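Vector program (vector width of 8)

    float A[4][8];

    do-all (i = 0; i < 4; i++) {
        vmovups ymm0, [ &A[i][0] ]       ; load 8 floats (a 256-bit ymm register holds 8; xmm holds only 4)
        vaddps  ymm0, ymm0, [ ones ]     ; add 1.0 to each lane; `ones` is an assumed vector of eight 1.0f
        vmovups [ &A[i][0] ], ymm0       ; store 8 floats
    }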

Vector width is exposed to programmers.

CUDA Program

Scalar program

    float A[4][8];
    do-all (i = 0; i < 4; i++) {
        do-all (j = 0; j < 8; j++) {
            A[i][j]++;
        }
    }

CUDA program

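    float A[4][8];                           // for a real launch, A must refer to device memory

    kernelF<<<dim3(4,1), dim3(8,1)>>>(A);    // 4 blocks of 8 threads each

    __global__ void kernelF(float A[4][8]) {
        int i = blockIdx.x;                  // block index selects the row
        int j = threadIdx.x;                 // thread index selects the column
        A[i][j]++;
    }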

• CUDA program expresses data level parallelism (DLP) in terms of thread level parallelism (TLP).

• Hardware converts TLP into DLP at run time.
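The CUDA program above omits the host-side plumbing. A minimal complete sketch might look like the following, assuming the array is staged through `cudaMalloc` and `cudaMemcpy`; the helper name `incrementOnGpu` is hypothetical and not part of the slides.

```cuda
#include <cuda_runtime.h>

// One block per row (i), one thread per column (j), as in the slide.
__global__ void kernelF(float (*A)[8]) {
    int i = blockIdx.x;
    int j = threadIdx.x;
    A[i][j] += 1.0f;
}

// Hypothetical host helper: copies a 4x8 array to the device, launches
// kernelF with 4 blocks of 8 threads, and copies the result back.
// Returns false if any CUDA call fails.
bool incrementOnGpu(float hostA[4][8]) {
    float (*devA)[8] = nullptr;
    const size_t bytes = 4 * 8 * sizeof(float);
    if (cudaMalloc((void**)&devA, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        kernelF<<<dim3(4, 1), dim3(8, 1)>>>(devA);
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hostA, devA, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(devA);
    return ok;
}
```

Calling `incrementOnGpu(a)` from `main` on a host array `float a[4][8]` should leave every element one larger than before. Note that the kernel never mentions a vector width: the 32 scalar threads are grouped by the hardware.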

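Two Levels of Thread Hierarchy

    kernelF<<<dim3(4,1), dim3(8,1)>>>(A);    // grid of 4 thread blocks, 8 threads per block

    __global__ void kernelF(float A[4][8]) {
        int i = blockIdx.x;                  // first level: block within the grid
        int j = threadIdx.x;                 // second level: thread within the block
        A[i][j]++;
    }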

Multi-dimension Thread and Block ID

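    kernelF<<<dim3(2,2), dim3(4,2)>>>(A);    // 2x2 blocks, 4x2 threads per block

    __global__ void kernelF(float A[4][8]) {
        int i = gridDim.x * blockIdx.y + blockIdx.x;       // linear block ID: 0..3
        int j = blockDim.x * threadIdx.y + threadIdx.x;    // linear thread ID within the block: 0..7
        A[i][j]++;
    }

Both the grid and the thread block can have two-dimensional indices.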
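To check that a two-dimensional launch covers every array element exactly once, one can zero the array, let each thread increment its element, and inspect the result. The sketch below does this for the 2x2 grid of 4x2 blocks; the helper name `countVisits` is hypothetical, and the linearization uses the CUDA built-ins `gridDim` and `blockDim`.

```cuda
#include <cuda_runtime.h>

// Grid of 2x2 blocks, each block of 4x2 threads.
// Every (i, j) pair should be produced by exactly one thread.
__global__ void kernelF(float (*A)[8]) {
    int i = gridDim.x * blockIdx.y + blockIdx.x;       // linear block ID, 0..3
    int j = blockDim.x * threadIdx.y + threadIdx.x;    // linear thread ID in block, 0..7
    A[i][j] += 1.0f;
}

// Hypothetical host helper: zeroes a 4x8 device array, launches the kernel,
// and copies the visit counts back to the host.
// Returns false if any CUDA call fails.
bool countVisits(float hostA[4][8]) {
    float (*devA)[8] = nullptr;
    const size_t bytes = 4 * 8 * sizeof(float);
    if (cudaMalloc((void**)&devA, bytes) != cudaSuccess) return false;
    bool ok = cudaMemset(devA, 0, bytes) == cudaSuccess;   // all-zero bytes are 0.0f
    if (ok) {
        kernelF<<<dim3(2, 2), dim3(4, 2)>>>(devA);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hostA, devA, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(devA);
    return ok;
}
```

If the index arithmetic is right, every element of `hostA` ends up as 1.0f; an element left at 0 or raised to 2 would indicate a gap or an overlap in the mapping.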

Scheduling Thread Blocks on SM

Example: Scheduling 4 thread blocks on 3 SMs.

Executing Thread Block on SM

Executed on machine with width of 4:

Executed on machine with width of 8:

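Note: the number of Processing Elements (PEs) is transparent to the programmer.

    kernelF<<<dim3(2,2), dim3(4,2)>>>(A);    // 2x2 blocks, 4x2 threads per block

    __global__ void kernelF(float A[4][8]) {
        int i = gridDim.x * blockIdx.y + blockIdx.x;       // linear block ID: 0..3
        int j = blockDim.x * threadIdx.y + threadIdx.x;    // linear thread ID within the block: 0..7
        A[i][j]++;
    }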

Multiple Levels of Memory Hierarchy

Name      Cache?   Latency (cycles)        Read-only?
Global    L1/L2    200~400 (cache miss)    R/W
Shared    No       1~3                     R/W
Constant  Yes      1~3                     Read-only
Texture   Yes      ~100                    Read-only
Local     L1/L2    200~400 (cache miss)    R/W

Explicit Management of Shared Mem

Shared memory is frequently used to exploit locality.

Shared Memory and Synchronization

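    kernelF<<<dim3(1,1), dim3(16,16)>>>(A);

    __global__ void kernelF(float A[16][16]) {
        __shared__ float smem[16][16];            // allocate smem
        int i = threadIdx.y;
        int j = threadIdx.x;
        smem[i][j] = A[i][j];
        __syncthreads();
        if (i > 0 && i < 15 && j > 0 && j < 15)   // added guard: a border pixel's window would leave the tile
            A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9;
    }

Example: average filter with a 3x3 window

[Figure: 3x3 window on the image; image data in DRAM]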


Example: average filter over a 3x3 window, step by step (same kernel as above):

1. smem[i][j] = A[i][j];  Load to smem: each thread stages its pixel in shared memory.
2. __syncthreads();  Threads wait at the barrier until every thread is ready, i.e. all threads have finished the load.
3. A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9;  Start the computation on the 3x3 window, reading from shared memory.
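The three stages above (load to shared memory, barrier, compute) can be put together in a complete sketch. The slide kernel does not say what happens at the image border, so this version assumes border pixels are simply left unchanged; the names `averageFilter`, `averageFilterOnGpu`, and `TILE` are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 16;   // one 16x16 thread block covers the whole image, as in the slides

__global__ void averageFilter(float (*A)[TILE]) {
    __shared__ float smem[TILE][TILE];
    int i = threadIdx.y;
    int j = threadIdx.x;

    smem[i][j] = A[i][j];   // stage the pixel in shared memory
    __syncthreads();        // wait until every thread has finished its load

    // Assumption: border pixels keep their value, because their 3x3 window
    // would reach outside the tile.
    if (i > 0 && i < TILE - 1 && j > 0 && j < TILE - 1) {
        float sum = 0.0f;
        for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj)
                sum += smem[i + di][j + dj];
        A[i][j] = sum / 9.0f;
    }
}

// Hypothetical host helper: filters a TILE x TILE image in place.
// Returns false if any CUDA call fails.
bool averageFilterOnGpu(float hostA[TILE][TILE]) {
    float (*devA)[TILE] = nullptr;
    const size_t bytes = TILE * TILE * sizeof(float);
    if (cudaMalloc((void**)&devA, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        averageFilter<<<dim3(1, 1), dim3(TILE, TILE)>>>(devA);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hostA, devA, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(devA);
    return ok;
}
```

Writing the result back into `A` is safe here because every read after the barrier goes to `smem`, not to `A`. Without the barrier, a thread could read a neighbor's slot in `smem` before that neighbor has written it.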

Programmers Think in Threads

Q: Why go through this hassle?

Why Use Thread instead of Vector?

Thread Pros:
• Portability. Machine width is transparent in the ISA.
• Productivity. Programmers do not need to take care of the vector width of the machine.

Thread Cons:
• Manual sync. Gives up lock-step execution within a vector.
• Scheduling of threads could be inefficient.
• Debugging. "Threads considered harmful". Thread programs are notoriously hard to debug.

Features of CUDA

• Programmers explicitly express DLP in terms of TLP.
• Programmers explicitly manage the memory hierarchy.
• etc.

Micro-architecture

GF100 micro-architecture

HW Groups Threads Into Warps

Example: 32 threads per warp
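For a one-dimensional thread block, the warp a thread belongs to and its position (lane) inside that warp follow from its thread index, since warps are formed from consecutive thread IDs. The sketch below records both for each thread; the names `warpLane` and `warpLayout` are hypothetical, while `warpSize` is a CUDA built-in variable.

```cuda
#include <cuda_runtime.h>

// For each thread of a 1D block, record its warp and its lane within the warp.
__global__ void warpLane(int *warpId, int *laneId) {
    int t = threadIdx.x;
    warpId[t] = t / warpSize;   // threads 0..31 form warp 0, 32..63 form warp 1, ...
    laneId[t] = t % warpSize;   // position of the thread inside its warp
}

// Hypothetical host helper: launches one block of nThreads threads and copies
// the per-thread warp and lane numbers back to the host arrays.
// Returns false on invalid input or if any CUDA call fails.
bool warpLayout(int *hostWarp, int *hostLane, int nThreads) {
    if (nThreads <= 0 || nThreads > 1024) return false;   // assumed block-size limit
    int *dWarp = nullptr, *dLane = nullptr;
    const size_t bytes = nThreads * sizeof(int);
    bool ok = cudaMalloc((void**)&dWarp, bytes) == cudaSuccess &&
              cudaMalloc((void**)&dLane, bytes) == cudaSuccess;
    if (ok) {
        warpLane<<<1, nThreads>>>(dWarp, dLane);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hostWarp, dWarp, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(hostLane, dLane, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dWarp);   // cudaFree on a null pointer is a no-op
    cudaFree(dLane);
    return ok;
}
```

With 64 threads and 32 threads per warp, thread 33 reports warp 1 and lane 1.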

Example of Implementation

Note: NVIDIA may use a more complicated implementation.

Example

    Program Address: Inst
    0x0004: add r0, r1, r2
    0x0008: sub r3, r4, r5

Assume warp 0 and warp 1 are scheduled for execution.

Read Src Op: read source operands r1 for warp 0 and r4 for warp 1.

Buffer Src Op: push ops to the op collector: r1 for warp 0, r4 for warp 1.

Read Src Op: read source operands r2 for warp 0 and r5 for warp 1.

Buffer Src Op: push ops to the op collector: r2 for warp 0, r5 for warp 1.

Execute: compute the first 16 threads in the warp.

Execute: compute the last 16 threads in the warp.

Write back: write back r0 for warp 0 and r3 for warp 1.

Other High-Performance GPUs

• ATI Radeon 5000 series.

ATI Radeon 5000 Series Architecture

Radeon SIMD Engine

• 16 Stream Cores (SC)
• Local Data Share

VLIW Stream Core (SC)

Local Data Share (LDS)

Performance Optimization

Optimizations on memory latency tolerance
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth
• Global memory coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency
• Mul/Add balancing
• Increase floating point proportion

Optimizations on operational intensity
• Use tiled algorithms
• Tune thread granularity
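As one illustration of the "tiled algorithm" item, consider matrix multiplication, an example chosen here and not taken from the slides. Each thread block stages a tile of each input matrix in shared memory and reuses it, so every input element is fetched from global memory n/TILE times instead of n times. The sketch assumes square row-major matrices whose size is a multiple of the tile width; the names `matmulTiled`, `matmulOnGpu`, and `TILE` are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 16;

// C = A * B for square n x n row-major matrices.
// Assumption: n is a multiple of TILE, which keeps the sketch free of edge handling.
__global__ void matmulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                 // tiles fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done reading before the next load overwrites
    }
    C[row * n + col] = acc;
}

// Hypothetical host helper. Returns false on invalid size or CUDA failure.
bool matmulOnGpu(const float *A, const float *B, float *C, int n) {
    if (n <= 0 || n % TILE != 0) return false;
    const size_t bytes = (size_t)n * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc((void**)&dA, bytes) == cudaSuccess &&
              cudaMalloc((void**)&dB, bytes) == cudaSuccess &&
              cudaMalloc((void**)&dC, bytes) == cudaSuccess &&
              cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
        matmulTiled<<<grid, block>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);   // cudaFree on a null pointer is a no-op
    return ok;
}
```

The tile width is also one way to tune thread granularity: a larger tile raises operational intensity but needs more shared memory and more threads per block.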





Shared Mem Contains Multiple Banks

Compute Capability

Need arch info to perform optimization.

ref: NVIDIA, "CUDA C Programming Guide"

Shared Memory (compute capability 2.x)

without bank conflict:

with bank conflict:
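A common way to avoid bank conflicts is to pad a shared-memory tile by one column. The sketch below transposes a 32x32 tile, assuming the compute capability 2.x layout of 32 banks with successive 32-bit words in successive banks. In an unpadded `tile[32][32]`, a warp reading one column touches words 32 apart, which all fall in the same bank; with 33 words per row the same read hits 32 different banks. The names `transposeTile`, `transposeOnGpu`, and `DIM` are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int DIM = 32;

// Transposes one DIM x DIM row-major tile through shared memory.
__global__ void transposeTile(const float *in, float *out) {
    // The extra column shifts each row by one bank, so column reads spread
    // over all banks instead of hitting a single one.
    __shared__ float tile[DIM][DIM + 1];
    int x = threadIdx.x;
    int y = threadIdx.y;
    tile[y][x] = in[y * DIM + x];     // row-wise write: a warp touches consecutive words
    __syncthreads();
    out[y * DIM + x] = tile[x][y];    // column-wise read: a warp touches words DIM + 1 apart
}

// Hypothetical host helper: transposes a DIM x DIM matrix.
// Returns false if any CUDA call fails.
bool transposeOnGpu(const float *in, float *out) {
    const size_t bytes = DIM * DIM * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    bool ok = cudaMalloc((void**)&dIn, bytes) == cudaSuccess &&
              cudaMalloc((void**)&dOut, bytes) == cudaSuccess &&
              cudaMemcpy(dIn, in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        transposeTile<<<1, dim3(DIM, DIM)>>>(dIn, dOut);   // 1024 threads in one block
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);    // cudaFree on a null pointer is a no-op
    cudaFree(dOut);
    return ok;
}
```

The padded and unpadded versions produce the same output; only the number of shared-memory cycles differs, which a profiler or timing run would be needed to observe.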


Optimizations on memory bandwidth
• Global memory alignment and coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping



Global Memory In Off-Chip DRAM

Address space is interleaved among multiple channels.
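To make the coalescing item concrete, the sketch below copies an array with a configurable stride between the elements handled by neighboring threads. With a stride of 1, the threads of a warp touch one contiguous segment, which the hardware can serve with few memory transactions; with a larger stride the same warp touches many separate segments. The kernel and helper names are hypothetical, and the sketch only shows the access patterns; it does not time them.

```cuda
#include <cuda_runtime.h>

// Thread `tid` copies element (tid * stride) mod n.
// stride == 1: neighboring threads touch neighboring addresses (coalesced).
// stride  > 1: neighboring threads touch addresses `stride` elements apart.
__global__ void copyWithStride(const float *in, float *out, int n, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int k = (int)(((long long)tid * stride) % n);
        out[k] = in[k];
    }
}

// Hypothetical host helper. If stride and n share no common factor, every
// element is copied exactly once; otherwise uncopied elements stay 0.
// Returns false on invalid input or if any CUDA call fails.
bool copyOnGpu(const float *in, float *out, int n, int stride) {
    if (n <= 0 || stride <= 0) return false;
    const size_t bytes = (size_t)n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    bool ok = cudaMalloc((void**)&dIn, bytes) == cudaSuccess &&
              cudaMalloc((void**)&dOut, bytes) == cudaSuccess &&
              cudaMemcpy(dIn, in, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemset(dOut, 0, bytes) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        copyWithStride<<<blocks, threads>>>(dIn, dOut, n, stride);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);    // cudaFree on a null pointer is a no-op
    cudaFree(dOut);
    return ok;
}
```

For example, `copyOnGpu(in, out, 1024, 1)` and `copyOnGpu(in, out, 1024, 33)` yield the same output array, but the second spreads each warp's accesses across memory. Wrapping the launch in CUDA events would be one way to compare the two.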

Global Memory


Roofline Model

Identify the performance bottleneck: computation bound vs. bandwidth bound

Optimization Is Key for Attainable Gflops/s

Computation, Bandwidth, Latency

Illustrating three bottlenecks in the Roofline model.
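The computation and bandwidth bounds are usually written as a single formula. The symbols below are chosen here for illustration and do not appear in the slides: $P_{\text{peak}}$ is the peak compute throughput (GFLOP/s), $B_{\text{peak}}$ the peak memory bandwidth (GB/s), and $I$ the operational intensity of the kernel (FLOP per byte moved to or from DRAM).

```latex
\[
  P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; B_{\text{peak}} \times I\bigr)
\]
\[
  I^{*} = \frac{P_{\text{peak}}}{B_{\text{peak}}}
  \qquad\text{(ridge point: bandwidth bound for } I < I^{*}\text{, computation bound for } I > I^{*}\text{)}
\]
```

A kernel that falls below both bounds is limited by something else, such as insufficient parallelism to hide latency, which corresponds to the third bottleneck named above.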

Trends

Coming architectures:
• Intel's Larrabee successor: Many Integrated Core (MIC)
• CPU/GPU fusion: Intel Sandy Bridge, AMD Llano

Intel Many Integrated Core (MIC)

32 core version of MIC:

Intel Sandy Bridge

Highlights:
• Reconfigurable shared L3 for CPU and GPU
• Ring bus

Sandy Bridge's New CPU-GPU interface

ref: "Intel's Sandy Bridge Architecture Exposed", from Anandtech


AMD Llano Fusion APU (expected Q3 2011)

Notes:
• CPU and GPU are not sharing cache?
• Unknown interface between CPU/GPU

GPU Research in ES Group

GPU research in the Electronic Systems group.
http://www.es.ele.tue.nl/~gpuattue/
