Sponge: Portable Stream Programming on Graphics Engines
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science



Why GPUs?

Every mobile and desktop system will have one

Affordable and high performance

Over-provisioned

Programmable

Sony PlayStation Phone (image)

Higher FLOPs per Watt and higher FLOPs per dollar:
i7: 0.36 GFLOP/$ vs. GTX 285: 3.54 GFLOP/$
i7: 0.78 GFLOP/W vs. GTX 285: 5.2 GFLOP/W

GPU Architecture

[Figure: GPU architecture. A CPU launches kernels (Kernel 1, Kernel 2) over time onto streaming multiprocessors SM 0 through SM 29 connected by an interconnection network; each SM has registers and shared memory, and all SMs share the global (device) memory.]

GPU Programming Model

Threads → Blocks → Grid

All the threads run one kernel

Registers private to each thread

Registers spill to local memory

Shared memory shared between threads of a block

Global memory shared between all blocks

GPU Execution Model

[Figure: the thread blocks of a grid (Grid 1) are distributed across the streaming multiprocessors (SM 0, SM 1, ..., SM 30), each with its own registers and shared memory.]

[Figure: within one SM, several blocks (Block 0 to Block 3) share the registers and shared memory; each block's threads are grouped into warps (Warp 0, Warp 1, ...) indexed by thread ID.]
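To make the memory spaces and the thread/block/grid hierarchy above concrete, here is a minimal CUDA sketch (not from the paper; the kernel name, the 256-thread block size, and the data layout are assumptions for illustration):

// Hypothetical kernel touching all three memory spaces described above.
// Launched e.g. as: memory_spaces<<<numBlocks, 256>>>(d_in, d_out);
__global__ void memory_spaces(const float *in, float *out) {
    // Registers: private to each thread (they spill to local memory if over-used).
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    // Shared memory: visible to every thread of this block.
    __shared__ float tile[256];              // assumes blockDim.x <= 256
    tile[threadIdx.x] = in[tid];             // load from global memory
    __syncthreads();                         // make the tile visible block-wide

    // Combine this thread's element with a neighbor's element from shared memory.
    acc = tile[threadIdx.x] + tile[(threadIdx.x + 1) % blockDim.x];

    // Global memory: visible to all blocks of the grid (and to the host).
    out[tid] = acc;
}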

GPU Programming Challenges

[Figure: performance of code optimized for a GeForce GTX 285 vs. code optimized for a GeForce 8400 GS.]

Restructuring data efficiently for the complex memory hierarchy: global memory, shared memory, registers

Partitioning work between CPU and GPU

Lack of portability between different generations of GPUs: registers, active warps, size of global memory, size of shared memory

These will vary even more: newer high-performance cards (e.g., NVIDIA's Fermi) and mobile GPUs with fewer resources

Nonlinear Optimization Space

[Figure: SAD optimization space, 908 configurations (Ryoo, CGO '08).]

We need a higher level of abstraction!

Goals

Write-once parallel software

Free the programmer from low-level details

[Figure: one parallel specification mapped to many targets: (C + Pthreads) shared-memory processors, (C + intrinsics) SIMD engines, (Verilog/VHDL) FPGAs, (CUDA/OpenCL) GPUs.]

Streaming

Higher level of abstraction

Decoupling computation and memory accesses

Coarse-grained exposed parallelism, exposed communication

Programmers can focus on the algorithms instead of low-level details

Streaming actors use buffers to communicate

A lot of recent work on extending the portability of streaming applications

[Figure: an example stream graph: Actors 1 through 6 connected through a splitter and a joiner.]

Stream programming encourages a style of programming that expresses the parallelism inherent in a program by decoupling computation and memory accesses [11][12]. The explicit parallelism and locality of data in a stream program make it easier to compile efficiently using traditional compiler optimizations.

Sponge

Generating optimized CUDA for a wide variety of GPU targets

Perform an array of optimizations on stream graphs

Optimizing and porting to different GPU generations

Utilize memory hierarchy (registers, shared memory, coalescing)

Efficiently utilize streaming cores

[Figure: Sponge's optimizations: Reorganization and Classification, Memory Layout, Graph Restructuring, Register Optimization, Shared/Global Memory, Helper Threads, Bank Conflict Resolution, Loop Unrolling, Software Prefetching.]

GPU Performance Model

[Figure: in memory-bound kernels the memory instructions (M) dominate total time; in computation-bound kernels the computation instructions (C) dominate.]

Actor Classification

High Traffic actors (HiT):
Large number of memory accesses per actor
Fewer threads when using shared memory
Using shared memory underutilizes the processors

Low Traffic actors (LoT):
Smaller number of memory accesses per actor
More threads
Using shared memory increases the performance

Global Memory Accesses

[Figure: four threads, each running an A[4,4] actor, read from global memory; because every thread pops its own four consecutive elements, the words touched at any one step are four apart (0, 4, 8, 12, then 1, 5, 9, 13, and so on).]

Large access latency

Threads do not access the words in sequence

No coalescing

Notation: A[i,j] means actor A has i pops and j pushes

[Figure: the four threads first cooperatively copy the block's input from global memory into shared memory; during this copy consecutive threads read consecutive words (0, 1, 2, 3, then 4, 5, 6, 7, ...), so the global loads are coalesced.]

First bring the data into shared memory with coalescing
Each filter brings data for other filters
Satisfies coalescing constraints

After the data is in shared memory, each filter accesses its own portion of it.

Improve bandwidth and performance

[Figure: on the output side, results are copied from shared memory back to global memory (shared-to-global transfers).]

Using Shared Memory

Shared memory is 100x faster than global memory

Coalesce all global memory accesses

Number of threads is limited by size of the shared memory.

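As a minimal sketch of the staging scheme described above (illustrative code, not Sponge's actual output): assume a hypothetical A[4,4] actor, 64 actor instances (threads) per block, and float tokens. The threads cooperatively copy the block's input into shared memory with coalesced loads, each thread fetching words that other threads will consume, and only then does each thread read its own four pops:

#define POPS    4
#define PUSHES  4
#define THREADS 64                               // actor instances per block

__global__ void actor_shared(const float *in, float *out) {
    __shared__ float buf[THREADS * POPS];        // also reused for the pushes

    int ibase = blockIdx.x * THREADS * POPS;     // this block's input region
    int obase = blockIdx.x * THREADS * PUSHES;   // this block's output region

    // Coalesced staging: consecutive threads read consecutive global words.
    for (int i = threadIdx.x; i < THREADS * POPS; i += blockDim.x)
        buf[i] = in[ibase + i];
    __syncthreads();

    // Each thread now reads its own pops from fast shared memory.
    float r[PUSHES];
    for (int j = 0; j < PUSHES; ++j)
        r[j] = buf[threadIdx.x * POPS + j] * 2.0f;   // placeholder "work"

    // Write the pushes into shared memory, then store them coalesced.
    for (int j = 0; j < PUSHES; ++j)
        buf[threadIdx.x * PUSHES + j] = r[j];
    __syncthreads();
    for (int i = threadIdx.x; i < THREADS * PUSHES; i += blockDim.x)
        out[obase + i] = buf[i];
}

The number of threads per block is bounded by the shared-memory footprint (THREADS * POPS floats here), which is the limitation the helper-thread optimization below addresses.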

Helper Threads

Shared memory limits the number of threads.

Underutilized processors can fetch data.

All the helper threads are in one warp (no control-flow divergence).

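A minimal sketch of the helper-thread idea (illustrative only; the counts and names below are assumptions, not Sponge's generated code): the block is launched with one extra warp that only moves data between global and shared memory, so the branch separating helpers from workers never diverges within a warp:

#define POPS     4
#define WORKERS  64                    // threads that execute the actor's work
#define HELPERS  32                    // one extra warp that only moves data
#define NTHREADS (WORKERS + HELPERS)   // block size at launch

__global__ void actor_with_helpers(const float *in, float *out) {
    __shared__ float buf[WORKERS * POPS];
    int base = blockIdx.x * WORKERS * POPS;

    if (threadIdx.x >= WORKERS) {
        // Helper threads: they fill one whole warp, so this branch does not
        // diverge within a warp; they stage the block's input cooperatively.
        for (int i = threadIdx.x - WORKERS; i < WORKERS * POPS; i += HELPERS)
            buf[i] = in[base + i];
    }
    __syncthreads();

    if (threadIdx.x < WORKERS) {
        // Worker threads: run the actor's work on data already in shared memory.
        float acc = 0.0f;
        for (int j = 0; j < POPS; ++j)
            acc += buf[threadIdx.x * POPS + j];       // placeholder "work"
        out[blockIdx.x * WORKERS + threadIdx.x] = acc;
    }
}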

Data Prefetch

Better register utilization

Data for iteration i+1 is moved to registers

Data for iteration i is moved from register to shared memory

Allows the GPU to overlap instructions
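A minimal sketch of this prefetching pattern (illustrative; the names, the block size, and the one-word-per-thread-per-iteration layout are assumptions): the value for iteration i+1 is loaded into a register before the work for iteration i runs, so the load can overlap with computation:

__global__ void actor_prefetch(const float *in, float *out, int iters) {
    __shared__ float buf[256];                  // assumes blockDim.x <= 256
    int stride = gridDim.x * blockDim.x;        // elements consumed per iteration
    int lane   = blockIdx.x * blockDim.x + threadIdx.x;

    float next = in[lane];                      // prefetch data for iteration 0

    for (int i = 0; i < iters; ++i) {
        buf[threadIdx.x] = next;                // register -> shared memory
        __syncthreads();

        if (i + 1 < iters)                      // start the load for iteration i+1
            next = in[(i + 1) * stride + lane]; // before doing this iteration's work

        // Placeholder "work" for iteration i, reading from shared memory.
        float v = buf[threadIdx.x] + buf[(threadIdx.x + 1) % blockDim.x];
        out[i * stride + lane] = v;

        __syncthreads();                        // buf is overwritten next iteration
    }
}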

Loop Unrolling

Similar to traditional unrolling

Allows the GPU to overlap instructions

Better register utilization

Less loop control overhead

Can also be applied to memory transfer loops
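A minimal sketch of how this applies to the generated transfer loops (illustrative; POPS and the helper name are assumptions): because an actor's pop and push rates are compile-time constants, the loops have constant trip counts and can be fully unrolled:

#define POPS 4

// Copies one actor's pops; with a constant trip count, #pragma unroll lets the
// compiler replace the loop with POPS independent load/store pairs, removing
// loop-control overhead and exposing loads the hardware can overlap.
__device__ void copy_pops(float *dst, const float *src) {
    #pragma unroll
    for (int j = 0; j < POPS; ++j)
        dst[j] = src[j];
}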

Methodology

Set of benchmarks from the StreamIt suite
3 GHz Intel Core 2 Duo CPU with 6 GB RAM
NVIDIA GeForce GTX 285

GTX 285 configuration:
Stream processors: 240
Processor clock: 1476 MHz
Memory configuration: 2 GB DDR3
Memory bandwidth: 159.0 GB/s

Results (baseline: CPU)

[Figure: speedup of Sponge-generated code over the CPU baseline; chart axis reaches 1024.]

Results (baseline: GPU)

[Figure: breakdown of the improvement over the baseline GPU code, with segments labeled 64%, 3%, 16%, and 16%.]

Conclusion

Future systems will be heterogeneous

GPUs are important part of such systems

Programming complexity is a significant challenge

Sponge automatically creates optimized CUDA code for a wide variety of GPU targets

Provide portability by performing an array of optimizations on stream graphs

[Figure: in a heterogeneous system, sequential work runs on traditional processors while parallelizable work runs on specialized computing engines.]

Questions?

Spatial Intermediate Representation

StreamIt's main constructs:
Filter: encapsulates computation
Pipeline: expresses pipeline parallelism
Splitjoin: expresses task-level parallelism
Other constructs are not relevant here

Exposes different types of parallelism
Composable, hierarchical
Stateful and stateless filters

[Figure: a stream graph built from filters nested inside pipeline and splitjoin constructs.]

Nonlinear Optimization Space

[Figure: SAD optimization space, 908 configurations (Ryoo, CGO '08).]

Bank Conflict

[Figure: three threads, each running an A[8,8] actor, access shared memory with a stride of 8; their simultaneous accesses hit banks 0, 8, 0, then 1, 9, 1, then 2, 10, 2, so two threads collide on every access.]

data = buffer[BaseAddress + s * ThreadId]

Removing Bank Conflict

[Figure: with the stride padded to 9, the same three threads hit banks 0, 9, 2, then 1, 10, 3, then 2, 11, 4; no two threads share a bank, so there are no conflicts.]

data = buffer[BaseAddress + s * ThreadId]

If GCD(number of banks, s) is 1 there will be no bank conflict; since the number of banks is a power of two, s must be odd.
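A minimal sketch of that rule in CUDA (illustrative; the 8-pop actor, the 64-thread block, and the names are assumptions): padding the per-thread stride from 8 to 9 makes it coprime with the number of banks, so a warp's simultaneous accesses land in distinct banks:

#define POPS 8
#define S    (POPS + 1)                 // padded, odd stride: GCD(#banks, 9) == 1

__global__ void no_bank_conflict(const float *in, float *out) {
    __shared__ float buffer[64 * S];    // 64 threads per block assumed
    int BaseAddress = S * threadIdx.x;  // each thread's padded region

    // data = buffer[BaseAddress + offset]; at step j the warp touches banks
    // (9 * ThreadId + j) mod #banks, all distinct for 16 or 32 banks.
    for (int j = 0; j < POPS; ++j)
        buffer[BaseAddress + j] = in[blockIdx.x * blockDim.x * POPS
                                     + threadIdx.x * POPS + j];

    float acc = 0.0f;
    for (int j = 0; j < POPS; ++j)
        acc += buffer[BaseAddress + j]; // placeholder "work", also conflict-free
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}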

Kernel Templates

Baseline template using shared memory (pops are staged from global to shared memory, the work function runs, and pushes are copied back):

Begin Kernel:
  For number of iterations
    For number of pops
      Shared <- Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global <- Shared
End Kernel

Template without shared memory (each thread's work function reads and writes global memory directly):

Begin Kernel:
  For number of iterations
    Work
End Kernel

Template with helper threads (helper threads perform the global/shared transfers while worker threads run the work function):

Begin Kernel:
  For number of iterations
    If helper threads
      For number of pops
        Shared <- Global
    syncthreads
    If worker threads
      Work
    syncthreads
    If helper threads
      For number of pushes
        Global <- Shared
End Kernel

Template with software prefetching (the pops for the next iteration are loaded into registers while the current iteration executes):

Begin Kernel:
  For number of pops
    Regs <- Global
  For number of iterations
    For number of pops
      Shared <- Regs
    syncthreads
    If not the last iteration
      For number of pops
        Regs <- Global
    Work
    syncthreads
    For number of pushes
      Global <- Shared
End Kernel

Template with loop unrolling (the loop body is repeated so each trip processes two iterations):

Begin Kernel:
  For number of iterations/2
    For number of pops
      Shared <- Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global <- Shared
    For number of pops
      Shared <- Global
    syncthreads
    Work
    syncthreads
    For number of pushes
      Global <- Shared
End Kernel
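To tie the baseline template above back to real code, here is a minimal CUDA rendering of it (a sketch under assumed names and sizes, not Sponge's actual output): a hypothetical A[4,4] actor, 64 threads per block, and an outer loop over iterations, using the same staging pattern shown earlier:

#define POPS    4
#define PUSHES  4
#define THREADS 64

__global__ void actor_kernel(const float *in, float *out, int iters) {
    __shared__ float buf[THREADS * POPS];

    for (int it = 0; it < iters; ++it) {               // "For number of iterations"
        int ibase = (it * gridDim.x + blockIdx.x) * THREADS * POPS;
        int obase = (it * gridDim.x + blockIdx.x) * THREADS * PUSHES;

        for (int i = threadIdx.x; i < THREADS * POPS; i += blockDim.x)
            buf[i] = in[ibase + i];                    // "Shared <- Global" (pops)
        __syncthreads();

        float r[PUSHES];                               // "Work"
        for (int j = 0; j < PUSHES; ++j)
            r[j] = buf[threadIdx.x * POPS + j] * 2.0f; // placeholder computation

        for (int j = 0; j < PUSHES; ++j)               // pushes into shared memory
            buf[threadIdx.x * PUSHES + j] = r[j];
        __syncthreads();
        for (int i = threadIdx.x; i < THREADS * PUSHES; i += blockDim.x)
            out[obase + i] = buf[i];                   // "Global <- Shared" (pushes)
        __syncthreads();                               // buf is reused next iteration
    }
}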