ipdps v3 0 (2).pptx [Read-Only] - NUS Computingwongwf/papers/IPDPS11_ppt.pdfMicrosoft PowerPoint - ipdps v3 0 (2).pptx [Read-Only] Author dcswwf Created Date 5/30/2011 2:52:51 PM

2

TuneTransform

GPU

KernelsBlocks

Warps

Programming model

Application

3

GPU architecture• SIMD constraints• Memory hierarchy

Automated mapping• GPU model• Program transformations

Application

TuneTransform

GPU

KernelsBlocks

Warps

Programming model

Application

KernelsBlocks

Warps

Programming model

Our approach

Stream graph Parallel instances of the entire graph

4

from memory

to memory

Stream graph Parallel instances of the entire graphNovel memory access scheme

5

from memory

to memory

Stream graph Parallel instances of the entire graphNovel memory access schemeUtilize fine-grained parallelism

6

GPU threads

Hierarchical stream graph◦ Well defined rates◦ Pipeline◦ Splitters / Joiners◦ Mostly stateless filters

StreamIt compiler◦ Schedules◦ Flattens◦ Analyzes

Peeking◦ Alternative to filters with state◦ Allows access to input consumed

by future iteration

7

Stream graph

pipeline

splitter

joiner

Udupa et al. (CGO 2009)◦ Software pipelined execution

of stream programs on GPUsno memory prefetchingpipeline computation

Hormati et al. (ASPLOS 2011)◦ Sponge: Portable Stream

Programming on Graphics Enginesmemory access schemememory traffic not fully optimized

(filters fused partially)no compression on multiple threads

8

Global GM memory◦ Large, slow, visible by threads on all SM processorsShared SM memory◦ Small, fast, visible by threads on one SM processor

Other memories, subset of GM◦ Local memory – for registers spilling◦ Texture memory – different access pattern

9

CPU input GM memory

SM memory

SM memory

GPUCompute

kernelCPU output

Array of SIMD processors (SM processors)◦ Each handles a large pool of threads grouped in warpsInterleaved execution for high throughput◦ operation latency (22 cycles)◦ memory latency (~ 400 cycles)

10

Core 1 Core 2 Core Nwarp 0warp 1warp 2

warp 0warp 1warp 2

11

Vicious cycle

Influenced by ratio of:◦ Memory access◦ Computation

Bandwidth limitation

400 cycles!


13

Misconception: Threads should not diverge◦ True only for threads belonging to a warp

Warp:◦ SIMD execution model◦ Static thread allocation (based on thread ID)

No penalty for this CUDA code:if (threadIdx.x < warpSize) {

compute_action();} else if (threadIdx.x >= warpSize && threadIdx.x < 2 * warpSize)

memory_access_action();} else if …

14

Insufficient memory access warps limit performance

15

S1070 GPU

G8800 GPU

Insufficient to sustain required bandwidth

Additional warps for memory access

(Bitonic sort)

Software pipelining a group of iterations in each SM processor

I/O streams in GM memory

16

INi-1 INi

OUTi-1 OUTi OUTi+1

INi+1

SM processorSM memory

GM memory



Double buffer (DB) for I/O exchange

17

DB

SM processor

INi-1 INi

OUTi-1 OUTi OUTi+1

INi+1

SM memory

GM memory



Double buffer (DB) for I/O exchange

Workset (WS) must fit in SM memory◦ I/O data◦ All intermediate stream data

Limited number of iterations

18

WS DB

OUTi INi

SM processor

INi-1 INi

OUTi-1 OUTi OUTi+1

INi+1

SM memory

GM memory

Prefetching◦ 3x COMPUTE warps◦ 3x WS memory

warp 0

warp 1

warp 2

LOAD COMPUTE STORE

LOAD COMPUTE STORE

LOAD COMPUTE STORE

time

Prefetching◦ 3x COMPUTE warps◦ 3x WS memory

Specialization◦ 1x COMPUTE warp◦ 1x (WS+DB) memory

20

time

warp 0

warp 1

warp 2

LOAD COMPUTE STORE

LOAD

COMPUTE

STORE

COMPUTE COMPUTE

LOAD COMPUTE STORE

LOAD COMPUTE STORE

LOAD

STORE

LOAD

STORE

warp 0

warp 1

warp 2


21

GPU threads

Multiple threads / stream iteration◦ Distributed schedule

Synchronization◦ Lock-step execution of warp threads

22

0 1 32 4 5

6 7 8 9

0 1 2 3 4 5

6 7 8 9

23

(FilterBank)

2 threads / iteration

1 thread / iteration

4 threads / iteration

G8800 GPU

S1070 GPU

24

Workset size

SM memoryIterations

I/O size

GM latency Mem. access threads

Filter execution schedule

GPU SIMD constraints

Threads / iteration

x Compute warps

Mem. access warpsadjust to full warp

adjust to full warp

25

Stream program

StreamIt compiler

Schedule Operators

Worksetlayout

GPU kernel mapping

Inject code for CUDA execution

Target GPU specifications

Mapping configuration

GPU compiler

Kernel loader

Operator code

26

Stream program

StreamIt compiler

Schedule Operators

Worksetlayout

GPU kernel mapping

Inject code for CUDA execution

Target GPU specifications

Mapping configuration

GPU compiler

Kernel loader

Operator code

Fragmentation of workset allocation◦ Small buffers are required between filters

Liveness analysis estimation of workset size

Fragmentation actual allocation may lead to a slight increase

Coalesced memory access

27

Inter-iteration dependencies

Overlap input data to reconstruct the initial elements◦ For each SM processor◦ For each parallel thread group

Intuition:◦ Warm-up intermediate buffers◦ Threads access previous iterations

Custom synchronization ◦ Only between compute threads◦ Implemented custom barrier

28

29

versus CGO 2009

30

Different GPU architecturesversus CGO 2009

Novel scheme to execute stream graphs on GPU

Automatic heuristic for selecting efficient design points

Novel memory access scheme through software pipelining

31

Questions?

32

Documents

ipdps v3 0 (2).pptx [Read-Only] - NUS Computingwongwf/papers/IPDPS11_ppt.pdfMicrosoft PowerPoint - ipdps v3 0 (2).pptx [Read-Only] Author dcswwf Created Date 5/30/2011 2:52:51 PM