WIREFRAME: Supporting Data-dependent Parallelism in GPUs (MICRO 50)

WIREFRAME: Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs

AmirAli Abdolrashidi†, Devashree Tripathy†, Mehmet E. Belviranli‡, Laxmi N. Bhuyan†, Daniel Wong†

†University of California Riverside
‡Oak Ridge National Laboratory

Introduction

Motivation

• Despite their support for parallelism, GPUs lack support for data-dependent parallelism.

Example: Wavefront Pattern

[Figure: wavefront pattern over a grid of thread blocks, executed wave by wave. Legend: Barrier, Thread block. The waves continue … until the application ends.]

Example

Global Barriers (Original)

for i = 1 to nWave:
  - Kernel Launch
  - Synchronize

• Enormous host-side kernel launch overhead!
• Waiting on non-parent thread blocks
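To make the cost of the global-barrier loop concrete, the following host-side C++ sketch counts the waves of a 2D wavefront. It assumes that a block at (x, y) depends only on (x-1, y) and (x, y-1), so all blocks with the same x + y form one wave. The names `waveOf`, `numWaves`, and `blocksPerWave` are illustrative and do not come from the slides.

```cpp
#include <cassert>
#include <vector>

// Wave index of block (x, y): blocks on the same anti-diagonal are independent
// (assumption: each block depends only on its left and upper neighbours).
int waveOf(int x, int y) { return x + y; }

// Number of waves, i.e. launch-and-synchronize pairs the global-barrier
// loop needs for a width x height grid of thread blocks.
int numWaves(int width, int height) {
    assert(width > 0 && height > 0);
    return width + height - 1;
}

// Number of thread blocks in each wave; small waves leave the GPU underused.
std::vector<int> blocksPerWave(int width, int height) {
    std::vector<int> count(numWaves(width, height), 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            ++count[waveOf(x, y)];
    return count;
}
```

For a 4×4 grid, `numWaves(4, 4)` is 7, so the loop above issues seven launch-and-synchronize pairs, and the first and last waves contain a single thread block each.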

Example

CDP (Nested)

RUN:
  - Parent Kernel Launch
  - Synchronize

Parent Kernel:
  for i = 1 to nWaves:
    - Child Kernel Launch
    - Synchronize

[Figure: Kernel Execution Pattern]

• No more host-side kernel launch
• Device-side kernel launch still has significant overhead
• NO multi-parent dependency support
• Still NO general dependency support!

Motivation

• There is a need for generalized support for finer-grain inter-block data dependency, for more performance and efficiency.

[Figure legend: Intra-Block, Global, Inter-Block; Thread, Thread Block, Barrier]

Motivation

• Current limitations
  • High device-side kernel launch overhead
  • No general inter-block data dependency support

Wireframe Overview

Host (CPU)

Programming Model

  // Parent coordinates of the current thread block (no trailing semicolons,
  // so the macros can be used as function arguments).
  #define parent1 dim3(blockIdx.x - 1, blockIdx.y, blockIdx.z)
  #define parent2 dim3(blockIdx.x, blockIdx.y - 1, blockIdx.z)

  // Dependency graph generation function attached to the kernel launch.
  void* DepLink() {
    if (blockIdx.x > 0) WF::AddDependency(parent1);
    if (blockIdx.y > 0) WF::AddDependency(parent2);
  }

  int main() {
    kernel<<<GridSize, BlockSize, DepLink>>>(0, args);
  }

  __WF__ void kernel(args) {
    processWave();
  }

Dependency Graph → Convert to CSR → Node Array, Edge Array

Device (GPU)

• Global Memory: Global Node Array, Global Edge Array
• DATS Hardware (Dependency Graph Buffer): Local Node Array, Local Edge Array
• Pending Update Buffer
• Node Insertion Buffer
• TB Scheduler

Programming Model

• New functions are needed to support dependency in CUDA
  • Add dependency
  • Policy settings

• Proposing the DepLinks model
  • Would assign a dependency graph generation function to a kernel
  • Easy to learn and use

Wireframe Pseudo-code

parent1 := (X-1, Y)
parent2 := (X, Y-1)

RUN:
  - Kernel Launch (DepLinks)

DepLinks @ BLOCK(X, Y):
  - AddDependency(parent1)
  - AddDependency(parent2)

One kernel launch!

[Figure: 4×4 grid of thread blocks with origin (0,0) and axes X and Y, numbered 0 to 15 and drawn wave by wave: 0 | 4 1 | 8 5 2 | 12 9 6 3 | 13 10 7 | 14 11 | 15. Highlighted example: block 6 with parent1 = 5 and parent2 = 2.]
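The DepLinks rule above can be modelled on the host as a short C++ sketch. This is not the Wireframe runtime: `DepGraph` and `buildWavefrontGraph` are hypothetical names, and the linear block ID `y * width + x` is an assumption chosen because it reproduces the figure (block 6 = (2, 1) has parents 5 and 2).

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of the DepLinks rule for a width x height grid.
struct DepGraph {
    int width, height;
    std::vector<std::vector<int>> parents;   // parents[id]: blocks id waits on
    std::vector<std::vector<int>> children;  // children[id]: blocks waiting on id
};

DepGraph buildWavefrontGraph(int width, int height) {
    assert(width > 0 && height > 0);
    DepGraph g{width, height,
               std::vector<std::vector<int>>(width * height),
               std::vector<std::vector<int>>(width * height)};
    auto addDependency = [&g](int child, int parent) {
        g.parents[child].push_back(parent);
        g.children[parent].push_back(child);
    };
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int id = y * width + x;  // assumed linearization of (X, Y)
            if (x > 0) addDependency(id, y * width + (x - 1));  // parent1 = (X-1, Y)
            if (y > 0) addDependency(id, (y - 1) * width + x);  // parent2 = (X, Y-1)
        }
    }
    return g;
}
```

Calling `buildWavefrontGraph(4, 4)` gives block 6 the parent list {5, 2} and block 0 no parents, so block 0 is the only block that can start immediately.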

Dependency Graph

• Parent count and level of every node determined at runtime
• Sent to the GPU's global memory
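One way to derive the two per-node values named above is a topological traversal, sketched here in C++. The slides do not define "level" precisely; this sketch assumes it is the length of the longest path from any root, and `NodeInfo` and `computeParentCountAndLevel` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

struct NodeInfo {
    int parentCount = 0;  // number of parents that must finish first
    int level = 0;        // assumed: longest path length from any root
};

// children[u] lists the nodes that depend on u (graph assumed acyclic).
std::vector<NodeInfo> computeParentCountAndLevel(
        const std::vector<std::vector<int>>& children) {
    const int n = static_cast<int>(children.size());
    std::vector<NodeInfo> info(n);
    for (int u = 0; u < n; ++u)
        for (int v : children[u]) ++info[v].parentCount;

    // Kahn-style traversal: a node's level is final once all parents are seen.
    std::vector<int> remaining(n);
    std::queue<int> ready;
    for (int u = 0; u < n; ++u) {
        remaining[u] = info[u].parentCount;
        if (remaining[u] == 0) ready.push(u);
    }
    int visited = 0;
    while (!ready.empty()) {
        int u = ready.front();
        ready.pop();
        ++visited;
        for (int v : children[u]) {
            info[v].level = std::max(info[v].level, info[u].level + 1);
            if (--remaining[v] == 0) ready.push(v);
        }
    }
    assert(visited == n);  // fails if the graph contains a cycle
    return info;
}
```

With child lists {1,2}, {3,4}, {4,5} for nodes 0 to 2, node 4 ends up with a parent count of 2 and level 2, and node 5 with a parent count of 1 and level 2.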

Node Renaming

• To minimize the data level range in the buffers

[Figure: dependency graph with nodes arranged in rows from Level 0 to Level 5]
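The slide does not spell out the renaming rule. A plausible reading, sketched below in C++, is that nodes are renumbered so that IDs increase with level, which keeps any window of consecutive IDs within a narrow range of levels. The function `renameByLevel` and the stable tie-breaking are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Returns newId[oldId]: nodes are renumbered in non-decreasing level order,
// keeping the original relative order inside each level (stable sort).
std::vector<int> renameByLevel(const std::vector<int>& level) {
    std::vector<int> order(level.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&level](int a, int b) { return level[a] < level[b]; });
    // Sanity check: the first renamed node is never deeper than the last one.
    assert(order.empty() || level[order.front()] <= level[order.back()]);
    std::vector<int> newId(level.size());
    for (int rank = 0; rank < static_cast<int>(order.size()); ++rank)
        newId[order[rank]] = rank;
    return newId;
}
```

For a 3×3 wavefront stored row by row, the levels are 0 1 2 1 2 3 2 3 4; after renaming, old node 3 (level 1) becomes node 2 and old node 2 (level 2) becomes node 3, so each level occupies a contiguous ID range.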

Dependency-Aware TB Scheduler (DATS)

• Thread block scheduler
  • Issues the relevant thread block at the right time for execution, based on the dependency graph

• Dependency Graph Buffer (DGB)
  • Caches data from global memory
  • Challenge: efficient caching and data utilization

Dependency-Aware TB Scheduler (DATS)

• Data stored in compressed sparse row (CSR) format
  • To reduce memory usage
• Thread blocks → Node Array
• Dependencies → Edge Array
• Space complexity
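The mapping of thread blocks to a Node Array and dependencies to an Edge Array can be sketched in C++ as follows. The struct `Csr`, the function `toCsr`, and the trailing sentinel entry are assumptions of this sketch, not details taken from the slides. In general, CSR stores one offset per node plus one entry per edge.

```cpp
#include <cassert>
#include <vector>

struct Csr {
    std::vector<int> nodeArray;  // nodeArray[u] = index of u's first edge
    std::vector<int> edgeArray;  // concatenated child lists
};

// Flattens per-node child lists into CSR form. A final sentinel entry
// (the total edge count) is appended so that the edges of node u are
// edgeArray[nodeArray[u]] up to, but not including, edgeArray[nodeArray[u + 1]].
Csr toCsr(const std::vector<std::vector<int>>& children) {
    Csr csr;
    csr.nodeArray.reserve(children.size() + 1);
    for (const auto& list : children) {
        csr.nodeArray.push_back(static_cast<int>(csr.edgeArray.size()));
        csr.edgeArray.insert(csr.edgeArray.end(), list.begin(), list.end());
    }
    csr.nodeArray.push_back(static_cast<int>(csr.edgeArray.size()));
    assert(csr.nodeArray.size() == children.size() + 1);
    return csr;
}
```

For the child lists {1,2}, {3,4}, {4,5}, {6}, this yields the node array 0 2 4 6 7 and the edge array 1 2 3 4 4 5 6.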

DATS Overview

GLOBAL MEMORY
• Global Node Array, one entry per Global Node ID (0 to 8); Global Edge Start values: 0 2 4 6 7 9 11 12 14
• Global Edge Array (indices 0 to 14): 1 2 3 4 4 5 6 6 7 7 8 9 8 9 9
• Node entry fields: Edge Start, Global Node ID, Parent Counter, Level (field widths shown: 32 bits, 16 bits, 16 bits, 16 bits)
• Base Pointer

Translation of Global Edge Start to Local Edge Start: $\mathit{LES}_i = (\mathit{GES}_i) \bmod \mathit{EdgeArray}$

DEPENDENCY GRAPH BUFFER (DGB)
• Local Node Array: Local Edge Start, Global Node ID; Head (H) pointer
• Local Edge Array (circular buffer, slots 0 to 6): 6 7 7 8 8 5 6
• Pending Update Buffer
• Node Insertion Buffer

[Figure labels: Global ID, Edge index, Local Edge Start, Node index; highlighted Global Edge Start values: 7 9 11 12]
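The translation from a global edge index to a local one can be illustrated with a small circular buffer in C++. This sketch assumes that the modulus in the formula is the capacity of the Local Edge Array; the class `LocalEdgeArray` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Circular Local Edge Array: a global edge index maps to the local slot
// (global index) % (capacity). Assumption: the modulus in the slide's
// formula is the capacity of the Local Edge Array.
class LocalEdgeArray {
public:
    explicit LocalEdgeArray(std::size_t capacity) : slots_(capacity, -1) {
        assert(capacity > 0);
    }
    std::size_t toLocal(std::size_t globalEdgeIndex) const {
        return globalEdgeIndex % slots_.size();
    }
    // Copies edges [first, last) of the global edge array into the buffer,
    // overwriting whatever occupied those slots before.
    void load(const std::vector<int>& globalEdges,
              std::size_t first, std::size_t last) {
        assert(first <= last && last <= globalEdges.size());
        assert(last - first <= slots_.size());  // the window must fit
        for (std::size_t g = first; g < last; ++g)
            slots_[toLocal(g)] = globalEdges[g];
    }
    int at(std::size_t globalEdgeIndex) const {
        return slots_[toLocal(globalEdgeIndex)];
    }
private:
    std::vector<int> slots_;
};
```

With a capacity of 7, global edge indices 7, 9, 11, and 12 map to local slots 0, 2, 4, and 5, so newly loaded edges reuse the slots of edges that are no longer needed.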

Node State Table

State:           R  W  W  W
Parent Count:    0  1  1  1
Level:           0  1  1  2
Global Node ID:  0  1  2  3

States: Wait (W), Ready (R), Processing (P), Done (D)

Local Node Array (Local Edge Start): 0 2 4 6
Local Edge Array: 1 2 3 4 4 5 6 (with Tail T and Head H pointers)

[Figure: dependency graph of nodes 0 to 8]

Example: Child Node Execution

State:           R  W  W  W
Parent Count:    0  1  1  1
Level:           0  1  1  2
Global Node ID:  0  1  2  3

States: Wait (W), Ready (R), Processing (P), Done (D)

Local Node Array (Local Edge Start): 0 2 4 6
Local Edge Array: 1 2 3 4 4 5 6 (with Tail T and Head H pointers)

[Figure: dependency graph of nodes 0 to 8, animated in steps: (1) node 0 finishes and is marked Done (D); (2) the parent counts of its children, nodes 1 and 2, drop to 0; (3) nodes 1 and 2 become Ready (R).]
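The update performed when a node finishes can be sketched in C++ as below. It follows the example above (a finished parent decrements its children's parent counts, and a waiting child at 0 becomes Ready), but the function `completeNode`, the `Node` struct, and the CSR arrays with a trailing sentinel are assumptions of this sketch rather than the hardware design.

```cpp
#include <cassert>
#include <vector>

enum class State { Wait, Ready, Processing, Done };

struct Node {
    State state;
    int parentCount;
};

// Marks `finished` as Done and releases its children: each child's parent
// count is decremented, and a waiting child whose count reaches 0 becomes
// Ready. nodeArray/edgeArray are in CSR form with a sentinel at the end
// (an assumption of this sketch). Returns the nodes that became Ready.
std::vector<int> completeNode(int finished, std::vector<Node>& nodes,
                              const std::vector<int>& nodeArray,
                              const std::vector<int>& edgeArray) {
    assert(nodes[finished].state == State::Processing);
    nodes[finished].state = State::Done;
    std::vector<int> nowReady;
    for (int e = nodeArray[finished]; e < nodeArray[finished + 1]; ++e) {
        Node& child = nodes[edgeArray[e]];
        assert(child.parentCount > 0);
        if (--child.parentCount == 0 && child.state == State::Wait) {
            child.state = State::Ready;
            nowReady.push_back(edgeArray[e]);
        }
    }
    return nowReady;
}
```

Completing node 0 in the example graph returns nodes 1 and 2 as newly Ready; completing node 1 afterwards releases node 3 but leaves node 4 waiting on its second parent.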

Example: Update Buffer Store

State:           D  P  P  W
Parent Count:    0  0  0  1
Level:           0  1  1  2
Global Node ID:  0  1  2  3

States: Wait (W), Ready (R), Processing (P), Done (D)

Local Node Array (Local Edge Start): 0 2 4 6
Local Edge Array: 1 2 3 4 4 5 6 (with Tail T and Head H pointers)

[Figure: dependency graph of nodes 0 to 8, animated in steps: (1) node 2 finishes and is marked Done (D); (2) its children, nodes 4 and 5, are not in the Local Node Array, so their updates (#4, #5) are stored in the Pending Update Buffer.]

Example: Invalidation…

State:           D  P  D  W
Parent Count:    0  0  0  1
Level:           0  1  1  2
Global Node ID:  0  1  2  3

States: Wait (W), Ready (R), Processing (P), Done (D)

Local Node Array (Local Edge Start): 0 2 4 6
Local Edge Array: 1 2 3 4 4 5 6 (with Tail T and Head H pointers)
Pending Update Buffer: #4 #5, then #4 #5 #4

[Figure: dependency graph of nodes 0 to 8, animated in steps: (1) node 1 finishes and is marked Done (D); (2) the parent count of its child node 3 drops to 0; (3) node 3 becomes Ready (R); (4) the update for its other child, node 4, is appended to the Pending Update Buffer; (5) invalidation leaves enough space to load new nodes into the DGB.]

Enough space to load to DGB

Example: …Reloading data

State:           W  W  D  R
Parent Count:    2  1  0  0
Level:           2  2  1  2
Global Node ID:  4  5  2  3

States: Wait (W), Ready (R), Processing (P), Done (D)

Local Node Array (Local Edge Start): 0 2 - 6
Local Edge Array: 6 7 7 - - - 6 (with Tail T and Head H pointers)
Pending Update Buffer: #4 #5 #4

Load complete!

[Figure: dependency graph of nodes 0 to 8]

Example: Update Buffer Load

State W W D P

ParentCount 2 1 0 0

Level 2 2 1 2

GlobalNodeID 4 5 2 3

States: Wait (W), Ready (R), Processing (P), Done (D)

6 7 7 - - - 6

T H

Step 1: 0 2 - 6 #4 #5 #4

Step 2: 0 2 - 6 #5 (slide annotations: 1, 3, 10 2, R)

Step 3: 0 2 - 6 (slide annotations: 1, 3, 4, 10 2, R, 0, R)

[Figure: dependency graph with nodes 0 to 8]
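The update-buffer example can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design. It assumes that each pending entry `#n` decrements the ParentCount of node `n`, and that a waiting node becomes Ready when its count reaches zero. The slide does not state these semantics; they are inferred from the table values. The names `nodes`, `pending`, and `drain_updates` are hypothetical.

```python
from collections import deque

WAIT, READY, PROCESSING, DONE = "W", "R", "P", "D"

# Node entries copied from the slide's table, keyed by GlobalNodeID.
nodes = {
    4: {"state": WAIT, "parent_count": 2, "level": 2},
    5: {"state": WAIT, "parent_count": 1, "level": 2},
    2: {"state": DONE, "parent_count": 0, "level": 1},
    3: {"state": PROCESSING, "parent_count": 0, "level": 2},
}


def drain_updates(nodes, pending):
    """Apply pending updates in FIFO order.

    Assumed semantics (not stated on the slide): each entry names a node
    whose ParentCount is decremented; a waiting node whose count reaches
    zero becomes Ready.
    """
    newly_ready = []
    while pending:
        node_id = pending.popleft()
        entry = nodes[node_id]
        entry["parent_count"] -= 1
        if entry["parent_count"] == 0 and entry["state"] == WAIT:
            entry["state"] = READY
            newly_ready.append(node_id)
    return newly_ready


# Pending entries as listed in step 1 of the example: #4 #5 #4.
pending = deque([4, 5, 4])
ready = drain_updates(nodes, pending)
print(ready)
```

Under these assumptions, both waiting nodes end up Ready once the three entries are consumed.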

Challenges

• Minimizing global memory usage
  • Used CSR format

• Minimizing the buffer size
  • Limit Level Range
  • Local Node/Edge Array size
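As a rough illustration of why CSR keeps the graph compact, the sketch below stores a dependency graph as one offset array plus one flat child array. The 9-node edge list is an assumed 3x3 wavefront graph chosen to match the node numbering in the earlier example. The function and variable names (`build_csr`, `row_ptr`, `col_idx`) are hypothetical and do not come from the slides.

```python
def build_csr(num_nodes, edges):
    """Build CSR arrays for a DAG given (parent, child) pairs.

    Returns (row_ptr, col_idx, parent_count): the children of node n are
    col_idx[row_ptr[n]:row_ptr[n + 1]].
    """
    children = [[] for _ in range(num_nodes)]
    parent_count = [0] * num_nodes
    for parent, child in edges:
        children[parent].append(child)
        parent_count[child] += 1

    row_ptr, col_idx = [0], []
    for kids in children:
        col_idx.extend(sorted(kids))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, parent_count


# Assumed 3x3 wavefront graph (illustrative only).
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 4), (2, 5),
         (3, 6), (4, 6), (4, 7), (5, 7), (6, 8), (7, 8)]
row_ptr, col_idx, parent_count = build_csr(9, edges)


def children_of(node):
    """Return the child list of `node` as a slice of the flat edge array."""
    return col_idx[row_ptr[node]:row_ptr[node + 1]]


print(children_of(4), parent_count[4])
```

Storage is one offset per node plus one entry per edge, which is why CSR suits sparse dependency graphs.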

Level Range

• Unbalanced execution may entail using the baseline TB scheduling policy (LRR).

[Figure: sample benchmark (HEAT2D) w/ LRR scheduler]

Level Range

• Unbounded level range means:
  • Larger DGB is required
  • Limiting TB execution

Key challenge: Efficient scheduling

Level-bound Scheduling (LVL)

• Prioritizing lower-level thread blocks in the graph
• More ready nodes → More parallelism
• Minimizing the buffering operation
• Limiting the level range to avoid serialization
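The idea behind the bullets above can be modeled as a small lock-step simulation. This is a simplified sketch under stated assumptions, not the scheduler described in the talk. It assumes that all nodes issued in a round finish together, that ready nodes are issued lowest level first, and that a node may issue only if its level is below the lowest unfinished level plus `level_range`. The graph is an assumed 3x3 wavefront, and the parameter names `slots` and `level_range` are hypothetical.

```python
import heapq


def lvl_schedule(children, parent_count, level, slots, level_range):
    """Simulate level-bound scheduling in lock-step rounds.

    Each round issues up to `slots` ready nodes, lowest level first, and
    only nodes whose level is below (lowest unfinished level + level_range).
    Returns the list of rounds, each a list of issued node IDs.
    """
    n = len(level)
    remaining = list(parent_count)
    done = [False] * n
    ready = [(level[i], i) for i in range(n) if remaining[i] == 0]
    heapq.heapify(ready)
    rounds = []
    while not all(done):
        base = min(level[i] for i in range(n) if not done[i])
        issued = []
        while ready and len(issued) < slots and ready[0][0] < base + level_range:
            issued.append(heapq.heappop(ready)[1])
        if not issued:
            raise RuntimeError("no node can be issued; check levels and limits")
        for node in issued:
            done[node] = True
            for child in children[node]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    heapq.heappush(ready, (level[child], child))
        rounds.append(issued)
    return rounds


# Assumed 3x3 wavefront graph (illustrative only).
children = [[1, 2], [3, 4], [4, 5], [6], [6, 7], [7], [8], [8], []]
parent_count = [0, 1, 1, 1, 2, 1, 2, 2, 2]
level = [0, 1, 1, 2, 2, 2, 3, 3, 4]

tight = lvl_schedule(children, parent_count, level, slots=2, level_range=1)
loose = lvl_schedule(children, parent_count, level, slots=2, level_range=2)
print(tight)
print(loose)
```

With `level_range=1` the model never lets a node run ahead of the lowest unfinished level. A wider range lets node 6 issue alongside node 5, which in hardware would mean tracking more levels at once.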

Local Node Array Size

• Empirical estimation used
  • Reduce size
    • Until performance suffers
• LVL saves 64% PUB size

[Figure: IPC and max update buffer size (entries) versus local node array size (64, 128, 192, 320, 512, 576, 640 entries) for LRR_PUB, LVL_PUB, LRR_IPC, and LVL_IPC. Annotations: size chosen (128 entries); 64% PUB size reduction.]
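The "reduce size until performance suffers" procedure can be expressed as a simple sweep. The sketch below is one possible reading of it, not the authors' method. The IPC values are made up for illustration and are not read from the chart, and the 2% tolerance and the name `smallest_adequate_size` are hypothetical.

```python
def smallest_adequate_size(ipc_by_size, tolerance=0.02):
    """Return the smallest size whose IPC is within `tolerance` of the best IPC."""
    best = max(ipc_by_size.values())
    for size in sorted(ipc_by_size):
        if ipc_by_size[size] >= best * (1.0 - tolerance):
            return size
    return None  # unreachable: the best-IPC size always passes the check


# Hypothetical IPC measurements per local node array size (entries).
ipc_by_size = {
    64: 310.0,
    128: 402.0,
    192: 404.0,
    320: 405.0,
    512: 405.0,
    576: 405.0,
    640: 405.0,
}

chosen = smallest_adequate_size(ipc_by_size)
print(chosen)
```

The tolerance sets how much IPC loss counts as performance suffering, so it would need to be chosen per design.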

Local Edge Array Size

• Empirical estimation used
  • LVL requires 75% less storage
  • 256 entries

[Figure annotation: 75% Edge Array reduction]

Evaluation

• Evaluation platform
  • GPGPU-Sim v3.2.2 (GTX 480)
  • Six data dependency-heavy benchmarks

• Cases
  • Global, CDP
  • DepLinks primitives
  • LRR and LVL
    • LVL = 3

Performance Breakdown (4K Graph Size)

[Figure: normalized speedup (axis 0.9 to 1.4) for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and Average. Series: LVL, LRR, DepLinks, CDP, Global.]

Annotations, in the order they appear on the slide:

• NOT accounting for data transfer time
• Theoretical value based on [1]
• Barriers enforced by DepLinks instead; kernel launch overhead removed
• Barriers removed. Nodes can now run ahead.
• Level-bound TB scheduling

[1] Jin Wang, Norm Rubin, Albert Sidelnik, and Sudhakar Yalamanchili. 2016. Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs.

Performance

• Speedup across different graph sizes

[Figure: overall speedup (LVL), axis 0.9 to 1.7, for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and GeoMean at graph sizes 1K, 4K, and 9K. Annotations: +45%, +65%.]

Evaluation Summary

• 2KB area overhead

• No significant impact on L2 miss rate

• Low global memory request overhead
  • 0.13% Average

Conclusion

• Presenting Wireframe, hardware support for GPU data dependency

• Supporting generalized inter-block dependencies through hardware

• Minimizing buffering through level-bound TB scheduling

• 45% average speedup over the baseline

Thank you!

Questions?

Computations vs Launch Overhead

• With a constant data size
  • Kernel launches increase with graph size
  • is still sizable at 9K nodes.
  • times on average

[Figure: Comp/Launch ratio (axis 0 to 14) for DTW, HIST, SOR, and GeoMean at graph sizes 1K, 4K, and 9K]

Performance

• Impact on L2 ~0.5%

Performance (IPC)
