
WIREFRAME: Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs

MICRO 50

AmirAli Abdolrashidi†, Devashree Tripathy†, Mehmet E. Belviranli‡, Laxmi N. Bhuyan†, Daniel Wong†

†University of California Riverside

‡Oak Ridge National Laboratory


Introduction




Motivation

• Despite the support for parallelism, GPUs lack support for data-dependent parallelism.


Example: Wavefront Pattern

[Figure: wavefront pattern. Thread blocks execute wave by wave, with a barrier after each wave, until the application ends.]


Example

Global Barriers (Original)

for i = 1 to nWave:
  - Kernel Launch
  - Synchronize


• Enormous host-side kernel launch overhead!

• Waiting on non-parent thread blocks
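The cost of the global-barrier scheme can be made concrete with a small host-side sketch. It assumes a width × height grid of thread blocks in which block (x, y) depends on (x-1, y) and (x, y-1), so that each anti-diagonal forms one wave. The function names are illustrative only and belong to no existing API.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One (x, y) coordinate per thread block.
using Block = std::pair<int, int>;

// Group the blocks of a width x height grid into waves, assuming block (x, y)
// depends on (x-1, y) and (x, y-1). Under that assumption, all blocks with the
// same x + y are mutually independent and form one wave.
std::vector<std::vector<Block>> wavefront_waves(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<std::vector<Block>> waves(width + height - 1);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            waves[x + y].push_back({x, y});
    return waves;
}

// With global barriers, the host performs one kernel launch and one
// synchronization per wave.
int host_launches_with_global_barriers(int width, int height) {
    return static_cast<int>(wavefront_waves(width, height).size());
}
```

For a 4×4 grid this yields 7 waves, hence 7 launch/synchronize pairs on the host, and every block in a wave waits for the whole previous wave even though it depends on at most two of its blocks.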








Example

CDP (Nested)

RUN:
  - Parent Kernel Launch
  - Synchronize

Parent Kernel:
  for i = 1 to nWaves:
    - Child Kernel Launch
    - Synchronize


[Figure: kernel execution pattern under CDP (Nested)]

• No more host-side kernel launch

• Device-side kernel launch still has significant overhead

• NO multi-parent dependency support

• Still NO general dependency support!




Motivation

• There is a need for generalized support for finer-grain inter-block data dependency, for better performance and efficiency.

[Figure labels: Intra-Block, Inter-Block, Global; Thread, Thread Block, Barrier]

• Current limitations
  • High device-side kernel launch overhead
  • No general inter-block data dependency support


Wireframe Overview

Programming Model

    // Parents of the current thread block (left and upper neighbors).
    #define parent1 dim3(blockIdx.x - 1, blockIdx.y, blockIdx.z)
    #define parent2 dim3(blockIdx.x, blockIdx.y - 1, blockIdx.z)

    // Dependency generation function, evaluated for each thread block.
    void* DepLink() {
        if (blockIdx.x > 0) WF::AddDependency(parent1);
        if (blockIdx.y > 0) WF::AddDependency(parent2);
    }

    __WF__ void kernel(args) {
        processWave();
    }

    int main() {
        // DepLink is passed as an extra launch parameter.
        kernel<<<GridSize, BlockSize, DepLink>>>(0, args);
    }
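To make the effect of DepLink concrete, the following host-side C++ sketch emulates what the dependency generation function would produce for a 2D grid. It does not use the proposed `WF::AddDependency` call, which is not part of stock CUDA; `Dim2`, `dep_link`, and `build_parent_lists` are hypothetical stand-ins, and the linear numbering id = y * width + x is an assumption.

```cpp
#include <cassert>
#include <vector>

struct Dim2 { int x, y; };

// Parents of block (x, y) under the wavefront rule from the slide:
// parent1 = (x-1, y) if x > 0, parent2 = (x, y-1) if y > 0.
std::vector<Dim2> dep_link(Dim2 b) {
    std::vector<Dim2> parents;
    if (b.x > 0) parents.push_back({b.x - 1, b.y});
    if (b.y > 0) parents.push_back({b.x, b.y - 1});
    return parents;
}

// Evaluate dep_link for every block and return, per linear block ID
// (id = y * width + x, an assumed numbering), the IDs of its parents.
std::vector<std::vector<int>> build_parent_lists(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<std::vector<int>> parents(width * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            for (Dim2 p : dep_link({x, y}))
                parents[y * width + x].push_back(p.y * width + p.x);
    return parents;
}
```

With a 4×4 grid, block 6 gets parents 5 and 2, which is consistent with the 4×4 example used in this deck. Block 0 has no parents and is therefore the only block that can start immediately.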


[Figure: Wireframe overview. On the host (CPU), the programming model produces a dependency graph, which is converted to CSR (Node Array, Edge Array). On the device (GPU), global memory holds the Global Node Array and the Global Edge Array. The DATS hardware comprises the Dependency Graph Buffer (Local Node Array, Local Edge Array), the Pending Update Buffer, and the Node Insertion Buffer, alongside the TB Scheduler.]


Programming Model

• New functions are needed to support dependency in CUDA
  • Add dependency
  • Policy settings

• Proposing the DepLinks model
  • Would assign a dependency graph generation function to a kernel
  • Easy to learn and use


Wireframe Pseudo-code

parent1 := (X-1, Y)
parent2 := (X, Y-1)

RUN:
  - Kernel Launch (DepLinks)

DepLinks @ BLOCK(X, Y):
  - AddDependency(parent1)
  - AddDependency(parent2)

One kernel launch!

[Figure: 4×4 grid of thread blocks with origin (0,0) and X/Y axes, drawn wave by wave: 0 | 4 1 | 8 5 2 | 12 9 6 3 | 13 10 7 | 14 11 | 15. Example: block 6 has parent1 = 5 and parent2 = 2.]


Dependency Graph

• Parent count and level of every node determined at runtime

• Sent to the GPU's global memory
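A minimal sketch of how parent counts and levels could be derived from a dependency graph is given below. The slides do not define "level" precisely; this sketch assumes level 0 for nodes without parents and 1 + the maximum parent level otherwise. `NodeInfo` and `analyze` are illustrative names, not part of Wireframe.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

struct NodeInfo {
    std::vector<int> parent_count;  // number of parents per node
    std::vector<int> level;         // assumed: 0 for roots, else 1 + max parent level
};

// children[u] lists the nodes that depend on u. The graph must be acyclic.
NodeInfo analyze(const std::vector<std::vector<int>>& children) {
    const int n = static_cast<int>(children.size());
    NodeInfo info{std::vector<int>(n, 0), std::vector<int>(n, 0)};
    for (int u = 0; u < n; ++u)
        for (int c : children[u]) {
            assert(c >= 0 && c < n);
            ++info.parent_count[c];
        }

    // Kahn-style traversal on a scratch copy of the parent counts.
    std::vector<int> remaining = info.parent_count;
    std::queue<int> ready;
    for (int u = 0; u < n; ++u)
        if (remaining[u] == 0) ready.push(u);
    int visited = 0;
    while (!ready.empty()) {
        const int u = ready.front();
        ready.pop();
        ++visited;
        for (int c : children[u]) {
            info.level[c] = std::max(info.level[c], info.level[u] + 1);
            if (--remaining[c] == 0) ready.push(c);
        }
    }
    assert(visited == n);  // fails if the graph contains a cycle
    return info;
}
```

For the small graph 0→{1,2}, 1→{3,4}, 2→{4,5}, this gives parent counts 0, 1, 1, 1, 2, 1 and levels 0, 1, 1, 2, 2, 2, which agrees with the values shown in the node state table examples of this deck.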


Node Renaming

• To minimize data level range in the buffers

[Figure: renamed nodes grouped by level, Level 0 through Level 5]


Dependency-Aware TB Scheduler (DATS)

• Thread block scheduler
  • Issues the relevant thread block at the right time for execution, based on the dependency graph

• Dependency Graph Buffer (DGB)
  • Caches data from global memory
  • Challenge: efficient caching and data utilization

• Data stored in compressed sparse row (CSR) format
  • To reduce memory usage
  • Thread blocks → Node Array
  • Dependencies → Edge Array
  • … space complexity
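The CSR layout can be sketched as follows: the Node Array stores, per thread block, the index at which its children start in the Edge Array. The trailing end sentinel in `node_array` is an assumption made for convenience (the slides show only per-node start indices), and `Csr`, `to_csr`, and `children_of` are illustrative names.

```cpp
#include <cassert>
#include <vector>

struct Csr {
    std::vector<int> node_array;  // edge start per node, plus one end sentinel (assumed)
    std::vector<int> edge_array;  // child IDs, grouped by parent
};

// Flatten per-node child lists into CSR form.
Csr to_csr(const std::vector<std::vector<int>>& children) {
    Csr csr;
    csr.node_array.reserve(children.size() + 1);
    for (const auto& list : children) {
        csr.node_array.push_back(static_cast<int>(csr.edge_array.size()));
        csr.edge_array.insert(csr.edge_array.end(), list.begin(), list.end());
    }
    csr.node_array.push_back(static_cast<int>(csr.edge_array.size()));
    return csr;
}

// Children of node u occupy edge_array[node_array[u] .. node_array[u + 1]).
std::vector<int> children_of(const Csr& csr, int u) {
    assert(u >= 0 && u + 1 < static_cast<int>(csr.node_array.size()));
    return std::vector<int>(csr.edge_array.begin() + csr.node_array[u],
                            csr.edge_array.begin() + csr.node_array[u + 1]);
}
```

Storage is one entry per node plus one entry per edge, which is why CSR suits sparse dependency graphs where each thread block has only a few children.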


DATS Overview

[Figure: DATS overview. Global memory holds the Global Node Array (Global Node ID, Global Edge Start) and the Global Edge Array. The Dependency Graph Buffer (DGB) holds the Local Node Array (Local Edge Start, Global Node ID) and the Local Edge Array, addressed relative to a Base Pointer, next to the Pending Update Buffer and the Node Insertion Buffer (circular buffer).]

Example contents: Global Edge Start = 0 2 4 6 7 9 11 12 14 for Global Node IDs 0 to 8; Global Edge Array = 1 2 3 4 4 5 6 6 7 7 8 9 8 9 9.

Node entry fields: Edge Start (32 bits), Global Node ID (16 bits), Parent Counter (16 bits), Level (16 bits)

Translation of Global Edge Start to Local Edge Start: $LES_i = GES_i \bmod \text{EdgeArraySize}$, with EdgeArraySize the number of entries in the Local Edge Array.
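The translation from Global Edge Start to Local Edge Start amounts to indexing a circular buffer. A minimal sketch, assuming the modulus is the number of entries in the Local Edge Array (the function and parameter names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Map a Global Edge Start index to its slot in a circular Local Edge Array.
// local_edge_array_size is the number of entries in the local array, assumed
// here to be the modulus in the slide's formula.
std::uint32_t local_edge_start(std::uint32_t global_edge_start,
                               std::uint32_t local_edge_array_size) {
    assert(local_edge_array_size > 0);
    return global_edge_start % local_edge_array_size;
}
```

With a 7-entry local array, global edge indices 5 and 6 map to slots 5 and 6, and index 7 wraps around to slot 0, so newly loaded edges overwrite the oldest entries first.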


Node State Table

States: Wait (W), Ready (R), Processing (P), Done (D)

State           R  W  W  W
Parent Count    0  1  1  1
Level           0  1  1  2
Global Node ID  0  1  2  3

Local Node Array (Local Edge Start): 0 2 4 6
Local Edge Array (with Tail and Head pointers): 1 2 3 4 4 5 6

[Figure: example dependency graph with nodes 0 to 8]


Example: Child Node Execution

State           R  W  W  W
Parent Count    0  1  1  1
Level           0  1  1  2

Local Edge Start: 0 2 4 6
Local Edge Array: 1 2 3 4 4 5 6

[Figure sequence: (1) node 0 completes and is marked Done; (2) the parent counts of its children, nodes 1 and 2, drop to 0; (3) both children become Ready.]


Example: Update Buffer Store

State           D  P  P  W
Parent Count    0  0  0  1
Level           0  1  1  2

Local Edge Start: 0 2 4 6
Local Edge Array: 1 2 3 4 4 5 6

[Figure sequence: node 2 completes and is marked Done. Its children, nodes 4 and 5, are not in the DGB, so their IDs (#4, #5) are stored in the Pending Update Buffer.]


Example: Invalidation…

State           D  P  D  W
Parent Count    0  0  0  1
Level           0  1  1  2

Local Edge Start: 0 2 4 6
Local Edge Array: 1 2 3 4 4 5 6
Pending Update Buffer: #4 #5

[Figure sequence: node 1 completes and is marked Done. Its child node 3 is in the DGB, so its parent count drops to 0 and it becomes Ready. Its child node 4 is not, so #4 is appended to the Pending Update Buffer (now #4 #5 #4). Enough space is then available to load new nodes into the DGB.]


Example: …Reloading data

State           W  W  D  R
Parent Count    2  1  0  0
Level           2  2  1  2

Local Edge Start: 0 2 - 6
Local Edge Array: 6 7 7 - - - 6
Pending Update Buffer: #4 #5 #4

Load complete!


Example: Update Buffer Load

State           W  W  D  P
Parent Count    2  1  0  0
Level           2  2  1  2

Local Edge Start: 0 2 - 6
Local Edge Array: 6 7 7 - - - 6
Pending Update Buffer: #4 #5 #4

[Figure sequence: the pending updates are applied to the newly loaded nodes and removed from the Pending Update Buffer one by one. The parent counts of the loaded nodes drop to 0 and they become Ready.]
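The bookkeeping in these walk-through examples can be summarized in a small software model. This is a simplified sketch, not the hardware design: it ignores buffer capacity and the circular layout, merges loading with the application of pending updates, and all class and method names are illustrative.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

enum class State { Wait, Ready, Processing, Done };

struct Entry {
    State state;
    int parent_count;
};

// Simplified model of the node state table plus Pending Update Buffer.
class DependencyBuffer {
public:
    // Load a node and apply any updates parked for it while it was absent.
    void load(int node, int parent_count) {
        Entry e{State::Wait, parent_count};
        for (auto it = pending_.begin(); it != pending_.end();) {
            if (*it == node) { --e.parent_count; it = pending_.erase(it); }
            else ++it;
        }
        if (e.parent_count == 0) e.state = State::Ready;
        resident_[node] = e;
    }

    // Issue a Ready node for execution.
    void issue(int node) {
        Entry& e = resident_.at(node);
        assert(e.state == State::Ready);
        e.state = State::Processing;
    }

    // Mark a node Done and notify its children: resident children are updated
    // in place; absent children get an entry in the pending update buffer.
    void complete(int node, const std::vector<int>& children) {
        Entry& e = resident_.at(node);
        assert(e.state == State::Processing);
        e.state = State::Done;
        for (int c : children) {
            auto it = resident_.find(c);
            if (it == resident_.end()) { pending_.push_back(c); continue; }
            if (--it->second.parent_count == 0) it->second.state = State::Ready;
        }
    }

    State state(int node) const { return resident_.at(node).state; }
    const std::vector<int>& pending() const { return pending_; }

private:
    std::unordered_map<int, Entry> resident_;  // nodes currently buffered
    std::vector<int> pending_;                 // child IDs awaiting a decrement
};
```

Replaying the examples: after nodes 0 to 3 are loaded and node 0 completes, nodes 1 and 2 become Ready. Completing node 2 parks #4 and #5, completing node 1 makes node 3 Ready and parks another #4, and loading nodes 4 (two parents) and 5 (one parent) drains the pending entries so that both become Ready.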


Challenges

• Minimizing global memory usage
  • Used CSR format

• Minimizing the buffer size
  • Limit Level Range
  • Local Node/Edge Array size


Level Range

• Using the baseline TB scheduling policy (LRR) may lead to unbalanced execution.

[Figure: sample benchmark (HEAT2D) with the LRR scheduler]

• Unbounded level range means:
  • Larger DGB is required
  • Limiting TB execution

Key challenge: efficient scheduling


Level-bound Scheduling (LVL)

• Prioritizing lower-level thread blocks in the graph
• More ready nodes → more parallelism
• Minimizing the buffering operation
• Limiting the level range to avoid serialization
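One possible reading of level-bound scheduling is sketched below. The slides do not give the exact selection rule, so the policy here is an assumption: among ready nodes, prefer the lowest level, and skip any node whose level is `level_bound` or more above the lowest unfinished level. `ReadyNode` and `pick_next` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct ReadyNode {
    int id;
    int level;
};

// Pick the next thread block under an assumed level-bound policy. Returns the
// chosen node ID, or -1 if no ready node lies within the allowed level range.
int pick_next(const std::vector<ReadyNode>& ready,
              int lowest_unfinished_level, int level_bound) {
    assert(level_bound > 0);
    const ReadyNode* best = nullptr;
    for (const ReadyNode& n : ready) {
        // Skip nodes that would run too far ahead of the slowest level.
        if (n.level - lowest_unfinished_level >= level_bound) continue;
        // Otherwise prefer the lowest level; ties keep the earlier entry.
        if (best == nullptr || n.level < best->level) best = &n;
    }
    return best == nullptr ? -1 : best->id;
}
```

Bounding how far execution may run ahead keeps the set of live levels small, which is what allows a small Dependency Graph Buffer, while preferring low levels releases more children early.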


Local Node Array Size

• Empirical estimation used
  • Reduce size
  • Until performance suffers

[Figure: IPC and maximum update buffer size (entries) versus Local Node Array size (64 to 640 entries); series LRR_PUB, LVL_PUB, LRR_IPC, LVL_IPC]

• Size chosen: 128 entries

• LVL saves 64% PUB size


Local Edge Array Size







• Empirical estimation used
• LVL requires 75% less storage
• 256 entries


Evaluation

• Evaluation platform
  • GPGPU-Sim v3.2.2 (GTX 480)
  • Six data dependency-heavy benchmarks

• Cases
  • Global, CDP
  • DepLinks primitives
  • LRR and LVL
    • LVL = 3


Performance Breakdown

[Figure: normalized speedup at 4K graph size for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and the average; series Global, CDP, DepLinks, LRR, LVL]

• NOT accounting for data transfer time
• Theoretical value based on [1]
• Barriers enforced by DepLinks instead; kernel launch overhead removed
• Barriers removed. Nodes can now run ahead.
• Level-bound TB scheduling

[1] Jin Wang, Norm Rubin, Albert Sidelnik, and Sudhakar Yalamanchili. 2016. Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs.


Performance

• Speedup across different graph sizes

[Figure: overall speedup (LVL) for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and GeoMean at graph sizes 1K, 4K, and 9K; callouts: +45% and +65%]


Evaluation Summary

• 2 KB area overhead

• No significant impact on L2 miss rate

• Low global memory request overhead
  • 0.13% average


Conclusion

• Presenting Wireframe, hardware support for GPU data dependency

• Supporting generalized inter-block dependencies through hardware

• Minimizing buffering through level-bound TB scheduling

• 45% average speedup over the baseline


Thank you!

Questions?


Computations vs Launch Overhead

• With a constant data size
  • Kernel launches increase with graph size

• … is still sizable at 9K nodes.
  • … times on average

[Figure: Comp/Launch ratio for DTW, HIST, SOR, and GeoMean at graph sizes 1K, 4K, and 9K]


Performance

• Impact on L2: ~0.5%


Performance (IPC)

