WIREFRAME: Supporting Data-dependent Parallelism in GPUs (MICRO-50)
WIREFRAME: Supporting Data-dependent Parallelism through
Dependency Graph Execution in GPUs
AmirAli Abdolrashidi†, Devashree Tripathy†, Mehmet E. Belviranli‡,
Laxmi N. Bhuyan†, Daniel Wong†
†University of California Riverside
‡Oak Ridge National Laboratory
Introduction
Motivation
• Despite the support for parallelism, GPUs lack support for data-dependent parallelism.
Example: Wavefront Pattern
[Figure: thread blocks executed wave by wave, with a barrier after each wave, … until the application ends]
Example
Global Barriers (Original)
for i = 1 to nWave:
  - Kernel Launch
  - Synchronize
• Enormous host-side kernel launch overhead!
• Waiting on non-parent thread blocks
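
To make the cost of the global-barrier loop concrete, the following is a minimal host-side C++ sketch that counts the waves, and therefore the launch-plus-synchronize iterations, for a 2D grid of thread blocks. It assumes the two-parent pattern used in the deck's pseudo-code, where block (X, Y) depends on (X-1, Y) and (X, Y-1). The function names are hypothetical and are not part of Wireframe or CUDA.

```cpp
#include <vector>

// Wave index of thread block (x, y) when each block depends on
// (x-1, y) and (x, y-1): all of its ancestors lie on earlier anti-diagonals.
int waveOf(int x, int y) { return x + y; }

// Number of waves, and therefore of launch-plus-synchronize iterations
// in the global-barrier scheme, for a gridX-by-gridY grid of thread blocks.
int numWaves(int gridX, int gridY) { return gridX + gridY - 1; }

// Number of thread blocks that can run concurrently in each wave.
std::vector<int> blocksPerWave(int gridX, int gridY) {
    std::vector<int> counts(numWaves(gridX, gridY), 0);
    for (int y = 0; y < gridY; ++y)
        for (int x = 0; x < gridX; ++x)
            ++counts[waveOf(x, y)];
    return counts;
}
```

For a 4×4 grid, `numWaves(4, 4)` is 7, so the host would issue seven kernel launches, and the first and last waves each contain a single thread block.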
Example
CDP (Nested)
RUN:
  - Parent Kernel Launch
  - Synchronize
Parent Kernel:
  for i = 1 to nWaves:
    - Child Kernel Launch
    - Synchronize
[Figure: Kernel Execution Pattern]
• No more host-side kernel launch
• Device-side kernel launch still has significant overhead
• NO multi-parent dependency support
• Still NO general dependency support!
Motivation
• There is a need for generalized support for finer-grain inter-block data dependencies, for better performance and efficiency.
[Figure labels: Intra-Block, Global, Inter-Block; Thread, Thread Block, Barrier]
• Current limitations
  • High device-side kernel launch overhead
  • No general inter-block data dependency support
Wireframe Overview
Programming Model:
// Parents of the current thread block (no trailing semicolon, so the macros can be used as arguments).
#define parent1 dim3(blockIdx.x-1, blockIdx.y, blockIdx.z)
#define parent2 dim3(blockIdx.x, blockIdx.y-1, blockIdx.z)

// Dependency graph generation function assigned to the kernel.
void* DepLink() {
    if (blockIdx.x > 0) WF::AddDependency(parent1);
    if (blockIdx.y > 0) WF::AddDependency(parent2);
}

int main() {
    // A single launch; DepLink is passed as the third launch parameter.
    kernel<<<GridSize, BlockSize, DepLink>>>(0, args);
}

__WF__ void kernel(args) {
    processWave();
}
[Figure: Wireframe overview]
Host (CPU): Programming Model, Dependency Graph, Convert to CSR (Node Array, Edge Array)
Device (GPU):
  Global Memory: Global Node Array, Global Edge Array
  DATS Hardware (Dependency Graph Buffer): Local Node Array, Local Edge Array, Pending Update Buffer, Node Insertion Buffer
  TB Scheduler
Programming Model
• New functions are needed to support dependency in CUDA
  • Add dependency
  • Policy settings
• Proposing DepLinks model
  • Would assign a dependency graph generation function to a kernel
  • Easy to learn and use
Wireframe Pseudo-code
parent1 := (X-1, Y)
parent2 := (X, Y-1)
RUN:
  - Kernel Launch (DepLinks)
DepLinks @ BLOCK(X, Y):
  - AddDependency(parent1)
  - AddDependency(parent2)
One kernel launch!
[Figure: 4×4 grid of thread blocks numbered 0 to 15, with origin (0,0) and axes X and Y, drawn as waves 0 | 4 1 | 8 5 2 | 12 9 6 3 | 13 10 7 | 14 11 | 15. Example: block 6 has parent1 = 5 and parent2 = 2.]
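
As an illustration of what the DepLinks function describes, here is a small host-side C++ sketch that enumerates the parents of every block in a 2D grid. It is only an emulation under stated assumptions: `depLinks` is a hypothetical helper, not the Wireframe API, and the linear ID `y * gridX + x` is inferred from the slide's example in which block 6 has parents 5 and 2.

```cpp
#include <vector>

// Hypothetical host-side stand-in for the DepLinks function: returns, for
// every block of a gridX-by-gridY grid, the linear IDs of its parents.
// Linear ID = y * gridX + x (assumed from the slide's block 6 example).
std::vector<std::vector<int>> depLinks(int gridX, int gridY) {
    std::vector<std::vector<int>> parents(gridX * gridY);
    for (int y = 0; y < gridY; ++y) {
        for (int x = 0; x < gridX; ++x) {
            int id = y * gridX + x;
            if (x > 0) parents[id].push_back(y * gridX + (x - 1)); // parent1 = (X-1, Y)
            if (y > 0) parents[id].push_back((y - 1) * gridX + x); // parent2 = (X, Y-1)
        }
    }
    return parents;
}
```

With `depLinks(4, 4)`, block 0 has no parents and block 6 has parents 5 and 2, which matches the example drawn on the slide.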
Dependency Graph
• Parent count and level of every node determined at runtime
• Sent to the GPU's global memory
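
The per-node annotations mentioned above can be sketched in a few lines of host-side C++. This is a sketch under assumptions rather than the authors' implementation: the level of a node is taken to be one more than the highest level among its parents, and node IDs are assumed to already be in topological order. `NodeInfo` and `annotate` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct NodeInfo {
    int parentCount = 0; // number of incoming dependency edges
    int level = 0;       // assumed: 1 + highest level among the parents
};

// children[i] lists the nodes that depend on node i. Assumes every edge goes
// from a lower ID to a higher ID, so a parent's level is final when visited.
std::vector<NodeInfo> annotate(const std::vector<std::vector<int>>& children) {
    std::vector<NodeInfo> info(children.size());
    for (std::size_t parent = 0; parent < children.size(); ++parent) {
        for (int child : children[parent]) {
            ++info[child].parentCount;
            info[child].level = std::max(info[child].level, info[parent].level + 1);
        }
    }
    return info;
}
```

Applied to the small graph used in the later DATS examples (0 → 1, 2; 1 → 3, 4; 2 → 4, 5; 3 → 6), this yields parent counts 0, 1, 1, 1, 2, 1 and levels 0, 1, 1, 2, 2, 2 for nodes 0 to 5, which agrees with the values shown in the node state tables.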
Node Renaming
• To minimize data level range in the buffers
[Figure: nodes grouped by level, Level 0 to Level 5]
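
The slide does not spell out the renaming rule. One plausible reading, shown here purely as an assumption, is to renumber nodes in level order so that consecutive IDs belong to the same or adjacent levels. `renameByLevel` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns newId[oldId]: nodes are renumbered in non-decreasing level order,
// keeping the original relative order within a level (stable sort).
std::vector<int> renameByLevel(const std::vector<int>& level) {
    std::vector<int> order(level.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return level[a] < level[b]; });
    std::vector<int> newId(level.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank)
        newId[order[rank]] = static_cast<int>(rank);
    return newId;
}
```

For a 3×3 wavefront grid with row-major IDs, the levels are 0, 1, 2, 1, 2, 3, 2, 3, 4, and the renaming places block 3 (level 1) before block 2 (level 2).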
Dependency-Aware TB Scheduler (DATS)
• Thread block scheduler
  • Issues the relevant thread block at the time for execution based on the dependency graph
• Dependency Graph Buffer (DGB)
  • Cache data from global memory
  • Challenge: Efficient caching and data utilization
Dependency-Aware TB Scheduler (DATS)
• Data stored in compressed sparse row (CSR) format
  • To reduce memory usage
• Thread blocks → Node Array
• Dependencies → Edge Array
• Space complexity
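
A minimal C++ sketch of the CSR conversion follows. It stores one offset per node plus one entry per edge, which is the usual CSR layout; the struct and function names are hypothetical, and the extra end sentinel in the node array is a common convention assumed here rather than something the slides state.

```cpp
#include <vector>

struct Csr {
    std::vector<int> nodeArray; // nodeArray[i] = index of node i's first edge; last entry = edge count
    std::vector<int> edgeArray; // child IDs, grouped by parent
};

// Flattens per-node child lists into the Node Array / Edge Array pair.
Csr toCsr(const std::vector<std::vector<int>>& children) {
    Csr csr;
    csr.nodeArray.reserve(children.size() + 1);
    for (const auto& list : children) {
        csr.nodeArray.push_back(static_cast<int>(csr.edgeArray.size()));
        csr.edgeArray.insert(csr.edgeArray.end(), list.begin(), list.end());
    }
    csr.nodeArray.push_back(static_cast<int>(csr.edgeArray.size())); // end sentinel
    return csr;
}
```

For the child lists 0 → {1, 2}, 1 → {3, 4}, 2 → {4, 5}, 3 → {6}, the node array starts 0, 2, 4, 6 and the edge array is 1, 2, 3, 4, 4, 5, 6, the same values that appear in the node state table example.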
DATS Overview
[Figure: DATS data structures]
Global memory:
  Global Node Array (Global Edge Start for Global Node IDs 0 to 8): 0 2 4 6 7 9 11 12 14
  Global Edge Array (edge indices 0 to 14): 1 2 3 4 4 5 6 6 7 7 8 9 8 9 9
Node entry fields: Edge Start (32 bits), Global Node ID (16 bits), Parent Counter (16 bits), Level (16 bits); Base Pointer
Translation of Global Edge Start to Local Edge Start: $\mathit{LES}_i = \mathit{GES}_i \bmod |\mathit{EdgeArray}|$
Dependency Graph Buffer (DGB): Local Node Array (Local Edge Start, Global Node ID), Local Edge Array (circular buffer, with head pointer H), Pending Update Buffer, Node Insertion Buffer
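
The modulo translation can be illustrated with a short C++ sketch of a circular Local Edge Array. It assumes that the divisor in the slide's formula is the number of entries in the local array; `localEdgeStart` and `loadEdges` are hypothetical names, and the buffer size of 8 used in the usage note is an arbitrary illustration value.

```cpp
#include <vector>

// Maps a Global Edge Start offset to its slot in a circular Local Edge Array.
// `capacity` is the number of entries in the local array (assumed > 0).
int localEdgeStart(int globalEdgeStart, int capacity) {
    return globalEdgeStart % capacity;
}

// Copies the edges [begin, end) of the global array into the circular local
// array, wrapping around at the end of the buffer.
void loadEdges(const std::vector<int>& globalEdges, int begin, int end,
               std::vector<int>& localEdges) {
    const int capacity = static_cast<int>(localEdges.size());
    for (int g = begin; g < end; ++g)
        localEdges[localEdgeStart(g, capacity)] = globalEdges[g];
}
```

With an 8-entry local array, global offsets 6 and 7 land in slots 6 and 7, while offsets 8 to 10 wrap around to slots 0 to 2.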
Node State Table
State           R  W  W  W
Parent Count    0  1  1  1
Level           0  1  1  2
Global Node ID  0  1  2  3
States: Wait (W), Ready (R), Processing (P), Done (D)
Local Node Array (Local Edge Start): 0 2 4 6
Local Edge Array (Tail to Head): 1 2 3 4 4 5 6
[Figure: dependency graph with nodes 0 to 8]
Example: Child Node Execution
[Animation: node 0 completes (D); the parent counts of its children, nodes 1 and 2, drop to 0 and both become Ready (R).]
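
The update step in this example can be sketched in C++ as follows. It is a software illustration only, assuming that all children of the finished node are resident in the buffer; the type and function names are hypothetical, and the real DATS logic is implemented in hardware.

```cpp
#include <vector>

enum class State { Wait, Ready, Processing, Done };

struct Node {
    State state;
    int parentCount;
};

// Marks `finished` as Done and notifies its children, found through the CSR
// node/edge arrays. A child whose last parent has finished becomes Ready.
void complete(int finished, std::vector<Node>& nodes,
              const std::vector<int>& nodeArray, const std::vector<int>& edgeArray) {
    nodes[finished].state = State::Done;
    for (int e = nodeArray[finished]; e < nodeArray[finished + 1]; ++e) {
        Node& child = nodes[edgeArray[e]];
        if (--child.parentCount == 0) child.state = State::Ready;
    }
}
```

Starting from the node state table above, completing node 0 makes nodes 1 and 2 Ready while node 3 keeps waiting for node 1.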
Example: Update Buffer Store
State           D  P  P  W
Parent Count    0  0  0  1
Level           0  1  1  2
Global Node ID  0  1  2  3
[Animation: node 2 completes (D); its children, nodes 4 and 5, are not in the DGB, so their updates (#4, #5) are stored in the Pending Update Buffer.]
Example: Invalidation…
State           D  P  D  W
Parent Count    0  0  0  1
Level           0  1  1  2
Global Node ID  0  1  2  3
Pending Update Buffer: #4 #5
[Animation: node 1 completes (D); its child node 3 reaches parent count 0 and becomes Ready (R), and another update for node 4 is stored in the Pending Update Buffer (#4 #5 #4).]
Enough space to load to DGB
Example: …Reloading data
State           W  W  D  R
Parent Count    2  1  0  0
Level           2  2  1  2
Global Node ID  4  5  2  3
Local Node Array (Local Edge Start): 0 2 - 6
Local Edge Array: 6 7 7 - - - 6
Pending Update Buffer: #4 #5 #4
Load complete!
Example: Update Buffer Load
State           W  W  D  P
Parent Count    2  1  0  0
Level           2  2  1  2
Global Node ID  4  5  2  3
Pending Update Buffer: #4 #5 #4
[Animation: the pending updates are drained from the Pending Update Buffer and applied to the newly loaded nodes 4 and 5, which become Ready (R) once their parent counts reach 0.]
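
The store and load steps of the Pending Update Buffer can be sketched together in C++. This is a simplified model under assumptions: it merges "load the node" and "apply its pending updates" into a single call, whereas the slides show them as separate steps, and `Dgb`, `notify`, and `load` are hypothetical names.

```cpp
#include <unordered_map>
#include <vector>

struct Dgb {
    std::unordered_map<int, int> parentCount; // resident nodes: global ID -> remaining parents
    std::vector<int> pendingUpdates;          // child IDs whose decrement is deferred

    // A parent of `child` finished: decrement now if resident, otherwise defer.
    void notify(int child) {
        auto it = parentCount.find(child);
        if (it != parentCount.end()) --it->second;
        else pendingUpdates.push_back(child);
    }

    // Loads a node into the buffer and applies every deferred update for it.
    // Returns true if the node is ready to run (no parents left).
    bool load(int node, int initialParentCount) {
        int count = initialParentCount;
        std::vector<int> remaining;
        for (int id : pendingUpdates) {
            if (id == node) --count;
            else remaining.push_back(id);
        }
        pendingUpdates.swap(remaining);
        parentCount[node] = count;
        return count == 0;
    }
};
```

Replaying the slides' scenario, the notifications for nodes 4, 5, and 4 are deferred while those nodes are absent; loading node 4 with two parents and node 5 with one parent then leaves both with no remaining parents.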
Challenges
• Minimizing global memory usage
  • Used CSR format
• Minimizing the buffer size
  • Limit Level Range
  • Local Node/Edge Array size
Level Range
• Using the baseline TB scheduling policy (LRR) may lead to unbalanced execution.
[Figure: sample benchmark (HEAT2D) with the LRR scheduler]
• Unbounded level range means:
  • Larger DGB is required
  • Limiting TB execution
Key challenge: Efficient scheduling
Level-bound Scheduling (LVL)
• Prioritizing lower-level thread blocks in the graph
• More ready nodes → More parallelism
• Minimizing the buffering operation
• Limiting the level range to avoid serialization
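
One way to picture level-bound scheduling is the selection routine below, written as a C++ sketch. The exact policy is an assumption: it issues the Ready node with the lowest level, and only if that node is fewer than `levelBound` levels ahead of the lowest level that still has unfinished work. `TbNode` and `pickNext` are hypothetical names.

```cpp
#include <algorithm>
#include <climits>
#include <vector>

enum class TbState { Wait, Ready, Processing, Done };

struct TbNode {
    TbState state;
    int level;
};

// Picks the next thread block to issue: the Ready node with the lowest level,
// provided it is fewer than `levelBound` levels ahead of the lowest level that
// still has unfinished work. Returns -1 if nothing may be issued.
int pickNext(const std::vector<TbNode>& nodes, int levelBound) {
    int minUnfinished = INT_MAX;
    for (const TbNode& n : nodes)
        if (n.state != TbState::Done) minUnfinished = std::min(minUnfinished, n.level);

    int best = -1;
    for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
        const TbNode& n = nodes[i];
        if (n.state != TbState::Ready) continue;
        if (n.level - minUnfinished >= levelBound) continue; // too far ahead
        if (best == -1 || n.level < nodes[best].level) best = i;
    }
    return best;
}
```

Under this reading, a Ready node several levels ahead of a still-running low-level node is held back until the lower levels catch up, which keeps the range of levels that must be buffered small.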
Local Node Array Size
• Empirical estimation used
• Reduce size
  • Until performance suffers
• LVL saves 64% PUB size
[Figure: IPC (0 to 500) and max update buffer size (0 to 160 entries) versus Local Node Array size (64, 128, 192, 320, 512, 576, 640 entries); series LRR_PUB, LVL_PUB, LRR_IPC, LVL_IPC. Callouts: size chosen (128 entries); 64% PUB size reduction.]
Local Edge Array Size
• Empirical estimation used
• LVL requires 75% less storage
• 256 entries
[Figure callout: 75% Edge Array reduction]
Evaluation
• Evaluation platform
  • GPGPU-Sim v3.2.2 (GTX 480)
  • Six data dependency-heavy benchmarks
• Cases
  • Global, CDP
  • DepLinks primitives
  • LRR and LVL
    • LVL = 3
Performance Breakdown (4K Graph Size)
[Figure: normalized speedup (0.9 to 1.4) for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and Average; series Global, CDP, DepLinks, LRR, LVL]
• Global: NOT accounting for data transfer time
• CDP: Theoretical value based on [1]
• DepLinks: Barriers enforced by DepLinks instead; kernel launch overhead removed
• LRR: Barriers removed. Nodes can now run ahead.
• LVL: Level-bound TB scheduling
[1] Jin Wang, Norm Rubin, Albert Sidelnik, and Sudhakar Yalamanchili. 2016. Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs.
Performance
• Speedup across different graph sizes
[Figure: overall speedup with LVL (0.9 to 1.7) for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and GeoMean at graph sizes 1K, 4K, and 9K. Callouts: +45%, +65%.]
Evaluation Summary
• 2 KB area overhead
• No significant impact on L2 miss rate
• Low global memory request overhead
  • 0.13% average
Conclusion
• Presenting Wireframe, hardware support for GPU data dependency
• Supporting generalized inter-block dependencies through hardware
• Minimizing buffering through level-bound TB scheduling
• 45% average speedup improvement over the baseline
Thank you!
Questions?
Computations vs Launch Overhead
• With a constant data size
  • Kernel launches increase with graph size
• … is still sizable at 9K nodes.
  • … times on average
[Figure: Comp/Launch ratio (0 to 14) for DTW, HIST, SOR, and GeoMean at graph sizes 1K, 4K, and 9K]
Performance
• Impact on L2 ~0.5%
Performance (IPC)