Augmenting GPU Hardware through Software Innovations: A Reflection on a Dozen Years of Efforts

Xipeng Shen
Computer Science, North Carolina State University


Evolution of NVIDIA GPU

[Figure: memory bandwidth (GB/s, axis from 150 to 900) and compute throughput of NVIDIA GPUs, 2008 to 2018. Data labels: 228 GFlops, 1050, 14.8 TFlops. Generation labels: Pascal, Volta. Feature labels: CUDA, FP64, Dynamic Parallelism, Unified Memory, Mixed Precision, NVLink, 3D Memory, Tensor Cores. Adapted from NVIDIA's presentation.]

Tremendous Successes

[Image credit: NVIDIA]

Three “Dark Clouds” Persist over the GPU

• Irregular Mem & Control
• Non-Data Parallelism
• Preemptive Scheduling

Massive Parallelism

[Figure: a GPU with streaming multiprocessors (SMs) running thread blocks.]

A_kernel_function (…) { … [thread_ID] … … … }

Each thread: a little work, a few data accesses. But hundreds of thousands of them.

Great for Regular Data-Parallel Computing.

• “Cloud” 1: The SPMT model is not designed for other types of parallelism
  • E.g., pipeline parallelism, task parallelism
• “Cloud” 2: GPU remains sensitive to divergence in controls & accesses
  • Memory coalescing, thread divergence
• “Cloud” 3: Massive states to store and recover for preemption
  • Despite added hardware support

Our Explorations on GPU Performance
Compiler-based software solutions

Timeline:
  11/06  CUDA release
   9/07  NVIDIA LCPC talk
   5/09  IPDPS: cross-input adap. opt.
   6/10  ICS: remove thread diverg. dyn.
   3/11  ASPLOS: G-Streamline
  10/11  PACT: treat synch. correct. (GPU2CPU)
   6/12  ICS: syn. relax. & opt. (GPU2CPU)
   2/13  PPOPP: mem coalesc.
   9/13  PACT: NVM for GPU
  12/14  Micro: PORPLE
   5/15  HotOS: Co-run on Fused
   6/15  ICS: SM-centric
  12/15  Micro: Free Launch
   6/16  ICS: Multiview
   2/17  PPOPP: EffiSha
   5/17  IPDPS: Co-sched.
  10/17  Micro: VersaPipe
   4/19  ASPLOS: HiWayLib

Three Persistent “Dark Clouds” over the GPU

• Irregular Mem & Control
• Non-Data Parallelism
• Preemptive Scheduling

Pipeline Applications

• Network packet processing
• Image rendering
• Face detection
• Deep Neural Networks

High throughput & low latency both desired.

(Image credit: Whippletree, cs.adelaide.edu.au)

Kernel-by-Kernel

    if (accumulatedEnoughJob) {
        stage1_kernel(ManyJobs);
        stage2_kernel(ManyJobs);
        stage3_kernel(ManyJobs);
    }

Run-to-Completion

    if (accumulatedEnoughJob) {
        gpuKernel {
            stage1(oneJob);
            stage2(oneJob);
            stage3(oneJob);
        }
    }

Neither builds an actual pipeline on GPU.
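To make the contrast concrete, here is a minimal CPU-side sketch in Python (a model, not GPU code). The function names and the launch trace are assumptions made for illustration. Each list comprehension stands in for one kernel launch over many jobs.

```python
def kernel_by_kernel(jobs, stages):
    """One (simulated) kernel launch per stage.

    Every job finishes stage k before any job may start stage k+1.
    """
    data = list(jobs)
    trace = []
    for stage_id, stage in enumerate(stages):
        data = [stage(x) for x in data]  # stands in for stageK_kernel(ManyJobs)
        trace.append(("launch", stage_id, len(data)))
    return data, trace


def run_to_completion(jobs, stages):
    """A single (simulated) kernel launch.

    Each thread carries one job through all stages back to back.
    """
    def one_thread(x):
        for stage in stages:
            x = stage(x)
        return x

    results = [one_thread(x) for x in jobs]
    return results, [("launch", "all-stages", len(jobs))]
```

Both functions return the same results; they differ only in how many launches occur and in when a given job can leave the GPU. In neither model do different stages run concurrently on different items, which is what an actual pipeline would do.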

Is an actual pipeline possible on GPU?

It requires both spatial and temporal control of thread execution.

SM-centric Kernel + SW-based Scheduling

SM-Centric Kernel [ICS’15]

Thread-centric task model ⇒ SM-centric task model

original_kernel (the ith worker processes the ith task):

    taskID = f(workerID);
    processTask(taskID);

new_kernel (simplified; the ith task queue ↔ the ith SM):

    smID = getSMID();
    taskID = JobQ[smID].next();
    if (taskID != NULL)
        processTask(taskID);
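The change in task model can be sketched on the CPU as follows. This is a Python model under stated assumptions: `worker_sm_ids` is a hypothetical list giving the SM each worker happens to land on (what `getSMID()` would return), and `job_queues` plays the role of `JobQ`.

```python
from collections import deque


def thread_centric(num_workers, f, process_task):
    """original_kernel: worker i always processes task f(i)."""
    return [process_task(f(worker)) for worker in range(num_workers)]


def sm_centric(worker_sm_ids, job_queues, process_task):
    """new_kernel: each worker pulls its task from the queue of the SM it runs on."""
    done = {sm: [] for sm in job_queues}
    for sm in worker_sm_ids:           # sm models the value of getSMID()
        queue = job_queues[sm]
        if queue:                      # models `if (taskID != NULL)`
            done[sm].append(process_task(queue.popleft()))
    return done
```

Because a task is now tied to a queue rather than to a worker ID, software decides which SM runs which task by choosing the queue it enqueues to.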

New Exec. Models Enabled

Kernel-by-Kernel

    if (accumulatedEnoughJob) {
        stage1_kernel(ManyJobs);
        stage2_kernel(ManyJobs);
        stage3_kernel(ManyJobs);
    }

Run-to-Completion

    gpuKernel {
        stage1(oneJob);
        stage2(oneJob);
        stage3(oneJob);
    }

[Figure: chart with the labels “KbK” and “responsiveness”.]

VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU [Micro’17]

• Realize efficient pipeline executions
• Automatically configure execution models
• Screen complexity from programmers

https://github.com/JamesTheZ/VersaPipe

Speedups

[Figure: speedup chart with series labels “Megakernel” and “Default”. Megakernel: Whippletree [TOG 2014].]

Pipeline Beyond GPU

“HiWayLib: A Software Framework for Enabling High Performance Communications for Heterogeneous Pipeline Computations”
[Monday 11:30am, Data Movement II Session]

• Communication speed.
• Termination checks.
• Task queue contention.

Dynamic Parallelism

• The amount of parallelism is dynamic.
• Example: Breadth-First Search (BFS).

[Figure: BFS on a graph. Level 0: node 0. Level 1: three nodes. Level 2: five nodes, reached in parallel from the level-1 nodes (one has three unvisited neighbors, another has two).]

Parallelism is determined by the # of neighbors!

Many other applications: Graph Coloring,

Survey Propagation,

Shortest Path,

Minimal Spanning Tree,

Connected Components Labeling,
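
A small level-synchronous BFS shows how the available parallelism changes from level to level. This is a sequential Python sketch; the function name and the dictionary-based adjacency list are assumptions made for illustration, and the inner loop marks the work a GPU thread would have to spread over its node's neighbors.

```python
def bfs_frontier_sizes(adj, source):
    """Return (frontier size per level, level of each reached node)."""
    level = {source: 0}
    frontier = [source]
    sizes = []
    while frontier:
        sizes.append(len(frontier))
        next_frontier = []
        for u in frontier:        # on a GPU: one thread per frontier node
            for v in adj[u]:      # per-thread work depends on the node's degree
                if v not in level:
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return sizes, level
```

For a graph shaped like the one on the slide (a root with three neighbors, which in turn reach five more nodes), the frontier grows from 1 to 3 to 5 nodes, so a fixed thread-to-work mapping chosen at launch time fits poorly.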

Current Support on GPU

• Sub-kernel Launch (SKL)

    Main (…) {
        …
        A_GPU_kernel <<< … >>> (…);
    }

    A_GPU_kernel ( … ) {
        …
        subkernel <<< … >>> ( … );
    }

[Figure: the BFS graph, with nodes labeled by level: 0 at the root, 1 for its three neighbors, 2 for the five nodes below them.]

Large overhead of SKL

[Figure: bar chart of slowdown (X), axis from 0 to 22, for the benchmarks GC, BFS, MST_dfind2, SP, SSSP, MSTv, and their average.]

• Practices: Careful programming to use task lists etc.
  • Programming productivity lost
• HW extensions (e.g., [Kim+: Micro 2014], [Wang+: ISCA 2015])
  • Add HW complexities

Solution: Free Launch [Micro’15]

• A compiler-based program transformation.

[Diagram: “Program: SKL style” and “Program: Worklist style” on either side of “Free Launch: Source-to-Source Compiler”, which removes SKL and redistributes subtasks to parent threads. Goal: productivity & efficiency.]

Get the best of both worlds:

• Intuitive coding
• Low overhead
• Superb load balance

37X Speedup
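
The idea of redistributing subtasks to parent threads can be illustrated with a small Python model. It is only a sketch: the round-robin policy and both function names are assumptions for illustration, not the transformation the Free Launch compiler actually emits.

```python
def skl_launch_count(subtask_counts):
    """SKL style: every parent thread with work launches one child kernel."""
    return sum(1 for n in subtask_counts if n > 0)


def redistribute_to_parents(subtask_counts):
    """Launch-free style: spread all subtasks over the existing parent threads.

    subtask_counts[p] is the number of subtasks parent thread p would have
    handed to a child kernel. Returns, per parent thread, the list of
    (owner, subtask index) pairs it now processes itself.
    """
    num_threads = len(subtask_counts)
    subtasks = [(p, i) for p, n in enumerate(subtask_counts) for i in range(n)]
    plan = [[] for _ in range(num_threads)]
    for k, subtask in enumerate(subtasks):
        plan[k % num_threads].append(subtask)  # round-robin (an assumption)
    return plan
```

With subtask counts of 5, 0, 1, and 2, SKL style needs three child launches, while the redistributed plan needs none and gives each of the four parent threads exactly two subtasks.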

Three Persistent “Dark Clouds” over the GPU

• Irregular Mem & Control
• Non-Data Parallelism
• Preemptive Scheduling

Dynamic Irregularities

Memory:

    tid:   0 1 2 3 4 5 6 7
    ... = A[P[tid]];

    P[ ] = { 0, 1, 2, 3, 4, 5, 6, 7 }
    P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2 }

[Figure: the array A[ ] in memory, with a brace marking one memory segment.]

Control flow (thread divergence):

    tid:   0 1 2 3 4 5 6 7
    A[ ]:  2 4 10 0 6 0 0

    if (A[tid]) {...}
    for (i = 0; i < A[tid]; i++) {...}

Degrade throughput by up to (warp size) times. (Warp size = 32 in modern GPUs.)

A problem exists on all GPUs.
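
The two costs can be counted with a few lines of Python. This is a CPU-side model: the warp size of 4, the segment size of 4 elements, and the trip counts in the usage note are toy values chosen for illustration, and the function names are hypothetical.

```python
def transactions_per_warp(P, warp_size, seg_elems):
    """Count the distinct memory segments each warp touches for A[P[tid]]."""
    counts = []
    for start in range(0, len(P), warp_size):
        warp = P[start:start + warp_size]
        counts.append(len({index // seg_elems for index in warp}))
    return counts


def loop_steps_per_warp(trip_counts, warp_size):
    """For `for (i = 0; i < A[tid]; i++)`, a warp runs as long as its slowest thread."""
    return [max(trip_counts[start:start + warp_size])
            for start in range(0, len(trip_counts), warp_size)]
```

With the two index arrays from the slide, `transactions_per_warp([0, 5, 1, 7, 4, 3, 6, 2], 4, 4)` gives 2 segments per warp, while the identity order `[0, 1, 2, 3, 4, 5, 6, 7]` gives 1. For made-up trip counts `[1, 1, 1, 9, 2, 2, 2, 2]`, the first warp takes 9 steps although three of its four threads need only 1.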

Solution 1: Thread-Data Remapping

[Figure: two access patterns over memory segments. In one, a warp's accesses need 4 transactions per warp; in the other, they fall into a single memory segment and need 1 transaction per warp.]

Irregularity in a warp: problematic; across warps: okay!

Principle of solution: Turn intra-warp irregularity into regularity or into inter-warp irregularity.
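
One simple way to realize the principle is to reorder which thread handles which element so that neighboring threads touch neighboring data. The Python sketch below models this on the CPU; sorting by index is an illustrative choice rather than G-Streamline's actual algorithm, and the warp and segment sizes are toy values.

```python
def segments_per_warp(P, warp_size, seg_elems):
    """Count the distinct memory segments each warp touches for A[P[tid]]."""
    return [len({i // seg_elems for i in P[s:s + warp_size]})
            for s in range(0, len(P), warp_size)]


def remap_threads(P):
    """Return a thread order in which consecutive threads read ascending indices."""
    return sorted(range(len(P)), key=lambda tid: P[tid])


def remapped_indices(P):
    """Index array seen by the warps after the remapping."""
    return [P[tid] for tid in remap_threads(P)]
```

For `P = [0, 5, 1, 7, 4, 3, 6, 2]` with 4 threads per warp and 4 elements per segment, each warp touches 2 segments before the remapping and 1 after it. The remapping itself costs time, which is the overhead the runtime framework has to hide.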

G-Streamline [ASPLOS’2011, PPOPP’13]

• First framework enabling runtime thread-data remapping.
• CPU-GPU pipeline to hide transformation overhead.
• Kernel splitting to resolve dependences.

1.08X to 2.5X speedups

Solution 2: Data Placement

[Diagram: data objects A, B, C, D in a program must each be placed (“?”) in one of the GPU's memories: global memory (L1/L2 cache, read-only cache), texture memory (texture cache), shared memory, or constant memory.]

3X performance difference

PORPLE: Portable Data Placement Engine [Micro’2014]

• Offer portability
• Offer input adaptivity
• Create placement-agnostic code

Observations

• Decent speedups on K20c, M2075, C1060.
  • From better use of texture mem, constant mem & cache.
• Still matter on new GPUs?

[Figure: “Variations of Data Placements in Memory” for SPMV; global mem is the baseline. From [PMBS’2018 by Bari et al.].]

Problem Shifts beyond one GPU

[Image: NVIDIA DGX-2]

Unified Memory

[Figure: logical view and physical view of unified memory, with results for three settings: Default, Tuning-1 [preferred], Tuning-2 [read-only]. Credit: NVIDIA.]

Three Persistent “Dark Clouds” over the GPU

• Irregular Mem & Control
• Non-Data Parallelism
• Preemptive Scheduling

[Diagram: two applications share one GPU. Application 1 runs Kernel A (long running, low priority); Application 2 submits Kernel B (high priority), which has to wait, wait, wait …]

A major issue for responsiveness, priority, and fairness.

Before Pascal: No preemption support.

EffiSha [PPOPP’2017]

First software-controlled preemptive scheduling of GPU kernels.

Execution Time

[Diagram: on the CPU, transformed GPU programs 1 to m send launching requests and notifications to the EffiSha runtime daemon, which applies the scheduling policies and (re)launches kernels. On the GPU, evictable kernels (kernel-1, kernel-2, then kernel-1 again) run in turn. An eviction flag connects the two sides through read and write accesses.]
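
The eviction-flag mechanism can be modeled in a few lines of Python. This is a CPU-side sketch under assumptions: the kernel polls the flag between tasks and reports the index to resume from, and `should_evict` is a hypothetical stand-in for reading the flag. The polling granularity in the real system may differ.

```python
def evictable_kernel(tasks, start, should_evict, process):
    """Process tasks[start:], polling the eviction flag before each task.

    Returns (index to resume from, results produced in this launch).
    The returned index equals len(tasks) when the kernel has finished.
    """
    results = []
    i = start
    while i < len(tasks):
        if should_evict(i):      # models a read of the eviction flag
            return i, results    # yield the GPU; progress is kept in i
        results.append(process(tasks[i]))
        i += 1
    return i, results
```

A scheduler that sets the flag once three tasks are done gets the GPU back promptly, and a later relaunch with `start=3` completes the remaining tasks without redoing finished work.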

Still a Problem?

Since Pascal: Instruction-level preemption becomes supported.

• Context size: 7MB Shared Mem & 20MB Registers on V100.
• Memory capacity limitation / performance interference.
• Unsafe isolation (e.g., leftover on shared memory).
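
The context-size figures follow from per-SM resources. The short calculation below assumes NVIDIA's published V100 numbers, which are not stated on the slide: 80 SMs, a 256 KB register file per SM, and up to 96 KB of shared memory per SM.

```python
KIB = 1024
MIB = 1024 * KIB


def total_mib(num_sms, bytes_per_sm):
    """Total on-chip state across all SMs, in MiB."""
    return num_sms * bytes_per_sm / MIB


# Assumed V100 figures (not from the slide): 80 SMs,
# 256 KiB registers per SM, up to 96 KiB shared memory per SM.
V100_SMS = 80
registers_mib = total_mib(V100_SMS, 256 * KIB)   # 20.0
shared_mib = total_mib(V100_SMS, 96 * KIB)       # 7.5
```

Under these assumptions a full preemption may have to save and restore about 27 MB of on-chip state, in line with the roughly 7 MB and 20 MB quoted above.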

Two More Clouds

• Irregular Mem & Control
• Non-Data Parallelism
• Preemptive Scheduling
• Reliability
• Security

Conclusion

• GPU changes fast; some challenges persist
  • SW support is essential
• Focusing on principled barriers gives lasting value
• Call for continuous innovations
  • Versatile support of parallelism
  • Portable support of memory opt. (intra- & inter-dev)
  • Efficient support of safe sharing
  • Reliable support of critical uses