S6357 Towards Efficient Communication Methods and Models for Scalable GPU-Centric Computing Systems
Holger Fröning, Computer Engineering Group, Ruprecht-Karls University of Heidelberg
GPU Technology Conference 2016

on-demand.gputechconf.com/gtc/2016/presentation/s6357...

Page 1:

S6357 Towards Efficient Communication Methods and Models for

Scalable GPU-Centric Computing Systems

Holger Fröning Computer Engineering Group

Ruprecht-Karls University of Heidelberg

GPU Technology Conference 2016

Page 2:

About us

• JProf. Dr. Holger Fröning
  • PI of Computer Engineering Group, ZITI, Ruprecht-Karls University of Heidelberg
  • http://www.ziti.uni-heidelberg.de/compeng

• Research: application-specific computing under hard power and energy constraints (HW/SW), future emerging technologies
  • High-performance computing (traditional and emerging)
  • GPU computing (heterogeneity & massive concurrency)
  • High-performance analytics (scalable graph computations)
  • Emerging technologies (approximate computing and stacked memory)
  • Reconfigurable logic (digital design and high-level synthesis)

• Current collaborations
  • Nvidia Research, US; University of Castilla-La Mancha, Albacete, Spain; CERN, Switzerland; SAP, Germany; Georgia Institute of Technology, US; Technical University of Valencia, Spain; TU Graz, Austria; various companies

!2

Page 3:

The Problem

• GPUs are powerful high-core-count devices, but only for in-core computations
• Many workloads cannot be satisfied by a single GPU
  • Technical computing, graph computations, data warehousing, molecular dynamics, quantum chemistry, particle physics, deep learning, spiking neural networks
• => Multi-GPU, at node level and at cluster level
  • Hybrid programming models

• While single GPUs are rather simple to program, interactions between multiple GPUs dramatically increase complexity!

• This talk: how good are GPUs at sourcing/sinking network traffic, how should one orchestrate communication, and what do we need for best performance and energy efficiency?

!3

Page 4:

Review: Messaging-based Communication

• Usually Send/Receive or Put/Get
  • MPI as de-facto standard

• Work request descriptors
  • Issued to the network device
  • Target node, source pointer, length, tag, communication method, ...
  • Irregular accesses, little concurrency

• Memory registration
  • OS & driver interactions

• Consistency by polling on completion notifications

!4
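The descriptor-based flow above can be sketched in a few lines of host-side C++. The field names and the ToyNic class are purely illustrative (this is not an actual IBVERBS layout); the point is the shape of the protocol: fill a work request, issue it to the device, then poll for a completion.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Illustrative work-request descriptor, mirroring the fields named on the
// slide: target node, source pointer, length, tag, communication method.
struct WorkRequest {
    uint16_t target_node;
    const void* src_ptr;
    uint32_t length;
    uint32_t tag;
    enum class Method : uint8_t { Send, Put, Get } method;
};

struct Completion {
    uint32_t tag;   // refers back to the posted work request
    bool success;
};

// Toy network device: posting a request eventually yields a completion
// (here immediately, for simplicity).
class ToyNic {
    std::queue<Completion> cq_;
public:
    void post(const WorkRequest& wr) { cq_.push({wr.tag, true}); }
    // Consistency by polling: returns false while no completion is pending.
    bool poll(Completion& out) {
        if (cq_.empty()) return false;
        out = cq_.front(); cq_.pop();
        return true;
    }
};
```

Note how poorly this maps onto thousands of GPU threads: descriptor construction is an irregular, serial task, and polling a single completion queue offers little concurrency.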

Page 5:

Lena Oden, Holger Fröning, Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU, International Journal of High Performance Computing Applications, Special Issue on Applications for the Heterogeneous Computing Era, Sage Publications, 2015.

Beyond CPU-centric communication

!5

[Figure: source node and target node, each with GPU, CPU and NIC behind a PCIe root, plus GPU memory and host memory. Annotations: start-up latency of 1.5 usec vs. start-up latency of 15 usec for GPU-controlled Put/Get (IBVERBS), marked 100x.]

“… a bad semantic match between communication primitives required by the application and those provided by the network.” - DOE Subcommittee Report, Top Ten Exascale Research Challenges, 02/10/2014.


GPUs rather incompatible with messaging:
• Constructing descriptors (work requests)
• Registering memory
• Polling
• Controlling networking devices

Page 6:

Communication orchestration - how to source and sink network traffic

Page 7:

Example application: 3D stencil code

• Himeno 3D stencil code
  • Solving a Poisson equation using 2D CTAs (marching planes)
  • Multiple iterations using iterative kernel launches
• Multi-GPU: inter-block and inter-GPU dependencies
  • Dependencies => communication
  • Inter-block: device synchronization required among adjacent CTAs
  • Inter-GPU: all CTAs participate in communication (sourcing and sinking) => device synchronization required

!7

Control flow using CPU-controlled communication

3D Himeno stencil code with 2D CTAs
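The CPU-controlled control flow can be sketched structurally as a host loop; the functions below are stubs standing in for the real CUDA launches and transfers (not the Himeno source), so this is a sketch of the orchestration pattern only.

```cpp
#include <cassert>

// Placeholder stubs standing in for the real CUDA / communication calls.
static int kernels_launched = 0, syncs = 0, halos_exchanged = 0;
void launch_stencil_kernel() { ++kernels_launched; } // one marching-planes sweep
void device_synchronize()    { ++syncs; }            // wait for all CTAs
void exchange_halos()        { ++halos_exchanged; }  // CPU-controlled D2H / net / H2D

// CPU-controlled orchestration: every iteration ends in a full device
// synchronization before boundary planes are exchanged between GPUs.
void run_stencil(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        launch_stencil_kernel();
        device_synchronize();  // inter-block dependencies resolved here
        exchange_halos();      // inter-GPU dependencies resolved here
    }
}
```

The alternatives on the next slide (in-kernel sync, stream sync, device sync) move parts of this loop onto the GPU, trading CPU involvement for intra-GPU synchronization cost.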


Page 8:

Different forms of communication control for an example stencil code

!8

Control flow using in-kernel synchronization

Control flow using stream synchronization (with/without nested parallelism)

Page 9:

Performance comparison - execution time

!9

0"0,1"0,2"0,3"0,4"0,5"0,6"0,7"0,8"0,9"1"

256x256x256"

256x256x512"

256x256x1024"

512x512x256"

512x512x512"

512x512x640"

640x640x128"

640x640x256"

640x640x386"

Perfom

ance**

Rela-v

e*to*hybrid

*app

roach*

in0kernel0sync" stream0sync" device0sync"

• CPU-controlled still fastest
  • Backed up by previous experiments
• In-kernel synchronization slowest
  • Communication overhead increases with problem size: more CTAs, more device synchronization
  • ~28% of all instructions have to be replayed, likely due to serialization (use of atomics)
• Stream synchronization a good option
  • Difference to device synchronization is the overhead of nested parallelism
• Device synchronization most flexible regarding control flow
  • Communication as device function or as independent kernel
  • Flexibility in kernel launch configuration

Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.

Page 10:

Performance comparison - energy consumption

!10

0"

0,2"

0,4"

0,6"

0,8"

1"

1,2"

256x256x256"

256x256x512"

256x256x1024"

512x512x256"

512x512x512"

512x512x640"

640x640x128"

640x640x256"

640x640x386"

Energy'con

sump.

on'

Rela.v

e'to'hybrid

'app

roach'

stream2sync" device2sync" in2kernel2sync"

• Benefits for stream/device synchronization as the CPU is put into sleep mode
  • 10% less energy consumption
  • CPU: 20-25W saved

• In-kernel synchronization saves much more total power, but the increase in execution time results in higher energy consumption
  • Likely bad GPU utilization

Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.

Page 11:

Communication orchestration - takeaways

• CPU-controlled communication is still fastest - independent of different orchestration optimizations

• GPU-controlled communication: intra-GPU synchronization between the individual CTAs is most important for performance
  • Stream synchronization most promising
  • Otherwise replay overhead due to serialization

• Dedicated communication kernels or functions are highly recommended
  • Either device functions of a master kernel (nested parallelism), or communication kernels in the same stream (issued by CPU)

• Bypassing CPUs has substantial energy advantages
  • Decrease polling rates, or use interrupt-based CUDA events!
  • More room for optimizations left

!11

while( cudaStreamQuery(stream) == cudaErrorNotReady ) usleep(sleeptime);
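The sleep-based polling loop above generalizes to any query function. A hedged C++ sketch of reduced-rate polling, where `query` stands in for a check like `cudaStreamQuery(stream) != cudaErrorNotReady` and the sleep lets the CPU drop into low-power states instead of busy-waiting:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Generic reduced-rate polling: instead of spinning on the query (which
// keeps a CPU core at full power), sleep between probes. 'query' returning
// true means the stream/event has completed.
int poll_with_sleep(const std::function<bool()>& query,
                    std::chrono::microseconds sleep_time) {
    int probes = 0;
    while (!query()) {
        ++probes;
        std::this_thread::sleep_for(sleep_time); // CPU may enter sleep states
    }
    return probes; // number of unsuccessful probes before completion
}
```

The sleep interval trades wake-up latency against CPU power; interrupt-based CUDA events remove the trade-off entirely by blocking until completion.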

Page 12:

GGAS: Fast GPU-controlled traffic sourcing and sinking

Page 13:

GGAS – Global GPU Address Spaces

• Forwarding load/store operations to global addresses
  • Address translation and target identification
  • Special hardware support required (NIC)

• Severe limitations for full coherence and strong consistency
  • Well known for CPU-based distributed shared memory
  • Reverting to highly relaxed consistency models can be a solution

!13

Holger Fröning and Heiner Litz, Efficient Hardware Support for the Partitioned Global Address Space, 10th Workshop on Communication Architecture for Clusters (CAC2010), co-located with 24th International Parallel and Distributed Processing Symposium (IPDPS 2010), April 19, 2010, Atlanta, Georgia.

Page 14:

!14

GGAS – thread-collaborative BSP-like communication

Lena Oden and Holger Fröning, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE International Conference on Cluster Computing 2013, September 23-27, 2013, Indianapolis, US.

[Figure: timeline for GPU 0 and GPU 1 (remote): computation, then communication using collective remote stores, a global barrier, then both GPUs continue computation.]

Page 15:

GGAS – current programming model using mailboxes

!15

<snip>
...
remMailbox[getProcess(index)][tid] = data[tid];
__threadfence_system(); // memory fence
remoteEndFlag[getProcess(index)][0] = 1;
__ggas_barrier();
...
<snip>
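The mailbox handshake above (store payload, fence, raise flag) can be modeled on the host with two threads and C++11 atomics; the release/acquire pair plays the role of `__threadfence_system()` plus the flag store. This is a model of the protocol, not GGAS itself.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host-side model of the GGAS mailbox handshake. The mailbox is plain
// memory; the flag carries release/acquire ordering, standing in for the
// system-wide memory fence in the GPU code.
std::vector<int> mailbox(4);
std::atomic<int> end_flag{0};

void sender() {
    for (int tid = 0; tid < 4; ++tid)
        mailbox[tid] = tid * 10;                  // remote stores, one per thread
    end_flag.store(1, std::memory_order_release); // fence, then raise the flag
}

int receiver_sum() {
    while (end_flag.load(std::memory_order_acquire) == 0) { /* spin */ }
    int sum = 0;
    for (int v : mailbox) sum += v;               // payload is now visible
    return sum;
}
```

The release store guarantees the receiver never observes the flag without also observing the mailbox contents, exactly the ordering the GPU-side fence provides.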

Page 16:

GGAS Prototype

!16

Remote load latency (Virtex-6): 1.44–1.9 usec (CPU/GPU)

[Figure: node #0 (source) issues loads/stores; node #1 (target) hosts the memory. Pipeline: source-local address -> target node determination, address calculation -> global address -> loss-less and in-order packet forwarding -> target-local address -> source tag management, address calculation, return route.]

• FPGA-based network prototype
  • Xilinx Virtex-6
  • 64-bit data paths, 156 MHz = 1.248 GB/s (theoretical peak)
  • PCIe G1/G2
  • 4 network links (torus topology)

Page 17:

GGAS – Microbenchmarking

• GPU-to-GPU streaming
  • Prototype system consisting of Nvidia K20c & dual Intel Xeon E5
  • Relative results applicable to technology-related performance improvements

• MPI
  • CPU-controlled: D2H, MPI send/recv, H2D

• GGAS
  • GPU-controlled: GDDR to GDDR, remote stores

• RMA: Remote Memory Access
  • Put/Get-based, CPU-to-CPU (host) resp. GPU-to-GPU (direct)

!17

[Chart annotations: GGAS latency starting at 1.9 usec; P2P PCIe issue.]

Page 18:

Allreduce – Power and Energy analysis

Lena Oden, Benjamin Klenk and Holger Fröning, Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs, 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2014), May 26-29, 2014, Chicago, IL, US.

!18

For this case: 50% of the energy saved

[Figure: power/energy measurements for GGAS vs. MPI.]

Page 19:

Analyzing Communication Models for Thread-parallel Processors

Page 20:

Communication Models

• MPI
  • CPU-controlled
  • De-facto standard, widely used, heavily optimized (for CPUs)

• GGAS (using mailboxes with send/receive semantics)
  • GPU-controlled
  • Communication by forwarding load/store operations using global address spaces
  • Completely in line with the GPU execution model => highly thread-parallel
  • Main drawback is reduced overlap

• RMA: Remote Memory Access
  • GPU-controlled
  • Put/Get operations of the custom interconnect
  • Communication engine designed for HPC
  • GPUs have to construct/interpret descriptors (which was very crucial for the IBVERBS experiment)

!20

Page 21:

Completing GGAS with Put/Get operations

• Descriptor-based Put/Get operations
  • Completely asynchronous

• Work request with
  • Type
  • Local/remote pointers
  • Credentials
  • Notification requests

• Notification
  • Completion information with reference to the work request

• Key is a simple descriptor format

!21

Page 22:

[Figure: custom descriptor layout across 6 doublewords. Fields: Read Address (64 bit), Write Address (64 bit), Destination Node (16 bit), Destination VPID (8 bit), Payload Size (23 bit), CMD (4 bit), NOTI (3 bit), D.Len (2 bit), plus flag bits (ERA, NTR, MC, EWA, RSV).]

Put/Get: Different Work Request Queue Implementations

• Descriptor format and queue organization matter
  • Explicit/implicit trigger, conversion effort, descriptor complexity

• FPGA implementation: we could even change the descriptor format

!22

IBVERBS descriptor format Custom descriptor format
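A simple descriptor format of this kind can be approximated as a C++ struct with bit-fields; the widths follow the figure (64-bit read/write addresses, 16-bit destination node, 23-bit payload size, 4-bit CMD, 3-bit NOTI, 8-bit VPID), but the packing here is illustrative rather than the real wire format.

```cpp
#include <cassert>
#include <cstdint>

// Approximate rendering of the custom work-request descriptor shown on
// the slide. Bit widths follow the figure; the exact wire layout of the
// real FPGA format is not reproduced here.
struct Descriptor {
    uint64_t read_addr;          // source of a Put / target of a Get
    uint64_t write_addr;         // where the payload lands
    uint32_t payload_size : 23;  // payload size in bytes (up to 8 MiB - 1)
    uint32_t cmd          : 4;   // command: Put, Get, ...
    uint32_t noti         : 3;   // notification request bits
    uint16_t dest_node;          // destination node id
    uint8_t  dest_vpid;          // destination virtual process id
};
```

A flat, fixed-size layout like this is what makes GPU-side descriptor construction cheap: every thread-collaborative write lands at a statically known offset, with no pointer chasing or variable-length encoding.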

Page 23:

Example applications

• Testing is hard as applications have to be re-written for GPU-centric communication

• Set of 4 workloads implemented:

!23

Page 24:

[Chart: performance normalized to MPI (0.0 to 3.0) for nbody_small, nbody_large, sum_small, sum_large, himeno and randomAccess, each on 2-12 nodes; series: GGAS, RMA.]

Performance comparison - execution time

!24

Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

• 2-12 nodes (each 2x Intel Ivy Bridge, Nvidia K20, FPGA network)
• Normalized to MPI: >1 = better performance, <1 = worse performance

Page 25:

Performance comparison - energy consumption

!25

Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

[Chart: energy consumption normalized to MPI (0.0 to 1.2) for NB-S, NB-L, sum-S, sum-L, RA, Himeno and the average; series: GGAS, RMA.]

• 12 nodes (each 2x Intel Ivy Bridge, Nvidia K20, Extoll FPGA)
• Normalized to MPI: <1 = better energy consumption, >1 = worse energy consumption

Page 26:

Performance comparison - observations

!26

Observations, marked X per applicable workload (N-Body / Himeno Stencil / Global Sum / RandomAccess):

• RMA and MPI offer a better exploitation of overlap possibilities (X X)
• GGAS performs outstandingly for small payloads, as no indirections are required such as context switches to the CPU or work-request issues to the NIC (X X X)
• The PCIe peer-to-peer read problem results in MPI performing better than RMA or GGAS for large payload sizes (X X)
• GGAS in combination with RMA outperforms MPI substantially (without the PCIe peer-to-peer read limitation) (X X X)
• In practice, the execution time has an essential influence on energy consumption (X X X)
• Accesses to host memory contribute significantly to DRAM and CPU socket power (X)
• Bypassing a component like a CPU can save enough power to compensate for a longer execution time, resulting in energy savings (X)
• Staging copies contribute significantly to both CPU and GPU power, due to involved software stacks respectively active DMA controllers (X)
• For irregular communication patterns or small payloads, GGAS saves both time and energy (X (X))

Page 27:

Related effort: Simplified multi-GPU programming

Page 28:

• Single-GPU programming based on CUDA/OpenCL is a prime example of a BSP execution model
  • Exposes large amounts of structured parallelism, rather easy to use

• Multi-GPU programming is becoming more important, but adds huge amounts of complexity
  • Processor aggregation easy
  • Memory aggregation challenging
  • Local/remote bandwidth disparity
  • UVA (Unified Virtual Addressing) works out-of-the-box -> poor performance

• Solution: use a compiler-based approach to automatically partition regular GPU code

Towards simplified multi-GPU programming

!28

Page 29:

GPUMekong

• Tool stack based on the LLVM tool chain
  • Code analysis
  • Automated decision making
  • Code transformations for automated partitioning and data movements

!29

• Regularity
  • Input/output data, dimensionality
• Addition of super-block ID
• Index modification
• Executed kernels
  • Iterative execution?
• Multi-device initialization
• Data distribution
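The super-block ID and index-modification steps can be illustrated with a toy helper: after partitioning, each device reconstructs the original global block index from its local one. This is a simplification of what the real LLVM transformations emit, and the helper names are invented for illustration.

```cpp
#include <cassert>

// Simplified model of compiler-based partitioning: the original kernel saw
// blocks 0..total_blocks-1; after partitioning across num_devices GPUs,
// device 'dev' runs one contiguous super-block of the original grid.
int blocks_per_device(int total_blocks, int num_devices) {
    return (total_blocks + num_devices - 1) / num_devices; // ceiling division
}

// Index modification: local block index -> global block index of the
// unpartitioned kernel, so the original indexing arithmetic stays valid.
int global_block_index(int dev, int local_block,
                       int total_blocks, int num_devices) {
    return dev * blocks_per_device(total_blocks, num_devices) + local_block;
}
```

Because the rewritten kernel computes indices exactly as the single-GPU version did, only the block-index base changes per device; the hard part left to the tool stack is deciding the data distribution that matches this partitioning.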

Page 30:

GPUMekong

• Funded by a Google Research Award
• Under heavy development
• https://sites.google.com/site/gpumekong
• Initial results for a 16-GPU matrix multiply

!30

Alexander Matz, Mark Hummel, Holger Fröning, Exploring LLVM Infrastructure for Simplified Multi-GPU Programming, Ninth International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2016), in conjunction with HiPEAC 2016, Prague, Czech Republic, Jan. 18, 2016.

[Charts: speedup over 2-8 GPUs and 4-16 GPUs for NxN matrix multiply with N = 4096, 8192, 12288, 16384, 20480, 24576.]

Page 31:

Wrapping up

Page 32:

Summary

• GPUs as first-class citizens in a peer networking environment, capable of sourcing and sinking traffic
  • Traditional messaging libraries and models poorly match the GPU's thread-collaborative execution model
  • GPUs also require fast communication paths with minimal latency / maximum message rate, combined with high-overlap paths like Put/Get

• However: the semantic gap between architecture and user is growing
  • We need automated tooling to close this gap

• Unified communication models
  • Automatically selecting the right communication method and path between heterogeneous computing units => flexibility in scheduling kernels
  • Multiple communication models will dramatically increase complexity, too

• CACM 03/2015, John Hennessy: it's the era of software!

!32

Specialized processors like GPUs require specialized communication models/methods

Page 33:

Thank you!

Credits

Contributions: Lena Oden (former PhD student), Benjamin Klenk (PhD student), Daniel Schlegel (graduate student), Günther Schindler (graduate student)

Discussions: Sudha Yalamanchili (Georgia Tech), Jeff Young (Georgia Tech), Larry Dennison (Nvidia), Hans Eberle (Nvidia)

Sponsoring: Nvidia, Xilinx, German Excellence Initiative, Google

Current main interactions

!33