Efficient Performance™
High-Performance GPU Clustering: GPUDirect RDMA over 40GbE iWARP
Tom Reu, Consulting Applications Engineer, Chelsio Communications, tomreu@chelsio.com
Chelsio Corporate Snapshot: Leader in High Speed Converged Ethernet Adapters
• Leading 10/40GbE adapter solution provider for servers and storage systems
  • ~800K ports shipped
• High performance protocol engine
  • 80 MPPS
  • 1.5 μsec
  • ~5M+ IOPs
• Feature rich solution
  • Media streaming hardware/software
  • WAN optimization, security, etc.
• Company facts
  • Founded in 2000
  • 150 strong staff
• R&D offices
  • USA – Sunnyvale
  • India – Bangalore
  • China – Shanghai
• Market coverage: Manufacturing, Oil and Gas, Finance, Service/Cloud, Storage, Media, HPC, Security
• OEM snapshot
RDMA Overview
• Direct memory-to-memory transfer
• All protocol processing handled by the NIC
  • Must be in hardware
• Protection handled by the NIC
  • User space access requires both local and remote enforcement
• Asynchronous communication model
  • Reduced host involvement
• Performance
  • Latency - polling
  • Throughput
• Efficiency
  • Zero copy
  • Kernel bypass (user space I/O)
  • CPU bypass

Performance and efficiency in return for a new communication paradigm

[Diagram: Chelsio T5 RNIC]
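The asynchronous model above can be sketched as a toy in Python. This is a hypothetical illustration of the queue-pair pattern (post work requests, let the NIC progress them, poll a completion queue), not a real verbs API:

```python
from collections import deque

class ToyQueuePair:
    """Toy model of the asynchronous RDMA communication model."""
    def __init__(self):
        self.send_queue = deque()        # work requests posted by the app
        self.completion_queue = deque()  # completions the app polls for

    def post_send(self, buffer: bytes):
        # Post and return immediately: no data copy, no blocking syscall.
        self.send_queue.append(buffer)

    def process(self):
        # Stand-in for what the RNIC does in hardware, independently of
        # the host CPU: drain the send queue and produce completions.
        while self.send_queue:
            buf = self.send_queue.popleft()
            self.completion_queue.append(("send complete", len(buf)))

    def poll_cq(self):
        # Low-latency completion detection by polling, as noted above.
        return self.completion_queue.popleft() if self.completion_queue else None

qp = ToyQueuePair()
qp.post_send(b"payload")   # returns immediately
qp.process()               # "hardware" makes progress on its own
print(qp.poll_cq())        # ('send complete', 7)
```

The point of the pattern is the reduced host involvement: the application never blocks per message, and the transfer itself (in real hardware) never touches the host CPU.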
iWARP: What is it?
• Provides the ability to do Remote Direct Memory Access over Ethernet using TCP/IP
• Uses well-known IB verbs
  • Inboxed in OFED since 2008
  • Runs on top of TCP/IP
• Chelsio implements the iWARP/TCP/IP stack in silicon
  • Cut-through send
  • Cut-through receive
• Benefits
  • Engineered to use "typical" Ethernet; no need for technologies like DCB or QCN
  • Natively routable, with multi-path support at Layer 3 (and Layer 2)
  • Runs on TCP/IP: mature, proven, and goes where TCP/IP goes (everywhere)
iWARP
• iWARP updates and enhancements are done by the IETF STORM (STORage Maintenance) working group
• RFCs
  • RFC 5040: A Remote Direct Memory Access Protocol Specification
  • RFC 5041: Direct Data Placement over Reliable Transports
  • RFC 5044: Marker PDU Aligned Framing for TCP Specification
  • RFC 6580: IANA Registries for the RDDP Protocols
  • RFC 6581: Enhanced RDMA Connection Establishment
  • RFC 7306: Remote Direct Memory Access (RDMA) Protocol Extensions
• Support from several vendors: Chelsio, Intel, QLogic
iWARP: Increasing Interest in iWARP as of Late
• Some use cases
  • High Performance Computing
  • SMB Direct
  • GPUDirect RDMA
  • NFS over RDMA
  • FreeBSD iWARP
  • Hadoop RDMA
  • Lustre RDMA
  • NVMe over RDMA fabrics
iWARP: Advantages over Other RDMA Transports
• It's Ethernet: well understood and administered
• Uses TCP/IP: mature and proven
• Supports rack, cluster, datacenter, LAN/MAN/WAN and wireless
• Compatible with SSL/TLS
• No need for bolt-on technologies like DCB or QCN
• Does not require a totally new network infrastructure
  • Reduces TCO and OpEx
iWARP vs RoCE

| iWARP | RoCE |
| --- | --- |
| Native TCP/IP over Ethernet, no different from NFS or HTTP | Difficult to install and configure: "needs a team of experts", Plug-and-Debug |
| Works with ANY Ethernet switches | Requires DCB: expensive equipment upgrade |
| Works with ALL Ethernet equipment | Poor interoperability: may not work with switches from different vendors |
| No need for special QoS or configuration: TRUE Plug-and-Play | Fixed QoS configuration: DCB must be set up identically across all switches |
| No need for special configuration, preserves network robustness | Easy to break: switch configuration can cause performance collapse |
| TCP/IP allows reach to Cloud scale | Does not scale: requires PFC, limited to a single subnet |
| No distance limitations; ideal for remote communication and HA | Short distance: PFC range is limited to a few hundred meters maximum |
| WAN routable, uses any IP infrastructure | RoCEv1 not routable; RoCEv2 requires lossless IP infrastructure and restricts router configuration |
| Standard for the whole stack has been stable for a decade | RoCEv2 incompatible with v1; more fixes to missing reliability and scalability layers required and expected |
| Transparent and open IETF standards process | Incomplete specification and opaque process |
Chelsio's T5: Single ASIC Does It All
• High performance purpose-built protocol processor
• Runs multiple protocols
  • TCP with stateless offload and full offload
  • UDP with stateless offload
  • iWARP
  • FCoE with offload
  • iSCSI with offload
• All of these protocols run on T5 with a SINGLE FIRMWARE IMAGE
  • No need to reinitialize the card for different uses
  • Future proof (e.g. support for NVMf) yet preserves today's investment in iSCSI
T5 ASIC Architecture
▪ Single processor data-flow pipelined architecture
▪ Up to 1M connections
▪ Concurrent multi-protocol operation
▪ Single connection at 40Gb, low latency
▪ High performance purpose-built protocol processor

[Block diagram: 1G/10G/40G and 100M/1G/10G MACs feed an embedded Layer 2 Ethernet switch with lookup, filtering and firewall; cut-through RX and TX memory; data-flow protocol engine; traffic manager; application co-processors (TX and RX); DMA engine; PCIe x8 Gen 3; general purpose processor; on-chip DRAM memory controller with optional external DDR3 memory]
Leading Unified Wire™ Architecture: Converged Network Architecture with All-in-One Adapter and Software
• Networking: 4x10GbE / 2x40GbE NIC, full protocol offload, Data Center Bridging, hardware firewall, wire analytics, DPDK/netmap
• HFT: WireDirect technology, ultra low latency, highest messages/sec, wire rate classification
• Storage: NVMe/Fabrics, SMB Direct, iSCSI and FCoE with T10-DIX, iSER and NFS over RDMA, pNFS (NFS 4.1) and Lustre, NAS offload, diskless boot, replication and failover
• Virtualization & Cloud: hypervisor offload, SR-IOV with embedded VEB, VEPA, VN-TAGs, VXLAN/NVGRE, NFV and SDN, OpenStack storage, Hadoop RDMA
• HPC: iWARP RDMA over Ethernet, GPUDirect RDMA, Lustre RDMA, pNFS (NFS 4.1), OpenMPI, MVAPICH
• Media Streaming: traffic management, video segmentation offload, large stream capacity

Single Qualification, Single SKU; Concurrent Multi-Protocol Operation
GPUDirect RDMA
• Introduced by NVIDIA with the Kepler class GPUs; available today on Tesla and Quadro GPUs as well
• Enables multiple GPUs, 3rd party network adapters, SSDs and other devices to read and write CUDA host and device memory
• Avoids unnecessary system memory copies and associated CPU overhead by copying data directly to and from pinned GPU memory
• One hardware limitation: the GPU and the network device MUST share the same upstream PCIe root complex
• Available with InfiniBand, RoCE, and now iWARP
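One way to sanity-check the root-complex constraint on Linux is to compare the sysfs device paths of the GPU and the adapter, since sysfs encodes the PCIe topology in the path itself. The helper below is an illustrative heuristic only (the example paths and the depth-5 cutoff are assumptions about a typical sysfs layout, not a vendor-provided tool):

```python
import os.path

def shares_upstream_root(dev_a: str, dev_b: str, depth: int = 5) -> bool:
    """Heuristic: two PCIe endpoints share an upstream root port if their
    sysfs paths agree down to the root-port component."""
    a = os.path.normpath(dev_a).split("/")
    b = os.path.normpath(dev_b).split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # depth 5 covers /sys/devices/pciDDDD:BB/DDDD:BB:DD.F, i.e. the
    # path components up to and including the root port (an assumption)
    return common >= depth

# Hypothetical example: GPU and RNIC behind the same root port
gpu  = "/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0"
rnic = "/sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0"
print(shares_upstream_root(gpu, rnic))  # True
```

If the two devices diverge before the root-port component, peer-to-peer transfers between them are not eligible for GPUDirect RDMA per the limitation above.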
GPUDirect RDMA: T5 iWARP RDMA over Ethernet Certified with NVIDIA GPUDirect
• Read/write GPU memory directly from the network adapter
• Peer-to-peer PCIe communication
  • Bypass host CPU
  • Bypass host memory
• Zero copy
• Ultra low latency
• Very high performance
• Scalable GPU pooling
• Any Ethernet network

[Diagram: on each host the RNIC exchanges payload packets directly with GPU memory across the LAN/datacenter/WAN, while only notifications pass through the host CPU and memory]
Modules Required for GPUDirect RDMA with iWARP
• Chelsio modules
  • cxgb4 - Chelsio adapter driver
  • iw_cxgb4 - Chelsio iWARP driver
  • rdma_ucm - RDMA user space connection manager
• NVIDIA modules
  • nvidia - NVIDIA driver
  • nvidia_uvm - NVIDIA Unified Memory
  • nv_peer_mem - NVIDIA peer memory
Case Studies
HOOMD-blue
• General purpose particle simulation toolkit
• Stands for: Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition
• Running on GPUDirect RDMA - WITH NO CHANGES TO THE CODE - AT ALL!
• More info: www.codeblue.umich.edu/hoomd-blue
HOOMD-blue Test Configuration
• 4 nodes, each with:
  • [email protected]
  • 64 GB RAM
  • Chelsio T580-CR 40Gb adapter
  • NVIDIA Tesla K80 (2 GPUs per card)
• RHEL 6.5
• OpenMPI 1.10.0
• OFED 3.18
• CUDA Toolkit 6.5
• HOOMD-blue v1.3.1-9
• Chelsio-GDR-1.0.0.0
• Command line:
  $MPI_HOME/bin/mpirun --allow-run-as-root -mca btl_openib_want_cuda_gdr 1 -np X -hostfile /root/hosts -mca btl openib,sm,self -mca btl_openib_if_include cxgb4_0:1 --mca btl_openib_cuda_rdma_limit 65538 -mca btl_openib_receive_queues P,131072,64 -x CUDA_VISIBLE_DEVICES=0,1 /root/hoomd-install/bin/hoomd ./bmark.py --mode=gpu|cpu
HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark
• Classic benchmark for general purpose MD simulations
• Representative of the performance HOOMD-blue achieves for straight pair potential simulations
HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark Results

Average timesteps per second (longer is better):

| Configuration | CPU only | GPU w/o GPUDirect RDMA | GPU w/ GPUDirect RDMA |
| --- | --- | --- | --- |
| Test 1: 2 CPU cores / 2 GPUs | 26 | 488 | 503 |
| Test 2: 8 CPU cores / 4 GPUs | 88 | 1,089 | 1,230 |
| Test 3: 40 CPU cores / 8 GPUs | 214 | 1,403 | 1,771 |
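The "hours to complete" view that follows is related to these throughput numbers by simple arithmetic: reading "10e6 steps" as 10 million timesteps, hours = steps / (timesteps per second) / 3600. A quick cross-check against the chart values:

```python
def hours_to_complete(tps: float, steps: int = 10_000_000) -> float:
    """Hours to run `steps` timesteps at `tps` average timesteps/second."""
    return steps / tps / 3600.0

# 2-GPU GPUDirect RDMA run: 503 timesteps/s
print(round(hours_to_complete(503), 1))  # 5.5

# 2-core CPU-only run: 26 timesteps/s (chart shows 108, same ballpark)
print(round(hours_to_complete(26)))      # 107
```

The small residuals against the charted hours suggest the bars were derived from unrounded measurements.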
HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark Results

Hours to complete 10e6 steps (shorter is better):

| Configuration | CPU only | GPU w/o GPUDirect RDMA | GPU w/ GPUDirect RDMA |
| --- | --- | --- | --- |
| Test 1: 2 CPU cores / 2 GPUs | 108 | 6 | 5.5 |
| Test 2: 8 CPU cores / 4 GPUs | 32 | 2.5 | 2.2 |
| Test 3: 40 CPU cores / 8 GPUs | 13 | 1.7 | 1.5 |
HOOMD-blue Quasicrystal Benchmark
• Runs a system of particles with an oscillatory pair potential that forms an icosahedral quasicrystal
• This model is used in the research article: Engel M, et al. (2015) Computational self-assembly of a one-component icosahedral quasicrystal, Nature Materials 14 (January), pp. 109-116
HOOMD-blue Quasicrystal Results

Average timesteps per second (longer is better):

| Configuration | CPU only | GPU w/o GPUDirect RDMA | GPU w/ GPUDirect RDMA |
| --- | --- | --- | --- |
| Test 1: 2 CPU cores / 2 GPUs | 11 | 308 | 407 |
| Test 2: 8 CPU cores / 4 GPUs | 31 | 656 | 728 |
| Test 3: 40 CPU cores / 8 GPUs | 43 | 915 | 1,158 |
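To quantify what GPUDirect RDMA buys in this workload, the relative gain per configuration can be computed from the average timesteps/s above (pairing the with/without values read off the chart):

```python
def gain_pct(with_gdr: float, without_gdr: float) -> float:
    """Percent throughput gain of GPUDirect RDMA over plain GPU runs."""
    return (with_gdr / without_gdr - 1.0) * 100.0

# (gpus, timesteps/s with GDR, timesteps/s without GDR)
for gpus, w, wo in [(2, 407, 308), (4, 728, 656), (8, 1158, 915)]:
    print(f"{gpus} GPUs: +{gain_pct(w, wo):.0f}%")
```

The gain is largest at 2 and 8 GPUs, consistent with the claim that bypassing host memory matters most when inter-node communication dominates the timestep.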
HOOMD-blue Quasicrystal Results

Hours to complete 10e6 steps (shorter is better):

| Configuration | CPU only | GPU w/o GPUDirect RDMA | GPU w/ GPUDirect RDMA |
| --- | --- | --- | --- |
| Test 1: 2 CPU cores / 2 GPUs | 264 | 9 | 7 |
| Test 2: 8 CPU cores / 4 GPUs | 86 | 4 | 3.5 |
| Test 3: 40 CPU cores / 8 GPUs | 63 | 3 | 2.4 |
Caffe Deep Learning Framework
• Open source deep learning software from the Berkeley Vision and Learning Center
• Updated to include CUDA support to utilize GPUs
• Standard version does NOT include MPI support
• MPI implementations
  • mpi-caffe: used to train a large network across a cluster of machines; model-parallel distributed approach
  • caffe-parallel: faster framework for deep learning; data-parallel via MPI, splits the training data across nodes
Summary: GPUDirect RDMA over 40GbE iWARP
• iWARP provides RDMA capabilities to an Ethernet network
• iWARP uses tried and true TCP/IP as its underlying transport mechanism
• Using iWARP does not require a whole new network infrastructure and the management requirements that come along with it
• iWARP can be used with existing software running on GPUDirect RDMA with NO CHANGES required to the code
• Applications that use GPUDirect RDMA will see huge performance improvements
• Chelsio provides 10/40Gb iWARP TODAY, with 25/50/100Gb on the horizon
More Information: GPUDirect RDMA over 40GbE iWARP
• Visit our website, www.chelsio.com, for more white papers, benchmarks, etc.
• GPUDirect RDMA white paper: http://www.chelsio.com/wp-content/uploads/resources/T5-40Gb-Linux-GPUDirect.pdf
• Webinar: https://www.brighttalk.com/webcast/13671/189427
• Beta code for GPUDirect RDMA is available TODAY from our download site at service.chelsio.com
• Contact: [email protected][email protected]
Questions?
Thank You