HPC Network Stack on Arm - Ohio State...


© 2017 Arm Limited

MVAPICH User Group Meeting, 2017 Annapolis, MD

HPC Network Stack on Arm

Pavel Shamis / Pasha, Principal Research Engineer


Arm Overview


An introduction to Arm

Arm is the world's leading semiconductor intellectual property supplier. We license to over 350 partners and are present in 95% of smartphones, 80% of digital cameras, and 35% of all electronic devices. A total of 60 billion Arm cores have been shipped since 1990.

Our CPU business model: License technology to partners, who use it to create their own system-on-chip (SoC) products.

We may license an instruction set architecture (ISA), such as "ARMv8-A", or a specific implementation, such as "Cortex-A72".

Partners who license an ISA can create their own implementation, as long as it passes the compliance tests.

…and our IP extends beyond the CPU


A partnership business model

A business model that shares success
•  Everyone in the value chain benefits
•  Long term sustainability

Design once and reuse is fundamental
•  Spread the cost amongst many partners
•  Technology reused across multiple applications
•  Creates market for ecosystem to target
    –  Re-use is also fundamental to the ecosystem

Upfront license fee
•  Covers the development cost

Ongoing royalties
•  Typically based on a percentage of chip price
•  Vested interest in success of customers

Approximately 1350 licenses
Grows by ~120 every year

More than 440 potential royalty payers

14.8bn+ Arm-powered chips in 2015

[Diagram: business development flow. Arm (IP provider) licenses technology to a SemiCo partner (licence fee); partners develop chips; the OEM customer sells consumer products (royalty).]


Range of SoCs addressing infrastructure

Highly Accelerated    Massively Multicore

One size does not fit all

QorIQ® Layerscape 2080A


Serious Arm HPC deployments starting in 2017

Two big announcements about Arm in HPC in Europe:


Japan

Slides from Fujitsu at ISC'16


Integration with Network Interconnects


CCIX

Accelerators and Network (NIC/HCA/etc.) as a first class "citizen" in the system

Seamless processor and accelerator hardware cache coherence support

Low-latency and high-bandwidth

Allows in-line acceleration
•  Bump in the wire processing (network packet processing, storage acceleration, etc.)

Allows "off-line" acceleration (co-processor model)

Driver-less / interrupt-less usage model

http://www.ccixconsortium.com


Scale-up server node

[Diagram: two CoreLink CMN-600 meshes, each with four DMC-620 memory controllers, forming a shared virtual memory system alongside Accelerator, Smart Network, and Persistent Memory devices.]


CCIX multichip connectivity and topologies

New class of interconnect providing high performance and low latency for new accelerator use cases

•  CCIX defines 25 GT/s (3x performance*)

•  Examining 56 GT/s (7x performance*) and beyond

•  Enabling low latency via light transaction layer

Flexible, scalable interconnect topologies

•  Flexible point-to-point, daisy chained and switched topologies

Simplified deployment by leveraging existing PCIe hardware and software infrastructure

•  Runs on existing PCIe transport layer and management stack

•  Coexists with legacy PCIe designs

[Diagram: Compute Node, Accelerator, Smart Network, and Persistent Memory devices connected directly or through a Switch.]

*Note: Based on PCIe Gen3 performance
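
The 3x and 7x figures quoted for CCIX can be checked with simple arithmetic. The sketch below assumes that the footnote's "PCIe Gen3 performance" refers to the Gen3 per-lane signalling rate of 8 GT/s, and it ignores encoding and protocol overhead. The constant and function names are illustrative only.

```python
# Per-lane signalling rate of PCIe Gen3 in GT/s (assumed to be the slide's baseline).
PCIE_GEN3_GTS = 8.0


def speedup_vs_gen3(rate_gts: float) -> float:
    """Return a raw per-lane rate relative to PCIe Gen3 (no encoding overhead)."""
    return rate_gts / PCIE_GEN3_GTS


if __name__ == "__main__":
    for rate in (25.0, 56.0):
        print(f"{rate:g} GT/s is {speedup_vs_gen3(rate):.3f}x PCIe Gen3 per lane")
```

Under that assumption, 25 GT/s is a little over three times the Gen3 rate and 56 GT/s is seven times, which is consistent with the rounded figures on the slide.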


Building CCIX devices

Cadence® IP for CCIX
•  Built upon silicon proven PCIe solutions

IP products:
•  Controller IP – Provides the CCIX transaction and data link layers.

•  PHY IP – Provides the high performance SERDES physical layer supporting speeds up to 25 Gbps.

•  Verification IP – Provides the necessary test infrastructure to verify CCIX designs.

Cadence Controller & PHY IP


Cadence CCIX integration

[Diagram: example CMN-600 mesh design with four DMC-620 memory controllers. The Cadence IP (PCIe transaction layer and CCIX transaction layer over a data link layer and a 16-lane PHY, up to 25 Gbps) connects to a CoreLink CMN-600 XP; diagram labels include CML, RNI, CXS, and AXI.]

Low latency CCIX transaction layer

Supports up to 25 Gbps vs 16 Gbps for PCIe Gen4

CML converts CHI to CCIX messages

PCIe IP connects to a CMN IO interface via AXI


Gen-Z

All data is accessed by some form of a Read or a Write

Examples of reads: DDR Row + Column Read, PCI DMA Read, SCSI Write, Socket Read, File Read, RDMA Read

Examples of writes: DDR Row + Column Write, PCI DMA Write, SCSI Read, Socket Write, File Write, RDMA Write

The Goal: Simplify world to memory semantic Reads & Writes


Gen-Z Overview

An open, standards-based, scalable, system interconnect and protocol.

Optimized to support memory semantic communications

Breaks Processor-Memory Interlock

Split controller model
•  Memory controller
    –  Initiates high-level requests: Read, Write, Atomic, Put/Get, etc.
    –  Enforces ordering, reliability, path selection, etc.

•  Media controller
    –  Abstracts memory media
    –  Supports volatile / non-volatile / mixed-media
    –  Performs media-specific operations
    –  Executes requests and returns responses
    –  Enables data-centric computing (accelerator, compute, etc.)


Introducing the Scalable Vector Extension (SVE)

A vector extension to the ARMv8-A architecture with some major new features:

Gather-load and scatter-store
Loads a single register from several non-contiguous memory locations.

Per-lane predication
Operations work on individual lanes under control of a predicate register.

Predicate-driven loop control and management
Eliminate scalar loop heads and tails by processing partial vectors.

Vector partitioning and software-managed speculation
First Faulting Load instructions allow memory accesses to cross into invalid pages.

Extended floating-point horizontal reductions
In-order and tree-based reductions trade off performance and repeatability.

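[Figures: per-lane predication (1 2 3 4 + 5 5 5 5 under predicate 1 0 1 0 = 6 2 8 4); vector partitioning under a predicate; in-order reduction 1 + 2 + 3 + 4 versus tree reduction (1 + 2) + (3 + 4); predicate-driven loop control for the loop "for (i = 0; i < n; ++i)" using INDEX i and CMPLT n.]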
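
Two of the SVE features listed above, per-lane predication and the two styles of horizontal reduction, can be modelled in a few lines of plain Python. This is only a behavioural sketch: it uses lists in place of vector registers, the function names are made up for illustration, and it assumes that inactive lanes keep the value of the first operand, as in the slide's 6 2 8 4 example.

```python
def predicated_add(a, b, pred):
    """Add b to a in the lanes where pred is set; inactive lanes keep a's value."""
    return [x + y if p else x for x, y, p in zip(a, b, pred)]


def in_order_reduce(lanes):
    """Sum strictly left to right, as a scalar loop would."""
    total = 0.0
    for value in lanes:
        total += value
    return total


def tree_reduce(lanes):
    """Sum by repeatedly adding neighbouring pairs."""
    lanes = list(lanes)
    if not lanes:
        return 0.0
    while len(lanes) > 1:
        paired = [lanes[i] + lanes[i + 1] for i in range(0, len(lanes) - 1, 2)]
        if len(lanes) % 2:
            paired.append(lanes[-1])  # an odd lane carries over unchanged
        lanes = paired
    return lanes[0]


if __name__ == "__main__":
    print(predicated_add([1, 2, 3, 4], [5, 5, 5, 5], [1, 0, 1, 0]))
    values = [1e16, 0.5, -1e16, 0.5]
    print(in_order_reduce(values), tree_reduce(values))
```

With small integers both reductions agree, but with floating-point inputs of very different magnitude the two orders round differently. That difference is the repeatability trade-off the slide mentions: the in-order form matches a scalar loop bit for bit, while the tree form exposes more parallelism.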


What's the vector length?

There is no preferred vector length

•  Vector Length (VL) is the CPU implementor's choice, from 128 to 2048 bits, in increments of 128

•  Adopting a Vector Length Agnostic (VLA) code generation style makes code portable across all possible vector lengths.

•  VLA is made possible by the per-lane predication, predicate-driven loop control, vector partitioning and software-managed speculation features of SVE.

•  No need to recompile, or to rewrite hand-coded SVE assembler or C intrinsics

[Figure: vector register layouts for VL = 128b, 256b, 384b, and 512b, with bit offsets marked along each register.]
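
The vector-length-agnostic style can be illustrated with a small Python model. It is a sketch under stated assumptions, not SVE code: lists stand in for vector registers, `whilelt` is a hypothetical helper that mimics a "while less than" predicate, and `vla_scale_add` is an invented example kernel. The loop never hard-codes the number of lanes, so the same code gives the same result for every legal vector length.

```python
def whilelt(start, limit, lanes):
    """Build a predicate: lane k is active while start + k < limit."""
    return [start + k < limit for k in range(lanes)]


def vla_scale_add(a, x, y, vl_bits, elem_bits=64):
    """Compute a * x[i] + y[i] with a loop that never hard-codes the vector length."""
    if vl_bits % 128 != 0 or not 128 <= vl_bits <= 2048:
        raise ValueError("VL must be a multiple of 128 between 128 and 2048 bits")
    lanes = vl_bits // elem_bits  # elements processed per vector iteration
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)  # partial on the final iteration, so no scalar tail
        for k, active in enumerate(pred):
            if active:
                out[i + k] = a * x[i + k] + y[i + k]
        i += lanes
    return out


if __name__ == "__main__":
    x = [float(v) for v in range(11)]
    y = [1.0] * len(x)
    for vl in (128, 256, 512, 2048):
        print(vl, vla_scale_add(2.0, x, y, vl))
```

The final iteration simply runs with a partly inactive predicate, which is how the predicate-driven loop control removes the need for a separate scalar tail loop.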


Software Eco-system


Arm HPC ecosystem roadmap

[Chart: roadmap across 2016, 2017, and Future, with entries marked Released, Planned, or Concept. Rows: Hardware, Open-Source software, Arm HPC tools, ISV software. The placement of individual entries on the chart was not recoverable; the entries are listed below.]

•  Applied Micro X-Gene 1 & 2
•  Applied Micro X-Gene 3
•  AMD Seattle
•  Cavium ThunderX
•  Cavium ThunderX2
•  Qualcomm Centriq
•  Phytium Mars
•  Fujitsu – Post K (SVE)
•  GCC (gcc/g++/gfortran)
•  LLVM – clang
•  LLVM – Flang
•  OpenHPC 1.3.1
•  Arm Optimized Routines
•  Arm Optimized Routines – vector versions
•  Arm Performance Libraries
•  Arm C/C++ Compiler – ahead of LLVM trunk
•  Arm Fortran Compiler
•  Arm Code Advisor (Beta)
•  Arm Code Advisor (Full release)
•  Arm Instruction Emulator
•  Allinea DDT and MAP
•  Rogue Wave TotalView
•  Altair PBS Pro
•  NAG Library & Compiler
•  PathScale ENZO


Linux / FreeBSD w/ AArch64 support

Engaged with FreeBSD Foundation / Semihalf & Cavium to get FreeBSD on ARMv8
FreeBSD Beta version demo'd by Semihalf – Nov. 2015

Ubuntu 12.04 LTS & 14.04 LTS released

Also 14.10 & 15.04 releases

Red Hat Enterprise Linux Server for Arm 7.2 BETA – Sept. 2015

Fedora 22 released – May 2015
Fedora 23 released – Nov 2015

CentOS Linux 7 for AArch64 GA – August 2015

OpenSUSE 13.2 – Nov 2014
SUSE Launches Partner Program to Bring SUSE Linux Enterprise 12 to 64-bit Arm, July 2015 @ ISC

Debian 8 adds AArch64 – April 2015


Open source and commercial compilers

•  LLVM
    –  C, C++, Fortran
    –  OpenMP 3.1 (4.0 coming soon)
    –  Fortran coming Q1 2017

•  NAG
    –  Fortran
    –  OpenMP 3.1

•  GCC
    –  C, C++, Fortran
    –  OpenMP 4.0

•  Arm C/C++/Fortran Compiler
    –  LLVM based
    –  Includes SVE


RDMA


RDMA Networks

Remote Direct Memory Access (RDMA) – popular hardware network technology
•  InfiniBand – 37% of systems in TOP500

https://www.top500.org


VERBs API on Arm

•  Besides bug fixes not much work was required
•  Mellanox OFED 2.4 and above supports Arm
•  Linux Kernel 4.5.0 and above (maybe even earlier)
•  Linux Distribution Support – ongoing process
•  OFED – no ARMv8 support


OpenUCX v1.2

•  The first official release from the OpenUCX community
•  https://github.com/openucx/ucx/releases/tag/v1.2.0

•  Features
    •  Support for InfiniBand and RoCE
    •  Transports RC, UD, DC
    •  Support for Accelerated Verbs – 40% speedup on Arm compared to vanilla Verbs
    •  Support for UGNI API for Aries and Gemini
    •  Support for Shared Memory: KNEM, CMA, XPMEM, Posix, SysV
    •  Support for x86, ARMv8, Power
    •  Efficient memory polling – 36% increase in efficiency on Arm

•  UCX interface is integrated with MPICH, OpenMPI, OSHMEM, ORNL-SHMEM, etc.

Pavel Shamis, M. Graham Lopez, and Gilad Shainer. "Enabling One-sided Communication Semantics on Arm", HIPS 2017


Programming models

MVAPICH 2.3b – works on ARMv8!

OpenMPI – works on ARMv8

MPICH – compiles and runs on ARMv8

OSHMEM – compiles and runs


Lessons Learned

Memory Barriers
•  Multithread environment

•  Software-hardware interaction

•  Examples: https://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25

•  You can "fish" for these bugs in MPI implementations around Eager-RDMA and shared memory protocols

[Diagram: one side performs Write Payload, Write Barrier, Write Notify (RDMA); the other side performs Busy-wait Read Notify, Read Barrier, Read Payload (RDMA).]

Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introduction to the Arm and POWER relaxed memory models." Draft available from http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf (2012).
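
Why the write barrier between the payload and the notification matters can be shown with a toy model. The sketch below is not the Arm memory model and does not use real barriers (Python does not expose them). It assumes a deliberately simplified rule: stores issued between two barriers may become visible in any order, while a barrier forces all earlier stores to become visible first. The reader is assumed to use its own read barrier, so only writer-side reordering is modelled. All names are hypothetical.

```python
from itertools import permutations


def payloads_seen_after_notify(writer_ops):
    """Return every payload value a reader could see once it observes notify == 1.

    writer_ops holds ("store", name, value) and ("barrier",) tuples. Stores
    between two barriers may become visible in any order; a barrier makes all
    earlier stores visible before any later one.
    """
    # Split the writer's program into groups of stores separated by barriers.
    groups, current = [], []
    for op in writer_ops:
        if op[0] == "barrier":
            groups.append(current)
            current = []
        else:
            current.append(op)
    groups.append(current)

    seen = set()

    def explore(index, memory):
        if index == len(groups):
            return
        # Try every visibility order for the stores inside this group.
        for order in permutations(groups[index]):
            mem = dict(memory)
            for _, name, value in order:
                mem[name] = value
                if mem["notify"] == 1:
                    seen.add(mem["payload"])
            explore(index + 1, mem)

    explore(0, {"payload": 0, "notify": 0})
    return seen


unordered = [("store", "payload", 42), ("store", "notify", 1)]
ordered = [("store", "payload", 42), ("barrier",), ("store", "notify", 1)]

if __name__ == "__main__":
    print(payloads_seen_after_notify(unordered))
    print(payloads_seen_after_notify(ordered))
```

Without the barrier the model admits an execution in which the reader sees the notification but still reads the stale payload. With the barrier only the fresh payload is observable. On a strongly ordered machine the buggy version may appear to work, which is why such bugs tend to surface only after porting.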


Lessons Learned – continued

Not all cache-lines are 64 Byte!
•  Implementation dependent

•  128 Byte and 64 Byte

http://xeroxnostalgia.com/duplicators/xerox-9200/
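
Code that pads or aligns shared data to a hard-coded 64 bytes can place two independently updated items in the same 128-byte line. A minimal sketch of the alternative is to query the line size at run time and round sizes up to it. The helper names below are illustrative, and the `SC_LEVEL1_DCACHE_LINESIZE` query is an assumption about the platform: it is typically available on Linux with glibc, may be missing elsewhere, and may report 0 on some systems, so the code falls back to a caller-supplied default.

```python
import os


def cache_line_size(default=64):
    """Return the L1 data cache line size in bytes, or `default` if it is unknown."""
    try:
        size = os.sysconf("SC_LEVEL1_DCACHE_LINESIZE")
    except (AttributeError, ValueError, OSError):
        return default  # the name or the call is not available on this platform
    return size if size > 0 else default  # some systems report 0 or -1


def padded_size(nbytes, line):
    """Round nbytes up to a whole number of cache lines."""
    return -(-nbytes // line) * line


if __name__ == "__main__":
    line = cache_line_size()
    print(f"cache line: {line} bytes, 72-byte record padded to {padded_size(72, line)}")
```

A 72-byte record, for example, occupies 128 bytes after padding on both 64-byte and 128-byte systems, whereas a 40-byte record needs 64 bytes on one and 128 bytes on the other.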


Preliminary Results


Testbed

•  2 x Softiron Overdrive 3000 servers with AMD Opteron A1100 / 2 GHz

•  ConnectX-4 IB/VPI EDR (PCIe gen2 x8)
•  Ubuntu 16.04
•  MOFED 3.3-1.5.0.0

•  UCX [0558b41]
•  XPMEM [bdfcc52]

•  OSHMEM/OPEN-MPI [fed4849]

Pavel Shamis, M. Graham Lopez, and Gilad Shainer. "Enabling One-sided Communication Semantics on Arm"


Hardware Software Stack Overview


XPMEM

XPMEM - Cross Memory Attach
•  The code was updated to compile and run on ARMv8 (https://github.com/hjelmn/xpmem)

•  3rd party kernel module based on SGI/Cray XPMEM and maintained by LANL

                      XPMEM   KNEM   CMA   MMAP
System Call           No      Yes    Yes   No
Low Latency           Yes     No     No    Yes
High Bandwidth        Yes     Yes    Yes   Yes
Any Virtual Address   Yes     Yes    Yes   No
Open Source           Yes     Yes    Yes   Yes
Upstream Linux        No      No     Yes   Yes


OpenUCX: XPMEM

[Figure: performance charts for this slide; the chart text was not recoverable from the source.]


OpenUCX IB: MLX5 vs Verbs

[Figure: performance charts comparing MLX5 and Verbs; the chart text was not recoverable from the source. Surviving annotation: 40%.]


SHMEM_WAIT()

[Figure: charts for this slide; the chart text was not recoverable from the source. Surviving annotations: 73% and 35%.]


OpenSHMEM SSCA

[Figure: chart for this slide; the chart text was not recoverable from the source. Surviving annotation: 7-30%.]


OpenSHMEM GUPs

[Figure: chart for this slide; the chart text was not recoverable from the source. Surviving annotation: 21%.]


Developer website: www.arm.com/hpc

An HPC-specific microsite
This is home to our HPC ecosystem offering:
•  technical reference material
•  how-to guides
•  latest news and updates from partners
•  downloads of HPC libraries
•  third-party software recommendations
•  web forum for community discussion and help

Participate and help drive the community


The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks
