Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
©2017ArmLimited
MvapichUserGroupMee:ng,2017Annapolis,MD
HPCNetworkStackonArm
PavelShamis/PashaPrincipalResearchEngineer
©2017ArmLimited
ArmOverview
©2017ArmLimited3
Anintroduc0ontoArm
Armistheworld'sleadingsemiconductorintellectualpropertysupplier.Welicensetoover350partners,arepresentin95%ofsmartphones,80%ofdigitalcameras,35%ofallelectronicdevices,andatotalof60billionArmcoreshavebeenshippedsince1990.
OurCPUbusinessmodel:Licensetechnologytopartners,whouseittocreatetheirownsystem-on-chip(SoC)products.
Wemaylicenseaninstruc:onsetarchitecture(ISA)suchas“ARMv8-A”)
oraspecificimplementa:on,suchas“Cortex-A72”.
PartnerswholicenseanISAcancreatetheirownimplementa:on,aslongasitpassesthecompliancetests.
…andourIPextendsbeyondtheCPU
©2017ArmLimited4
Abusinessmodelthatsharessuccess• Everyoneinthevaluechainbenefits• Longtermsustainability
Designonceandreuseisfundamental• Spreadthecostamongstmanypartners• Technologyreusedacrossmul:pleapplica:ons• Createsmarketforecosystemtotarget
– Re-useisalsofundamentaltotheecosystemUpfrontlicensefee• Coversthedevelopmentcost
Ongoingroyal:es• Typicallybasedonapercentageofchipprice• Vestedinterestinsuccessofcustomers
Apartnershipbusinessmodel
Approximately1350licensesGrowsby~120everyyear
Morethan440poten:alroyaltypayers
14.8bn+Arm-poweredchipsin2015
BusinessDevelopment
LicencefeeIP
Provider
SemiCoPartner
ArmLicensestechnologytoPartner
Partnersdevelopchips
OEMCustomer
OEMsellsconsumerproducts
Royalty
©2017ArmLimited5
RangeofSoCsaddressinginfrastructure
HighlyAccelerated MassivelyMul:core
Onesizedoesnotfitall
QorIQ®Layerscape2080A
©2017ArmLimited6
SeriousArmHPCdeploymentsstar0ngin2017TwobigannouncementsaboutArminHPCinEurope:
©2017ArmLimited7
Japan
slidesfromFujitsuatISC’16
©2017ArmLimited
Integra0onwithNetworkInterconnects
©2017ArmLimited9
CCIX
AcceleratorsandNetwork(NIC/HCA/etc.)asafirstclass“ci:zen”inthesystem
Seamlessprocessandacceleratorhardwarecachecoherencesupport
Low-latencyandhigh-bandwidth
Allowin-lineaccelera:on• Bumpinthewireprocessing(networkpacketprocessing,storageaccelera:on,etc.)
Allows“off-line”accelera:on(co-processormodel)
Driver-less/interrupt-lessusagemodel
hmp://www.ccixconsor:um.com
©2017ArmLimited10
Scale-upservernode
DMC-620
DMC-620 DMC-620
DMC-620
CoreLink CMN-600
Shared virtual memory system
DMC-620
DMC-620 DMC-620
DMC-620
CoreLink CMN-600
Accelerator
SmartNetwork
PersistentMemory
©2017ArmLimited11
CCIXmul0chipconnec0vityandtopologiesNewclassofinterconnectprovidinghighperformance,lowlatencyfornewacceleratorsusecases
• CCIXdefines25GT/s(3xperformance*)
• Examining56GT/s(7xperformance*)andbeyond
• Enablinglowlatencyvialighttransac:onlayer
Flexible,scalableinterconnecttopologies
• Flexiblepoint-to-point,daisychainedandswitchedtopologies
Simplifieddeploymentbyleveragingexis:ngPCIehardwareandsopwareinfrastructure
• Runsonexis:ngPCIetransportlayerandmanagementstack
• CoexistwithlegacyPCIedesigns
ComputeNode
Accelerator
SmartNetwork
PersistentMemory
Switch
*Note:BasedonPCIeGen3Performance
©2017ArmLimited12
BuildingCCIXdevices
Cadence®IPforCCIX• BuiltuponsiliconprovenPCIesolu:ons
IPproducts:• ControllerIP–ProvidestheCCIXtransac:onanddatalinklayers.
• PHYIP–ProvidesthehighperformanceSERDESphysicallayersuppor:ngspeedsupto25Gpbs.
• Verifica:onIP–ProvidesthenecessarytestinfrastructuretoverifyCCIXdesigns.
CadenceController&PHYIP
©2017ArmLimited13
CadenceCCIXintegra0on
PCIe
Transac:on
Layer
CCIX
Transac:on
Layer
DataLinkLayer
PHY
(upto25G
pbs)
16Lanes
DMC-620
DMC-620 DMC-620
DMC-620
CoreLink CMN-600 XP
CML
RNI
CXS
AXI
CadenceIP
ExampleCMN-600meshdesignLow latency CCIX transaction layer
Support up to 25Gbps vs 16Gbps PCIe Gen4
CML converts CHI to CCIX messages
PCIe IP connects to a CMN IO interface via AXI
©2017ArmLimited14
Gen-Z
AlldataisaccessedbysomeformofaReadoraWrite
Exampleofreads:DDRRow+ColumnRead,PCIDMARead,SCSIWrite,SocketRead,FileRead,RDMARead
Exampleofwrites:DDRRow+ColumnWrite,PCIDMAWrite,SCSIRead,SocketWrite,FileWrite,RDMAWrite
TheGoal:Simplifyworldtomemoryseman:cReads&Writes
©2017ArmLimited15
Gen-ZOverview
Anopen,standards-based,scalable,systeminterconnectandprotocol.
Op:mizedtosupportmemoryseman:ccommunica:ons
BreaksProcessor-MemoryInterlock
Splitcontrollermodel• Memorycontroller
– Ini:ateshigh-levelrequests—Read,Write,Atomic,Put/Get,etc.– Enforcesordering,reliability,pathselec:on,etc.
• Mediacontroller– Abstractsmemorymedia– Supportsvola:le/non-vola:le/mixed-media– Performsmedia-specificopera:ons– Executesrequestsandreturnsresponses– Enablesdata-centriccompu:ng(accelerator,compute,etc.)
©2017ArmLimited16
IntroducingtheScalableVectorExtension(SVE)A vector extension to the ARMv8-A architecture with some major new features:
Gather-loadandscamer-storeLoadsasingleregisterfromseveralnon-con:guousmemoryloca:ons.
Per-lanepredica:onOpera:onsworkonindividuallanesundercontrolofapredicateregister.
Predicate-drivenloopcontrolandmanagementEliminatescalarloopheadsandtailsbyprocessingpar:alvectors.
Vectorpar::oningandsopware-managedspecula:onFirstFaul:ngLoadinstruc:onsallowmemoryaccessestocrossintoinvalidpages.
Extendedfloa:ng-pointhorizontalreduc:onsIn-orderandtree-basedreduc:onstrade-offperformanceandrepeatability.
1 2 3 45 5 5 51 0 1 0
6 2 8 4
+
=pred
1 2 0 01 1 0 0
+pred
1 2
1 +2 +3 +4
1 +2
+
3 +4
3 7= =
=
=
n-21 01 0CMPLT n
n-1 n n+1INDEX i
for (i = 0; i < n; ++i)
©2017ArmLimited17
What’sthevectorlength?
Thereisnopreferredvectorlength
• VectorLength(VL)istheCPUimplementor’schoice,from128to2048bits,inincrementsof128
• Adop:ngaVectorLengthAgnos:c(VLA)codegenera:onstylemakescodeportableacrossallpossiblevectorlengths.
• VLAismadepossiblebytheper-lanepredica:on,predicate-drivenloopcontrol,vectorpar::oningandsopware-managedspecula:onfeaturesofSVE.
• Noneedtorecompile,ortorewritehand-codedSVEassemblerorCintrinsics
VL
128b
0 64
256b
0 128
384b
0 128 256
128
256
384
512b
0 128 256 384 512
©2017ArmLimited
SoWwareEco-system
©2017ArmLimited19
Released
Planned
Concept
2016
Hardware
Open-Sourcesopware
ArmHPCtools
Arm HPC ecosystem roadmap
2017 Future
CaviumThunderX
AllineaDDTandMAP RogueWaveTotalView
AppliedMicroX-Gene1&2
Fujitsu–PostK(SVE)
ISVsopware
AMDSeamle
AltairPBSPro
CaviumThunderX2
QualcommCentriq
NAGLibrary&Compiler
ArmOp:mizedRou:nes
GCC(gcc/g++/gfortran)LLVM-clang LLVM–Flang
ISVsopware
ArmCodeAdvisor(Beta) ArmCodeAdvisor(Fullrelease)
ArmOp:mizedRou:nes–vectorversions
Phy:umMars
OpenHPC1.3.1
ArmPerformanceLibraries
PathScaleENZO
ArmFortranCompilerArmC/C++Compiler–aheadofLLVMtrunk
ArmInstruc:onEmulator
AppliedMicroX-Gene3
©2017ArmLimited20
©2017ArmLimited21
Linux/FreeBSDw/AARCH64support
ßEngagedwithFreeBSDfounda:on/Semi-half&CaviumtogetFreeBSDonARMv8FreeBSDBetaversiondemo’dbySemihalf–Nov.2015
12.04LTS&14.04LTSreleased
ßAlso14.10&15.04releases
RedHatEnterpriseLinuxServerforArm7.2BETA–Sept,2015
Fedora22released–May2015Fedora23released–Nov2015
CentOSLinux7forAArch64GA–August2015
OpenSUSE13.2–Nov2014 SUSELaunchesPartnerProgramtoBringSUSELinuxEnterprise12to64-bitArm
July2015@ISC
Debian8addsAARCH64–April2015
©2017ArmLimited22
Opensourceandcommercialcompilers
§ LLVM § C, C++, Fortran § OpenMP 3.1, (4.0 coming soon) § Fortran coming Q1 2017
§ NAG § Fortran § OpenMP 3.1
§ GCC § C, C++, Fortran § OpenMP 4.0
§ Arm C/C++/Fortran Compiler § LLVM based § Includes SVE
©2017ArmLimited
RDMA
©2017ArmLimited24
RDMANetworks
RemoteDirectMemoryAccess(RDMA)–popularhardwarenetworktechnology• InfiniBand–37%ofsystemsinTOP500
hmps://www.top500.org
©2017ArmLimited25
VERBsAPIonArm
• Besidesbugfixesnotmuchworkwasrequired• MellanoxOFED2.4andabovesupportsArm• LinuxKernel4.5.0andabove(maybeevenearlier)• LinuxDistribu:onSupport–ongoingprocess• OFED–noARMv8support
©2017ArmLimited26
OpenUCXv1.2
• ThefirstofficialreleasefromOpenUCXcommunity• hmps://github.com/openucx/ucx/releases/tag/v1.2.0
• Features• SupportforInfiniBandandRoCE
• TransportsRC,UD,DC
• SupportforAcceleratedVerbs–40%speeduponArmcomparedtovanillaVerbs
• SupportforUGNIAPIforAriesandGemini
• SupportforSharedMemory:KNEM,CMA,XPMEM,Posix,SySV
• Supportforx86,ARMv8,Power
• Efficientmemorypolling–36%increaseinefficiencyonArm
• UCXinterfaceisintegratedwithMPICH,OpenMPI,OSHMEM,ORNL-SHMEM,etc.
PavelShamis,M.GrahamLopez,andGiladShainer.“EnablingOne-sidedCommunica@onSeman@csonArm“,HIPS2017
©2017ArmLimited27
Programingmodels
MVAPICH2.3b–worksonARMv8!
OpenMPI–worksonARMv8
MPICH–compilesandrunsonARMv8
OSHMEM–compilesandruns
©2017ArmLimited28
LessonsLearned
MemoryBarriers• Mul:threadenvironment
• Sopware-hardwareinterac:on
• Exampleshmps://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25
• Youcan“fish”forthesebugsinMPIimplementa:onsaroundEager-RDMAandsharedmemoryprotocols
WritePayload
WriteNo0fy
Busy-waitReadNo0fy
ReadPayload
WriteBarrier ReadBarrier
RDMA
RDMA
Maranget,Luc,SusmitSarkar,andPeterSewell."Atutorialintroduc@ontotheArmandPOWERrelaxedmemorymodels."DraQavailablefromhSp://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf(2012).
©2017ArmLimited29
LessonsLearned–con0nued
Notallcache-linesare64Byte!• Implementa:ondependent
• 128Byteand64Byte
hmp://xeroxnostalgia.com/duplicators/xerox-9200/
©2017ArmLimited
PreliminaryResults
©2017ArmLimited31
Testbed
• 2xSopironOverdrive3000serverswithAMDOpteronA1100/2GHz
• ConnectX-4IB/VPIEDR(PCIegen2x8)• Ubuntu16.04• MOFED3.3-1.5.0.0
• UCX[0558b41]• XPMEM[bdfcc52]
• OSHMEM/OPEN-MPI[fed4849]
PavelShamis,M.GrahamLopez,andGiladShainer.“EnablingOne-sidedCommunica@onSeman@csonArm“
©2017ArmLimited32
HardwareSoWwareStackOverview
©2017ArmLimited33
XPMEM
XPMEM-CrossMemoryAmach• ThecodewasupdatedtocompileandrunonARMv8(hmps://github.com/hjelmn/xpmem)
• 3rdpartykernelmodulebasedonSGI/CrayXPMEMandmaintainedbyLANL
XPMEM KNEM CMA MMAP
SystemCall No Yes Yes No
LowLatency V X X V
HighBandwidth V V V V
AnyVirtualAddress V V V X
OpenSource V V V V
UpstreamLinux X X V V
©2017ArmLimited34
OpenUCX:XPMEM
�����������
�������
����
������
���������
����
� � � �� �� �� ���
���
���
����
����
����
����
�����
�����
�����
������
������
������
�������
�������
�����������������
��������
��� �������
����� �������� ���
�����
�����
�����
�����
�����
�����
�����
�����
� � � � �� �� �� ���
���
���
����
���������������
��������
��� ������� ����
����� �������� ���
����
����
����
����
�����
�����
�����
�����
� � � �� �� �� ���
���
���
����
����
����
����
�����
�����
�����
������
������
������
�������
�������
����
��������
��� ���������
����� �������� ���
����
����
����
����
����
����
����
����
����
���
�����
������
������
�������
�����
������
������
�������
������������
������������
��� ������ ������ ���������� �����
©2017ArmLimited35
OpenUCXIB:MLX5vsVerbs
�
�
�
�
��
��
��
���
���
���
����
� � � �� �� �� ���
���
���
����
����
����
����
�����
�����
�����
������
������
������
�������
�������
�����������������
��������
��� �������
���� �������� ������� �������� ���
���
���
���
���
���
���
� � � � �� �� �� ���
���
���
����
����������
����������
�����
��������
��� ��� ������� ����
���� �������� ���
���
����
����
����
����
����
����
� � � �� �� �� ���
���
���
����
����
����
����
�����
�����
�����
������
������
������
�������
�������
����
��������
��� ��� ���������
���� �������� ���
���
�
���
�
���
�
���
�����
������
����
��
�������
�����
������
����
��
�������
������������
������������
��� ������ ������ ���������� �����
���������
40%
©2017ArmLimited36
SHMEM_WAIT()
���
�
���
�
���
��� ���������������������
�������������������
���������������������
���
������� �� ����� ������������� ������������
�
�����
�����
�������
�����
��� ���������������������
�������������������
����������������
���
����� �������������������� �����
�
�����
�����
�����
�����
�����
�������
�������
�������
��� ���������������������
�������������������
����������
���
����� �������������� �����
73%
35%
©2017ArmLimited37
OpenSHMEMSSCA
�
���
���
���
���
���
� � � ��
�������������
���
���� ���������
��� �������� ����
��� �����
7-30%
©2017ArmLimited38
OpenSHMEMGUPs
21%
�������
�������
�������
�������
�������
�������
�������
� � � ��
��������������
����
�����
���
��������� ���� ���������
��� ���� ���������� ����� �������
��� ���� ���������������� ����� �������������
©2017ArmLimited39
Developerwebsite:www.arm.com/hpc
AHPC-specificmicrositeThisishometoourHPCecosystemoffering:• technicalreferencematerial• how-toguides• latestnewsandupdatesfrompartners• downloadsofHPClibraries• third-partysopwarerecommenda:ons• webforumforcommunitydiscussionandhelp
Par:cipateandhelpdrivethecommunity
4040 ©2017ArmLimited
TheArmtrademarksfeaturedinthispresenta:onareregisteredtrademarksortrademarksofArmLimited(oritssubsidiaries)intheUSand/orelsewhere.Allrightsreserved.Allothermarksfeaturedmaybetrademarksoftheirrespec:veowners.www.arm.com/company/policies/trademarks