Explicitly Parallel Platforms
• Explicit parallelism, task parallelism
• Mostly on the order of >> 10
• Requires active involvement of the programmer and/or compiler (no free lunch)
• Requires additional program constructs
• Requires new programming paradigms
Amdahl's Law
Given a computation of which a fraction q cannot be parallelized, the maximum speedup with P processors is limited to:

    S_P = T / (q·T + (1−q)·T / P)

with T the sequential time. So, with
• q = 0.01: max speedup ≤ 100, regardless of P
• q = 0.05: max speedup ≤ 20, regardless of P
• q = 0.10: max speedup ≤ 10, regardless of P
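The bound above is easy to check numerically; a minimal sketch in plain Python (the function name is ours). Since T cancels out, the speedup depends only on q and P, and tends to 1/q as P grows:

```python
# Amdahl's Law as stated above: speedup with P processors when a
# fraction q of the sequential time T cannot be parallelized.
# S_P = T / (q*T + (1-q)*T/P) = 1 / (q + (1-q)/P), independent of T.

def amdahl_speedup(q: float, p: int) -> float:
    """Maximum speedup on p processors with sequential fraction q."""
    return 1.0 / (q + (1.0 - q) / p)

# The limit for P -> infinity is 1/q, matching the bounds on the slide:
print(amdahl_speedup(0.01, 10**9))  # approaches 100
print(amdahl_speedup(0.05, 10**9))  # approaches 20
print(amdahl_speedup(0.10, 10**9))  # approaches 10
```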
Flynn's Taxonomy
• Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
• If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
• If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD and MIMD Architectures
A typical SIMD architecture (a) and a typical MIMD architecture (b).
SIMD Processors
• The same instruction is executed on different processors (functional units). Execution is tightly synchronized.
• Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines.
• Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and GPUs such as NVIDIA's.
• SIMD relies on the regular structure of computations (such as those in image processing).
• It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines whether a processor should participate in a computation or not.
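The activity-mask idea can be sketched in a few lines of plain Python (the names are ours, not from any SIMD API): every lane receives the same instruction, but only lanes whose mask bit is set commit the result.

```python
# A toy model of an SIMD activity mask: all lanes see the same operation,
# but only the lanes marked active in the mask actually commit the result.

def simd_step(lanes, mask, op):
    """Apply op to every lane, committing results only where mask is True."""
    return [op(x) if active else x for x, active in zip(lanes, mask)]

data = [1, 2, 3, 4]
mask = [True, False, True, False]   # e.g. the result of a data-dependent test
print(simd_step(data, mask, lambda x: x * 2))  # [2, 2, 6, 4]
```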
MIMD Processors
• In contrast to SIMD processors, MIMD processors can execute different programs on different processors.
• A variant of this, called single program multiple data streams (SPMD), executes the same program on different processors but allows different instructions to be executed on each processor (if/case statements).
• It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
• Examples of such platforms include current-generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
SIMD-MIMD Comparison
• SIMD computers require less hardware than MIMD computers (a single control unit).
• However, since SIMD processors are tightly synchronized and therefore specially designed, they tend to be expensive and have long design cycles. (NVIDIA forms an exception to this. Why?)
• In contrast, platforms supporting the MIMD/SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.
• Not all applications are naturally suited to SIMD processors.
• MIMD/SPMD platforms have a relatively large communication overhead and therefore ask for coarse-grain parallelism.
Communication Model of Parallel Platforms
• There are two primary forms of data exchange between parallel tasks: accessing a shared data space and exchanging messages.
• Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
• Platforms that support messaging are also called message-passing platforms or multicomputers.
Shared-Address-Space Platforms
• Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this shared address space.
• If the time taken by a processor to access any memory word in the system is identical, the platform is classified as a uniform memory access (UMA) machine. If this is not the case, we refer to it as a non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms
(a) Uniform-memory-access shared-address-space computer;
(b) Uniform-memory-access shared-address-space computer with caches and memories;
(c) Non-uniform-memory-access shared-address-space computer with local memory only.
Programming Consequences
• In contrast to UMA platforms, NUMA machines require locality from the underlying algorithms for performance.
• Programming shared-address-space platforms is easier, since reads and writes are implicitly visible to other processors.
• However, read and write accesses to shared data must be coordinated.
• Caches in such machines require coordinated access to multiple copies of the same data. This leads to the cache coherence problem.
• A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.
Shared-Address-Space vs. Shared-Memory Machines
• We refer to shared-address-space platforms as a programming abstraction and to shared-memory machines as physical machines.
• It is possible to provide a shared address space using a physically distributed memory.
Message-Passing Platforms
• These platforms comprise a set of processors, each with its own (exclusive) memory.
• Natural examples are clustered workstations and non-shared-address-space multicomputers.
• These platforms are programmed using (variants of) send and receive primitives.
• Libraries such as MPI (Message Passing Interface) and PVM (Parallel Virtual Machine) provide such primitives. OpenMP, by contrast, is an API based on multithreading.
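The send/receive style can be sketched without MPI itself, using Python's standard multiprocessing module: two processes with no shared memory exchange data over an explicit channel (the names `worker`, `parent_end`, etc. are ours).

```python
# Message passing in miniature: two processes, no shared state, explicit
# send and receive over a channel (Pipe), in the spirit of MPI_Send/MPI_Recv.

from multiprocessing import Process, Pipe

def worker(conn):
    msg = conn.recv()            # blocking receive
    conn.send(msg * 2)           # explicit reply
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(21)          # explicit send; no implicitly visible writes
    print(parent_end.recv())     # prints 42
    p.join()
```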
Message Passing vs. Shared-Address-Space Platforms
• Message passing requires little hardware support, other than a network.
• Shared-address-space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).
Interconnection Networks for Parallel Computers
• Interconnection networks carry data between processors and to memory.
• Interconnects are made of switches and links (wires, fiber).
• Interconnects are classified as static or dynamic.
• Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks.
• Dynamic networks are built using switches and communication links. Dynamic networks are also referred to as indirect networks.
Static and Dynamic Interconnection Networks
Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Network Topologies
• A variety of network topologies have been proposed and implemented.
• These topologies trade off performance for cost.
• Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.
Network Topologies: Buses
• Some of the simplest and earliest parallel machines used buses.
• All processors access a common bus for exchanging data.
• The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
• However, the bandwidth of the shared bus is a major bottleneck.
• Typical bus-based machines are limited to dozens of nodes. Sun (Cray) servers and Intel Core based shared-bus multiprocessors are examples of such architectures.
Network Topologies: Buses
Bus-based interconnects (a) with no local caches; (b) with local memory/caches.
Since much of the data accessed by processors is local to the processor, a local memory can improve performance.
Network Topologies: Crossbars
A completely non-blocking crossbar network connecting p processors to m memory banks.
A crossbar network uses a p×m grid of switches to connect p inputs to m outputs in a non-blocking manner.
Network Topologies: Crossbars
• The cost of a crossbar of p processors grows as O(p²).
• This is generally difficult to scale for large values of p.
• Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.
Network Topologies: Multistage Networks
• Crossbars have excellent performance scalability but poor cost scalability.
• Buses have excellent cost scalability, but poor performance scalability.
• Multistage interconnects strike a compromise between these extremes.
Network Topologies: Multistage Networks
The schematic of a typical multistage interconnection network.
Network Topologies: Multistage Omega Network
• One of the most commonly used multistage interconnects is the Omega network.
• This network consists of log p stages, where p is the number of inputs/outputs.
• At each stage, input i is connected to output j by:

    j = 2i,           for 0 ≤ i ≤ p/2 − 1
    j = 2i + 1 − p,   for p/2 ≤ i ≤ p − 1

  i.e., j is obtained by rotating the (log p)-bit representation of i one bit to the left.
Network Topologies: Omega Network
Each stage of the Omega network implements a perfect shuffle as follows:
A perfect shuffle interconnection for eight inputs and outputs.
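The perfect-shuffle mapping (output j is input i rotated one bit to the left, equivalently j = 2i for i < p/2 and j = 2i + 1 − p otherwise) can be sketched as follows; the function name is ours:

```python
# The perfect shuffle used by each Omega stage: rotate the log2(p)-bit
# index one position to the left (p must be a power of two).

def perfect_shuffle(i: int, p: int) -> int:
    """Rotate the log2(p)-bit index i one bit to the left."""
    bits = p.bit_length() - 1          # log2(p) for a power-of-two p
    return ((i << 1) | (i >> (bits - 1))) & (p - 1)

# For p = 8 inputs, the shuffle maps 0..7 to:
print([perfect_shuffle(i, 8) for i in range(8)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```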
Routing in an Omega Network
Connecting X1X2X3X4 to Y1Y2Y3Y4 (Sw = 2×2 switch, PS = perfect shuffle):

X1X2X3X4 -Sw-> X1X2X3Y4 -PS-> X2X3Y4X1 -Sw-> X2X3Y4Y1 -PS-> X3Y4Y1X2
         -Sw-> X3Y4Y1Y2 -PS-> Y4Y1Y2X3 -Sw-> Y4Y1Y2Y3 -PS-> Y1Y2Y3Y4
Network Topologies: Multistage Omega Network
• The perfect shuffle patterns are connected using 2×2 switches.
• The switches operate in two modes: cross-over or pass-through (switch the bit position or not).

Two switching configurations of the 2×2 switch: (a) pass-through; (b) cross-over.
Network Topologies: Multistage Omega Network
A complete Omega network with the perfect shuffle interconnects and switches can be illustrated as follows:
A complete Omega network Ω8 connecting 8 inputs and eight outputs. An Omega network Ωn has (n/2)·log n switching nodes (log n stages).
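Self-routing through an Omega network can be sketched as follows. This sketch (ours) uses the conventional shuffle-then-exchange order: each stage first applies the perfect shuffle and then lets the 2×2 switch set the lowest address bit to the next destination bit, most significant bit first. (The trace above applies the switch first; the net effect is the same: the message always arrives at its destination.)

```python
# Destination-tag routing through an Omega network with p = 2^k ports:
# per stage, shuffle (rotate left) then switch (overwrite the lowest bit
# with the next destination bit, MSB first).

def omega_route(src: int, dst: int, p: int):
    """Return the sequence of intermediate addresses from src to dst."""
    bits = p.bit_length() - 1
    addr, trace = src, [src]
    for k in range(bits - 1, -1, -1):
        addr = ((addr << 1) | (addr >> (bits - 1))) & (p - 1)  # perfect shuffle
        addr = (addr & ~1) | ((dst >> k) & 1)                  # 2x2 switch
        trace.append(addr)
    return trace

trace = omega_route(0b0110, 0b1001, 16)
print([format(a, "04b") for a in trace])  # ends at '1001'
```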
Network Topologies: the Butterfly Network
• A variation of the Omega network.
• Two stages.
• In fact, the following networks are equivalent.
Relationship with FFT
Routing Properties
• Clos/Beneš showed that R_N R_N^{-1} can realize any permutation. The proof is based on Hall's marriage theorem: Imagine two groups, one of n men and one of n women. For each woman there is a subset of the men, any one of whom she would happily marry, and any man would be happy to marry a woman who wants to marry him. Consider whether it is possible to pair up (in marriage) the men and women so that every person is happy. If we let A_i be the set of men that the i-th woman would be happy to marry, then the marriage theorem states that each woman can happily marry a man if and only if, for any subset of the women, the number of men whom at least one of these women would be happy to marry is at least as big as the number of women in that subset. It is obvious that this condition is necessary: if it does not hold, there are not enough men to share among the women. What is interesting is that it is also a sufficient condition.
• Ω_N is equivalent to R_N^{-1}, so Ω_N^{-1} Ω_N can also realize any permutation. Non-blocking!
• This is not the case for Ω_N Ω_N.
• Ω_N Ω_N Ω_N can also realize any permutation. The proof is based on counting arguments; the actual routing is very complicated.
Network Topologies: Completely Connected Network
• Each processor is connected to every other processor.
• The number of links in the network scales as O(p²).
• While the performance scales very well, the hardware is not realizable for large values of p.
• In this sense, these networks are static counterparts of crossbars.
Network Topologies: Completely Connected and Star Connected Networks
(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.
Network Topologies: Star Connected Network
• Every node is connected only to a common node at the center.
• The distance between any pair of nodes is O(1). However, the central node becomes a bottleneck.
• In this sense, star connected networks are static counterparts of buses.
Network Topologies: Linear Arrays, Meshes, and k-d Meshes
• In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
• A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.
• A further generalization to d dimensions has nodes with 2d neighbors.
• A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
Network Topologies: Two- and Three-Dimensional Meshes
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound links (2-D torus); and (c) a 3-D mesh with no wraparound.
Network Topologies: Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.
Properties of Hypercubes
• The distance between any two nodes is at most log p.
• Each node has log p neighbors.
• The distance between two nodes is given by the number of bit positions at which the two node labels differ, and is therefore limited to log p.
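The distance property above is just the Hamming distance of the node labels; a minimal sketch (function name ours):

```python
# In a hypercube, node labels are log2(p)-bit strings; the distance between
# two nodes is the Hamming distance: the number of 1-bits in the XOR of
# their labels.

def hypercube_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# In a 4-dimensional hypercube (p = 16):
print(hypercube_distance(0b0000, 0b1111))  # 4: opposite corners, = log2(p)
print(hypercube_distance(0b0101, 0b0100))  # 1: neighbors differ in one bit
```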
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Network Topologies: Tree Properties
• The distance between any two nodes is no more than 2 log p.
• Links higher up the tree potentially carry more traffic than those at the lower levels.
• For this reason, a variant called a fat tree fattens the links as we go up the tree.
• Trees can be laid out in 2D with no wire crossings in Ω(√n · log n) area.
Network Topologies: Fat Trees
A fat tree network of 16 processing nodes. The bandwidth doubles each time we go up one level.
Evaluating Static Interconnection Networks
• Diameter: The distance between the farthest two nodes in the network. The diameter of a linear array is p − 1, that of a mesh is 2(√p − 1), that of a tree and a hypercube is log p, and that of a completely connected network is O(1).
• Bisection Width: The minimum number of wires you must cut to divide the network into two equal parts. The bisection width of a linear array and a tree is 1, that of a mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.
• Arc Connectivity: The minimum number of edges (arcs) that need to be removed to make the graph disconnected.
• Vertex Connectivity: The minimum number of vertices (nodes) that need to be removed to make the graph disconnected.
• Cost: The number of links or switches (whichever is asymptotically higher) is a meaningful measure of the cost. However, a number of other factors, such as the ability to lay out the network, the length of wires, etc., also factor into the cost.
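The closed-form diameters and bisection widths quoted above can be tabulated directly; a small sketch (ours), for p a power of two and a perfect square so that √p and log₂ p are exact:

```python
# Diameter and bisection width for four static topologies of p nodes,
# using the closed forms stated above.

import math

def metrics(p: int):
    return {
        "linear array":         {"diameter": p - 1,                   "bisection": 1},
        "2-D mesh":             {"diameter": 2 * (math.isqrt(p) - 1), "bisection": math.isqrt(p)},
        "hypercube":            {"diameter": int(math.log2(p)),       "bisection": p // 2},
        "completely connected": {"diameter": 1,                       "bisection": p * p // 4},
    }

for name, m in metrics(16).items():
    print(f"{name:22s} diameter={m['diameter']:3d} bisection={m['bisection']:3d}")
```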
Evaluating Static Interconnection Networks

Network                  Diameter         Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected     1                p²/4              p − 1              p(p − 1)/2
Star                     2                1                 1                  p − 1
Complete binary tree     2 log((p+1)/2)   1                 1                  p − 1
Linear array             p − 1            1                 1                  p − 1
2-D mesh, no wraparound  2(√p − 1)        √p                2                  2(p − √p)
2-D wraparound mesh      2⌊√p/2⌋          2√p               4                  2p
Hypercube                log p            p/2               log p              (p log p)/2
Wraparound k-ary d-cube  d⌊k/2⌋           2k^(d−1)          2d                 dp
Evaluating Dynamic Interconnection Networks

Network         Diameter   Bisection Width   Arc Connectivity   Cost (No. of switches)
Crossbar        1          p                 1                  p²
Omega network   log p      p/2               2                  (p/2)·log p
Dynamic tree    2 log p    1                 2                  p − 1
Message Passing Costs in Parallel Computers
• The total time to transfer a message over a network comprises the following:
  – Startup time (ts): Time spent at the sending and receiving nodes (executing the routing algorithm, programming routers, etc.).
  – Per-hop time (th): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.
  – Per-word transfer time (tw): This time includes all overheads that are determined by the length of the message. This includes the bandwidth of links, error checking and correction, etc.
Store-and-Forward Routing
• A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop.
• The total communication cost for a message of size m words to traverse l communication links is

    tcomm = ts + (th + tw·m)·l

• In most platforms, th is small and the above expression can be approximated by

    tcomm = ts + tw·m·l
Packet Routing
• Store-and-forward makes poor use of communication resources.
• Packet routing breaks messages into packets and pipelines them through the network.
• Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information.
• The total communication time for packet routing is approximated by:

    tcomm = ts + th·l + tw·m

• The factor tw accounts for overheads in packet headers.
Cut-Through Routing
• Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits.
• Since flits are typically small, the header information must be minimized.
• This is done by forcing all flits to take the same path, in sequence.
• A tracer message first programs all intermediate routers. All flits then take the same route.
• Error checks are performed on the entire message, as opposed to per-flit.
• No sequence numbers are needed.
Cut-Through Routing
• The total communication time for cut-through routing is approximated by:

    tcomm = ts + th·l + tw·m

• This is identical to packet routing; however, tw is typically much smaller.
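The two cost models can be compared directly; a sketch (ours) with made-up parameter values, showing why pipelining pays off: store-and-forward pays the per-word cost on every link, cut-through pays it only once.

```python
# The store-and-forward and cut-through cost models stated above.
# All parameter values below are hypothetical, chosen for illustration.

def store_and_forward(ts, th, tw, m, l):
    return ts + (th + tw * m) * l

def cut_through(ts, th, tw, m, l):
    return ts + th * l + tw * m

ts, th, tw = 100.0, 2.0, 1.0      # startup, per-hop, per-word times
m, l = 1000, 10                   # 1000-word message over 10 links
print(store_and_forward(ts, th, tw, m, l))  # 100 + (2 + 1000)*10 = 10120.0
print(cut_through(ts, th, tw, m, l))        # 100 + 20 + 1000    = 1120.0
```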
Routing Mechanisms for Interconnection Networks
How does one compute the route that a message takes from source to destination?
– Routing must prevent deadlocks; for this reason, we use dimension-ordered or e-cube routing.
– Routing must avoid hot-spots; for this reason, two-step routing is often used. In this case, a message from source s to destination d is first sent to a randomly chosen intermediate processor i and then forwarded to destination d.
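E-cube routing in a hypercube can be sketched in a few lines (ours): at each step, flip the lowest-order bit in which the current node and the destination still differ. Correcting dimensions in a fixed order is what makes the scheme deadlock-free.

```python
# Dimension-ordered (e-cube) routing in a hypercube: correct the differing
# label bits one dimension at a time, lowest dimension first.

def ecube_route(src: int, dst: int):
    """Return the list of nodes visited from src to dst."""
    path, node = [src], src
    diff = src ^ dst
    dim = 0
    while diff:
        if diff & 1:                 # this dimension still differs
            node ^= 1 << dim         # traverse the link along dimension dim
            path.append(node)
        diff >>= 1
        dim += 1
    return path

print([format(n, "04b") for n in ecube_route(0b0010, 0b1101)])
# ['0010', '0011', '0001', '0101', '1101']
```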
Case Studies: The IBM Blue Gene Architecture

Case Studies: The Cray T3E Architecture
Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.

Case Studies: The SGI Origin 3000 Architecture

The Cedar Architecture

MasPar MP-1