WU UCB CS252 SP17
CS252 Spring 2017: Graduate Computer Architecture
Lecture 9: Vector Supercomputers
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17
Last Time in Lecture 8
Overcoming the worst hazards in OoO superscalars:
• Branch prediction
- Bimodal
- Local/Branch History Table
- Global/gselect, gshare
- Tournament
- Branch address cache (predict multiple branches per cycle)
- Trace cache
- Return address predictors
• Today: Load/Store Queues, Vector Supercomputers
© Krste Asanovic, 2015. CS252, Spring 2015, Lecture 9
Load-Store Queue Design
§ After control hazards, data hazards through memory are probably the next most important bottleneck to superscalar performance
§ Modern superscalars use very sophisticated load-store reordering techniques to reduce effective memory latency by allowing loads to be speculatively issued
Speculative Store Buffer
§ Just like register updates, stores should not modify the memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.
§ During decode, store buffer slot allocated in program order
§ Stores split into “store address” and “store data” micro-operations
§ “Store address” execution writes tag
§ “Store data” execution writes data
§ Store commits when oldest instruction and both address and data available:
- clear speculative bit and eventually move data to cache
§ On store abort:
- clear valid bit
[Figure: speculative store buffer entries (Tag, Data, Speculative and Valid bits) sitting in front of the L1 data cache; store address and store data ports fill entries, and committed stores drain to the cache along the store commit path]
Load Bypass from Speculative Store Buffer
§ If data in both store buffer and cache, which should we use? The speculative store buffer
§ If same address in store buffer twice, which should we use? The youngest store older than the load
[Figure: load address checked against the speculative store buffer tags in parallel with the L1 data cache; on a match, load data is forwarded from the buffer instead of the cache]
Memory Dependencies
sd x1, (x2)
ld x3, (x4)
§ When can we execute the load?
In-Order Memory Queue
§ Execute all loads and stores in program order
§ => Load and store cannot leave ROB for execution until all previous loads and stores have completed execution
§ Can still execute loads and stores speculatively, and out-of-order with respect to other instructions
§ Need a structure to handle memory ordering…
Conservative O-o-O Load Execution
sd x1, (x2)
ld x3, (x4)
§ Can execute load before store, if addresses known and x4 != x2
§ Each load address compared with addresses of all previous uncommitted stores
- can use partial conservative check, i.e., bottom 12 bits of address, to save hardware
§ Don’t execute load if any previous store address not known
§ (MIPS R10K, 16-entry address queue)
Address Speculation
sd x1, (x2)
ld x3, (x4)
§ Guess that x4 != x2
§ Execute load before store address known
§ Need to hold all completed but uncommitted load/store addresses in program order
§ If subsequently find x4 == x2, squash load and all following instructions
§ => Large penalty for inaccurate address speculation
Memory Dependence Prediction (Alpha 21264)
sd x1, (x2)
ld x3, (x4)
§ Guess that x4 != x2 and execute load before store
§ If later find x4 == x2, squash load and all following instructions, but mark load instruction as store-wait
§ Subsequent executions of the same load instruction will wait for all previous stores to complete
§ Periodically clear store-wait bits
Supercomputer Applications
§ Typical application areas
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
- Bioinformatics
- Cryptography
§ All involve huge computations on large data sets
§ Supercomputers: CDC 6600, CDC 7600, Cray-1, …
§ In the 70s-80s, Supercomputer ≡ Vector Machine
Vector Supercomputers
§ Epitomized by Cray-1, 1976:
§ Scalar Unit
- Load/Store Architecture
§ Vector Extension
- Vector Registers
- Vector Instructions
§ Implementation
- Hardwired Control
- Highly Pipelined Functional Units
- Interleaved Memory System
- No Data Caches
- No Virtual Memory
[© Cray Research, 1976]
Cray-1 internals displayed at EPFL
Photograph by Rama, Wikimedia Commons, CC-BY-SA-2.0-fr
Vector Programming Model
§ Scalar registers x0–x31; vector registers v0–v31, each holding elements [0] … [VLRMAX-1]
§ VLR: Vector Length Register
§ Vector arithmetic instructions, e.g. vadd v3, v1, v2, operate elementwise on elements [0] … [VLR-1]
§ Vector load and store instructions, e.g. vld v1, x1, x2, move a vector between memory and a vector register using a base address (x1) and a stride (x2)
[Figure: vector registers, the VLR, elementwise addition of v1 and v2 into v3, and a strided load gathering memory words into vector register v1]
Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
  li x4, 64
loop:
  fld f1, 0(x1)
  fld f2, 0(x2)
  fadd.d f3, f1, f2
  fsd f3, 0(x3)
  addi x1, 8
  addi x2, 8
  addi x3, 8
  subi x4, 1
  bnez x4, loop

# Vector Code
  li x4, 64
  setvlr x4
  vfld v1, x1
  vfld v2, x2
  vfadd.d v3, v1, v2
  vfsd v3, x3
Cray-1 (1976)
§ Single-port memory: 16 banks of 64-bit words + 8-bit SECDED
§ 80 MW/sec data load/store; 320 MW/sec instruction buffer refill; 4 instruction buffers (64-bit x 16)
§ Registers: 8 scalar (S0–S7) backed by 64 T regs; 8 address (A0–A7) backed by 64 B regs; 8 vector registers (V0–V7) of 64 elements each, with V.Mask and V.Length
§ Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul
§ Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)
[Figure: Cray-1 register and functional-unit block diagram]
Vector Instruction Set Advantages
§ Compact
- one short instruction encodes N operations
§ Expressive, tells hardware that these N operations:
- are independent
- use the same functional unit
- access disjoint registers
- access registers in the same pattern as previous instructions
- access a contiguous block of memory (unit-stride load/store)
- access memory in a known pattern (strided load/store)
§ Scalable
- can run same code on more parallel pipelines (lanes)
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: six-stage multiply pipeline streaming elements of v1 and v2 through to compute v3 <- v1 * v2]
Vector Instruction Execution
vfadd.d vc, va, vb
[Figure: execution using one pipelined functional unit, retiring one element per cycle (C[0], C[1], C[2] written while A[3]+B[3] through A[6]+B[6] are in flight), versus execution using four pipelined functional units, retiring four elements per cycle (C[0]–C[3], C[4]–C[7], C[8]–C[11] written while A[12]+B[12] through A[27]+B[27] are in flight)]
Interleaved Vector Memory System
§ Bank busy time: time before bank is ready to accept the next request
§ Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
[Figure: address generator adds base and stride from vector registers and distributes successive element addresses across the 16 memory banks (0–F)]
Vector Unit Structure
[Figure: four lanes, each containing a slice of the vector registers and a slice of each functional unit, all connected to the memory subsystem; lane 0 holds elements 0, 4, 8, …, lane 1 holds elements 1, 5, 9, …, lane 2 holds elements 2, 6, 10, …, lane 3 holds elements 3, 7, 11, …]
T0 Vector Microprocessor (UCB/ICSI, 1995)
[Figure: vector register elements striped over eight lanes; lane 0 holds elements [0], [8], [16], [24], lane 1 holds [1], [9], [17], [25], …, lane 7 holds [7], [15], [23], [31]]
Vector Instruction Parallelism
§ Can overlap execution of multiple vector instructions
- example machine has 32 elements per vector register and 8 lanes
[Figure: over time, successive loads occupy the load unit, multiplies the multiply unit, and adds the add unit, with instruction issue of one per cycle keeping all three units busy]
§ Complete 24 operations/cycle while issuing 1 short instruction/cycle
Vector Chaining
§ Vector version of register bypassing
- introduced with Cray-1
[Figure: load unit fills V1 from memory, which chains into the multiplier producing V3 from V1 and V2, which in turn chains into the adder producing V5 from V3 and V4]

vld v1
vfmul v3, v1, v2
vfadd v5, v3, v4
Vector Chaining Advantage
• Without chaining, must wait for last element of result to be written before starting dependent instruction
• With chaining, can start dependent instruction as soon as first result appears
[Figure: timeline of Load, Mul, Add executed back-to-back without chaining versus overlapped with chaining]
Vector Startup
§ Two components of vector startup penalty
- functional unit latency (time through pipeline)
- dead time or recovery time (time before another vector instruction can start down pipeline)
[Figure: per-element pipeline diagram (R X X X W) showing the functional unit latency of the first vector instruction, the dead time after its last element, and the start of the second vector instruction]
Dead Time and Short Vectors
§ Cray C90, two lanes: 4 cycles dead time after 64 cycles active; maximum efficiency 94% even with 128-element vectors
§ T0, eight lanes: no dead time; 100% efficiency with 8-element vectors
Vector Memory-Memory versus Vector Register Machines
§ Vector memory-memory instructions hold all vector operands in main memory
§ The first vector machines, CDC Star-100 (’73) and TI ASC (’71), were memory-memory machines
§ Cray-1 (’76) was first vector register machine

# Example Source Code
for (i=0; i<N; i++) {
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

# Vector Memory-Memory Code
ADDV C, A, B
SUBV D, A, B

# Vector Register Code
LV V1, A
LV V2, B
ADDV V3, V1, V2
SV V3, C
SUBV V4, V1, V2
SV V4, D
Vector Memory-Memory vs. Vector Register Machines
§ Vector memory-memory architectures (VMMAs) require greater main memory bandwidth, why?
- All operands must be read in and out of memory
§ VMMAs make it difficult to overlap execution of multiple vector operations, why?
- Must check dependencies on memory addresses
§ VMMAs incur greater startup latency
- Scalar code was faster on CDC Star-100 for vectors < 100 elements
- For Cray-1, vector/scalar breakeven point was around 2-4 elements
§ Apart from CDC follow-ons (Cyber-205, ETA-10), all major vector machines since Cray-1 have had vector register architectures
§ (we ignore vector memory-memory from now on)
Automatic Code Vectorization
for (i=0; i < N; i++)
  C[i] = A[i] + B[i];
§ Scalar sequential code: each iteration performs load, load, add, store in turn
§ Vectorized code: one vector load covers the loads of all iterations, then one vector add, then one vector store
§ Vectorization is a massive compile-time reordering of operation sequencing => requires extensive loop dependence analysis
[Figure: time-ordered operation graphs for scalar sequential code (Iter. 1, Iter. 2, …) versus vectorized code, where each vector instruction spans all iterations]
Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, “stripmining”

for (i=0; i<N; i++)
  C[i] = A[i]+B[i];

  andi x1, xN, 63    # N mod 64
  setvlr x1          # Do remainder
loop:
  vld v1, xA
  sll x2, x1, 3      # Multiply by 8
  add xA, x2         # Bump pointer
  vld v2, xB
  add xB, x2
  vfadd.d v3, v1, v2
  vsd v3, xC
  add xC, x2
  sub xN, x1         # Subtract elements
  li x1, 64
  setvlr x1          # Reset full length
  bgtz xN, loop      # Any more to do?

[Figure: A, B, C vectors processed as a remainder chunk followed by full 64-element chunks]
Vector Conditional Execution
Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
  if (A[i] > 0) then
    A[i] = B[i];
Solution: Add vector mask (or flag) registers
- vector version of predicate registers, 1 bit per element
…and maskable vector instructions
- vector operation becomes bubble (“NOP”) at elements where mask bit is clear

Code example:
cvm                # Turn on all elements
vld vA, xA         # Load entire A vector
vfsgts.d vA, f0    # Set bits in mask register where A>0
vld vA, xB         # Load B vector into A under mask
vsd vA, xA         # Store A back to memory under mask
Masked Vector Instructions
§ Simple implementation
- execute all N operations, turn off result writeback according to mask
§ Density-time implementation
- scan mask vector and only execute elements with non-zero masks
[Figure: with mask bits M[0]=0, M[1]=1, M[2]=0, M[3]=0, M[4]=1, M[5]=1, M[6]=0, M[7]=1, the simple implementation computes A[i]+B[i] for every element and gates the write-enable on the write data port, while the density-time implementation skips straight to elements 1, 4, 5, and 7]
Compress/Expand Operations
§ Compress packs non-masked elements from one vector register contiguously at start of destination vector register
- population count of mask vector gives packed vector length
§ Expand performs inverse operation
§ Used for density-time conditionals and also for general selection operations
[Figure: with mask bits set for elements 1, 4, 5, and 7, compress packs A[1], A[4], A[5], A[7] to the front of the destination; expand scatters them back to positions 1, 4, 5, 7, leaving the B values at the unmasked positions]
Vector Reductions
Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
  sum += A[i];   # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary tree to perform reduction
# Rearrange as:
sum[0:VL-1] = 0                  # Vector of VL partial sums
for (i=0; i<N; i+=VL)            # Stripmine VL-sized chunks
  sum[0:VL-1] += A[i:i+VL-1];    # Vector sum
# Now have VL partial sums in one vector register
do {
  VL = VL/2;                     # Halve vector length
  sum[0:VL-1] += sum[VL:2*VL-1]  # Halve no. of partials
} while (VL>1)
Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
for (i=0; i<N; i++)
  A[i] = B[i] + C[D[i]]

Indexed load instruction (gather):
vld vD, xD         # Load indices in D vector
vldi vC, xC, vD    # Load indirect from xC base
vld vB, xB         # Load B vector
vfadd.d vA, vB, vC # Do add
vsd vA, xA         # Store result
Vector Scatter/Gather
Histogram example:
for (i=0; i<N; i++)
  A[B[i]]++;

Is the following a correct translation?
vld vB, xB        # Load indices in B vector
vldi vA, xA, vB   # Gather initial A values
vadd vA, vA, 1    # Increment
vsdi vA, xA, vB   # Scatter incremented values
Vector Memory Models
§ Most vector machines have a very relaxed memory model, e.g.
vsd v1, x1   # Store vector to x1
vld v2, x1   # Load vector from x1
- No guarantee that elements of v2 will have the value of elements of v1, even when store and load execute on the same processor!
§ Requires explicit memory barrier or fence
vsd v1, x1       # Store vector to x1
fence.vs.vl      # Enforce ordering s->l
vld v2, x1       # Load vector from x1
§ Vector machines support highly parallel memory systems (multiple lanes and multiple load and store units) with long latency (100+ clock cycles)
- hardware coherence checks would be prohibitively expensive
- vectorizing compiler can eliminate most dependencies
A Recent Vector Super: NEC SX-9 (2008)
§ 65nm CMOS technology
§ Vector unit (3.2 GHz)
- 8 foreground VRegs + 64 background VRegs (256 x 64-bit elements/VReg)
- 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit
- 8 lanes (32+ FLOPS/cycle, 100+ GFLOPS peak per CPU)
- 1 load or store unit (8 x 8-byte accesses/cycle)
§ Scalar unit (1.6 GHz)
- 4-way superscalar with out-of-order and speculative execution
- 64KB I-cache and 64KB data cache
§ Memory system provides 256 GB/s DRAM bandwidth per CPU
§ Up to 16 CPUs and up to 1 TB DRAM form a shared-memory node
- total of 4 TB/s bandwidth to shared DRAM memory
§ Up to 512 nodes connected via 128 GB/s network links (message passing between nodes)
[© NEC]
[New announcement: SX-ACE, 4 x 16-lane vector CPUs on one chip]
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)