WU UCB CS252 SP17
CS252 Spring 2017: Graduate Computer Architecture
Lecture 9: Vector Supercomputers
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17
Last Time in Lecture 8
Overcoming the worst hazards in OoO superscalars:
• Branch prediction
- Bimodal
- Local/Branch History Table
- Global/gselect, gshare
- Tournament
- Branch address cache (predict multiple branches per cycle)
- Trace cache
- Return address predictors
• Today: Load/Store Queues, Vector Supercomputers
© Krste Asanovic, 2015. CS252, Spring 2015, Lecture 9
Load-Store Queue Design
§ After control hazards, data hazards through memory are probably the next most important bottleneck to superscalar performance
§ Modern superscalars use very sophisticated load-store reordering techniques to reduce effective memory latency by allowing loads to be speculatively issued
Speculative Store Buffer
§ Just like register updates, stores should not modify the memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.
§ During decode, store buffer slot allocated in program order
§ Stores split into “store address” and “store data” micro-operations
§ “Store address” execution writes tag
§ “Store data” execution writes data
§ Store commits when oldest instruction and both address and data available:
- clear speculative bit and eventually move data to cache
§ On store abort:
- clear valid bit
[Figure: speculative store buffer entries (Tag, Data, Speculative and Valid bits) sitting in front of the L1 data cache; store address and store data ports fill entries, and committed stores drain to the cache along the store commit path]
Load Bypass from Speculative Store Buffer
§ If data in both store buffer and cache, which should we use? The speculative store buffer
§ If same address in store buffer twice, which should we use? The youngest store older than the load
[Figure: load address checked against the speculative store buffer tags in parallel with the L1 data cache; on a match, load data is forwarded from the buffer instead of the cache]
Memory Dependencies
sd x1, (x2)
ld x3, (x4)
§ When can we execute the load?
In-Order Memory Queue
§ Execute all loads and stores in program order
§ => Load and store cannot leave ROB for execution until all previous loads and stores have completed execution
§ Can still execute loads and stores speculatively, and out-of-order with respect to other instructions
§ Need a structure to handle memory ordering…
Conservative O-o-O Load Execution
sd x1, (x2)
ld x3, (x4)
§ Can execute load before store, if addresses known and x4 != x2
§ Each load address compared with addresses of all previous uncommitted stores
- can use partial conservative check, i.e., bottom 12 bits of address, to save hardware
§ Don’t execute load if any previous store address not known
§ (MIPS R10K, 16-entry address queue)
Address Speculation
sd x1, (x2)
ld x3, (x4)
§ Guess that x4 != x2
§ Execute load before store address known
§ Need to hold all completed but uncommitted load/store addresses in program order
§ If subsequently find x4 == x2, squash load and all following instructions
§ => Large penalty for inaccurate address speculation
Memory Dependence Prediction (Alpha 21264)
sd x1, (x2)
ld x3, (x4)
§ Guess that x4 != x2 and execute load before store
§ If later find x4 == x2, squash load and all following instructions, but mark load instruction as store-wait
§ Subsequent executions of the same load instruction will wait for all previous stores to complete
§ Periodically clear store-wait bits
Supercomputer Applications
§ Typical application areas
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
- Bioinformatics
- Cryptography
§ All involve huge computations on large data sets
§ Supercomputers: CDC 6600, CDC 7600, Cray-1, …
§ In the 70s-80s, Supercomputer ≡ Vector Machine
Vector Supercomputers
§ Epitomized by Cray-1, 1976:
§ Scalar Unit
- Load/Store Architecture
§ Vector Extension
- Vector Registers
- Vector Instructions
§ Implementation
- Hardwired Control
- Highly Pipelined Functional Units
- Interleaved Memory System
- No Data Caches
- No Virtual Memory
[© Cray Research, 1976]
Cray-1 internals displayed at EPFL
Photograph by Rama, Wikimedia Commons, CC-BY-SA-2.0-fr
Vector Programming Model
§ Scalar registers x0–x31; vector registers v0–v31, each holding elements [0] … [VLRMAX-1]
§ VLR: Vector Length Register
§ Vector arithmetic instructions, e.g. vadd v3, v1, v2, operate elementwise on elements [0] … [VLR-1]
§ Vector load and store instructions, e.g. vld v1, x1, x2, move a vector between memory and a vector register using a base address (x1) and a stride (x2)
[Figure: vector registers, the VLR, elementwise addition of v1 and v2 into v3, and a strided load gathering memory words into vector register v1]
Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
  li x4, 64
loop:
  fld f1, 0(x1)
  fld f2, 0(x2)
  fadd.d f3, f1, f2
  fsd f3, 0(x3)
  addi x1, 8
  addi x2, 8
  addi x3, 8
  subi x4, 1
  bnez x4, loop

# Vector Code
  li x4, 64
  setvlr x4
  vfld v1, x1
  vfld v2, x2
  vfadd.d v3, v1, v2
  vfsd v3, x3
Cray-1 (1976)
§ Single-port memory: 16 banks of 64-bit words + 8-bit SECDED
§ 80 MW/sec data load/store; 320 MW/sec instruction buffer refill; 4 instruction buffers (64-bit x 16)
§ Registers: 8 scalar (S0–S7) backed by 64 T regs; 8 address (A0–A7) backed by 64 B regs; 8 vector registers (V0–V7) of 64 elements each, with V.Mask and V.Length
§ Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul
§ Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)
[Figure: Cray-1 register and functional-unit block diagram]
Vector Instruction Set Advantages
§ Compact
- one short instruction encodes N operations
§ Expressive, tells hardware that these N operations:
- are independent
- use the same functional unit
- access disjoint registers
- access registers in the same pattern as previous instructions
- access a contiguous block of memory (unit-stride load/store)
- access memory in a known pattern (strided load/store)
§ Scalable
- can run same code on more parallel pipelines (lanes)
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: six-stage multiply pipeline streaming elements of v1 and v2 through to compute v3 <- v1 * v2]
Vector Instruction Execution
vfadd.d vc, va, vb
[Figure: execution using one pipelined functional unit, retiring one element per cycle (C[0], C[1], C[2] written while A[3]+B[3] through A[6]+B[6] are in flight), versus execution using four pipelined functional units, retiring four elements per cycle (C[0]–C[3], C[4]–C[7], C[8]–C[11] written while A[12]+B[12] through A[27]+B[27] are in flight)]
Interleaved Vector Memory System
§ Bank busy time: time before bank is ready to accept the next request
§ Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
[Figure: address generator adds base and stride from vector registers and distributes successive element addresses across the 16 memory banks (0–F)]
Vector Unit Structure
[Figure: four lanes, each containing a slice of the vector registers and a slice of each functional unit, all connected to the memory subsystem; lane 0 holds elements 0, 4, 8, …, lane 1 holds elements 1, 5, 9, …, lane 2 holds elements 2, 6, 10, …, lane 3 holds elements 3, 7, 11, …]
T0 Vector Microprocessor (UCB/ICSI, 1995)
[Figure: vector register elements striped over eight lanes; lane 0 holds elements [0], [8], [16], [24], lane 1 holds [1], [9], [17], [25], …, lane 7 holds [7], [15], [23], [31]]
Vector Instruction Parallelism
§ Can overlap execution of multiple vector instructions
- example machine has 32 elements per vector register and 8 lanes
[Figure: over time, successive loads occupy the load unit, multiplies the multiply unit, and adds the add unit, with instruction issue of one per cycle keeping all three units busy]
§ Complete 24 operations/cycle while issuing 1 short instruction/cycle
Vector Chaining
§ Vector version of register bypassing
- introduced with Cray-1
[Figure: load unit fills V1 from memory, which chains into the multiplier producing V3 from V1 and V2, which in turn chains into the adder producing V5 from V3 and V4]

vld v1
vfmul v3, v1, v2
vfadd v5, v3, v4
Vector Chaining Advantage
• Without chaining, must wait for last element of result to be written before starting dependent instruction
• With chaining, can start dependent instruction as soon as first result appears
[Figure: timeline of Load, Mul, Add executed back-to-back without chaining versus overlapped with chaining]
Vector Startup
§ Two components of vector startup penalty
- functional unit latency (time through pipeline)
- dead time or recovery time (time before another vector instruction can start down pipeline)
[Figure: per-element pipeline diagram (R X X X W) showing the functional unit latency of the first vector instruction, the dead time after its last element, and the start of the second vector instruction]
Dead Time and Short Vectors
§ Cray C90, two lanes: 4 cycles dead time after 64 cycles active; maximum efficiency 94% even with 128-element vectors
§ T0, eight lanes: no dead time; 100% efficiency with 8-element vectors
Vector Memory-Memory versus Vector Register Machines
§ Vector memory-memory instructions hold all vector operands in main memory
§ The first vector machines, CDC Star-100 (’73) and TI ASC (’71), were memory-memory machines
§ Cray-1 (’76) was first vector register machine

# Example Source Code
for (i=0; i<N; i++) {
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

# Vector Memory-Memory Code
ADDV C, A, B
SUBV D, A, B

# Vector Register Code
LV V1, A
LV V2, B
ADDV V3, V1, V2
SV V3, C
SUBV V4, V1, V2
SV V4, D
Vector Memory-Memory vs. Vector Register Machines
§ Vector memory-memory architectures (VMMAs) require greater main memory bandwidth, why?
- All operands must be read in and out of memory
§ VMMAs make it difficult to overlap execution of multiple vector operations, why?
- Must check dependencies on memory addresses
§ VMMAs incur greater startup latency
- Scalar code was faster on CDC Star-100 for vectors < 100 elements
- For Cray-1, vector/scalar breakeven point was around 2-4 elements
§ Apart from CDC follow-ons (Cyber-205, ETA-10), all major vector machines since Cray-1 have had vector register architectures
§ (we ignore vector memory-memory from now on)
Automatic Code Vectorization
for (i=0; i < N; i++)
  C[i] = A[i] + B[i];
§ Scalar sequential code: each iteration performs load, load, add, store in turn
§ Vectorized code: one vector load covers the loads of all iterations, then one vector add, then one vector store
§ Vectorization is a massive compile-time reordering of operation sequencing => requires extensive loop dependence analysis
[Figure: time-ordered operation graphs for scalar sequential code (Iter. 1, Iter. 2, …) versus vectorized code, where each vector instruction spans all iterations]
Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, “stripmining”

for (i=0; i<N; i++)
  C[i] = A[i]+B[i];

  andi x1, xN, 63    # N mod 64
  setvlr x1          # Do remainder
loop:
  vld v1, xA
  sll x2, x1, 3      # Multiply by 8
  add xA, x2         # Bump pointer
  vld v2, xB
  add xB, x2
  vfadd.d v3, v1, v2
  vsd v3, xC
  add xC, x2
  sub xN, x1         # Subtract elements
  li x1, 64
  setvlr x1          # Reset full length
  bgtz xN, loop      # Any more to do?

[Figure: A, B, C vectors processed as a remainder chunk followed by full 64-element chunks]
Vector Conditional Execution
Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
  if (A[i] > 0) then
    A[i] = B[i];
Solution: Add vector mask (or flag) registers
- vector version of predicate registers, 1 bit per element
…and maskable vector instructions
- vector operation becomes bubble (“NOP”) at elements where mask bit is clear

Code example:
cvm                # Turn on all elements
vld vA, xA         # Load entire A vector
vfsgts.d vA, f0    # Set bits in mask register where A>0
vld vA, xB         # Load B vector into A under mask
vsd vA, xA         # Store A back to memory under mask
Masked Vector Instructions
§ Simple implementation
- execute all N operations, turn off result writeback according to mask
§ Density-time implementation
- scan mask vector and only execute elements with non-zero masks
[Figure: with mask bits M[0]=0, M[1]=1, M[2]=0, M[3]=0, M[4]=1, M[5]=1, M[6]=0, M[7]=1, the simple implementation computes A[i]+B[i] for every element and gates the write-enable on the write data port, while the density-time implementation skips straight to elements 1, 4, 5, and 7]
Compress/Expand Operations
§ Compress packs non-masked elements from one vector register contiguously at start of destination vector register
- population count of mask vector gives packed vector length
§ Expand performs inverse operation
§ Used for density-time conditionals and also for general selection operations
[Figure: with mask bits set for elements 1, 4, 5, and 7, compress packs A[1], A[4], A[5], A[7] to the front of the destination; expand scatters them back to positions 1, 4, 5, 7, leaving the B values at the unmasked positions]
Vector Reductions
Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
  sum += A[i];   # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary tree to perform reduction
# Rearrange as:
sum[0:VL-1] = 0                  # Vector of VL partial sums
for (i=0; i<N; i+=VL)            # Stripmine VL-sized chunks
  sum[0:VL-1] += A[i:i+VL-1];    # Vector sum
# Now have VL partial sums in one vector register
do {
  VL = VL/2;                     # Halve vector length
  sum[0:VL-1] += sum[VL:2*VL-1]  # Halve no. of partials
} while (VL>1)
Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
for (i=0; i<N; i++)
  A[i] = B[i] + C[D[i]]

Indexed load instruction (gather):
vld vD, xD         # Load indices in D vector
vldi vC, xC, vD    # Load indirect from xC base
vld vB, xB         # Load B vector
vfadd.d vA, vB, vC # Do add
vsd vA, xA         # Store result
Vector Scatter/Gather
Histogram example:
for (i=0; i<N; i++)
  A[B[i]]++;

Is the following a correct translation?
vld vB, xB        # Load indices in B vector
vldi vA, xA, vB   # Gather initial A values
vadd vA, vA, 1    # Increment
vsdi vA, xA, vB   # Scatter incremented values
Vector Memory Models
§ Most vector machines have a very relaxed memory model, e.g.
vsd v1, x1   # Store vector to x1
vld v2, x1   # Load vector from x1
- No guarantee that elements of v2 will have the value of elements of v1, even when store and load execute on the same processor!
§ Requires explicit memory barrier or fence
vsd v1, x1       # Store vector to x1
fence.vs.vl      # Enforce ordering s->l
vld v2, x1       # Load vector from x1
§ Vector machines support highly parallel memory systems (multiple lanes and multiple load and store units) with long latency (100+ clock cycles)
- hardware coherence checks would be prohibitively expensive
- vectorizing compiler can eliminate most dependencies
A Recent Vector Super: NEC SX-9 (2008)
§ 65nm CMOS technology
§ Vector unit (3.2 GHz)
- 8 foreground VRegs + 64 background VRegs (256 x 64-bit elements/VReg)
- 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit
- 8 lanes (32+ FLOPS/cycle, 100+ GFLOPS peak per CPU)
- 1 load or store unit (8 x 8-byte accesses/cycle)
§ Scalar unit (1.6 GHz)
- 4-way superscalar with out-of-order and speculative execution
- 64KB I-cache and 64KB data cache
§ Memory system provides 256 GB/s DRAM bandwidth per CPU
§ Up to 16 CPUs and up to 1 TB DRAM form a shared-memory node
- total of 4 TB/s bandwidth to shared DRAM memory
§ Up to 512 nodes connected via 128 GB/s network links (message passing between nodes)
[© NEC]
[New announcement: SX-ACE, 4 x 16-lane vector CPUs on one chip]
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)