CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Caches Part 2
Instructors: Bernhard Boser & Randy H. Katz
http://inst.eecs.berkeley.edu/~cs61c/
10/18/16, Fall 2016, Lecture #15

Outline
• Cache Organization and Principles
• Write Back vs. Write Through
• Cache Performance
• Cache Design Tradeoffs
• And in Conclusion…

Typical Memory Hierarchy
[Figure: on-chip components (Control, Datapath, RegFile, Instr Cache, Data Cache, Second-Level Cache (SRAM), Third-Level Cache (SRAM)), followed by Main Memory (DRAM) and Secondary Memory (Disk or Flash).]
Speed (cycles), fastest to slowest level: ½'s, 1's, 10's, 100's-1000, 1,000,000's
Size (bytes), smallest to largest level: 100's, 10K's, M's, G's, T's
Cost/bit: highest at the fastest level, lowest at the slowest level
• Principle of locality + memory hierarchy presents the programmer with ≈ as much memory as is available in the cheapest technology at ≈ the speed offered by the fastest technology

Adding Cache to Computer
[Figure: Processor (Control; Datapath with PC, Registers, and Arithmetic & Logic Unit (ALU)) connected through a Cache to Memory (Program, Data, Bytes) over the Processor-Memory Interface (Enable?, Read/Write, Address, Write Data, Read Data); Input and Output attach through the I/O-Memory Interfaces.]
• Processor organized around words and bytes
• Memory (including cache) organized around blocks, which are typically multiple words

Key Cache Concepts
• Principle of Locality
– Temporal Locality and Spatial Locality
• Hierarchy of Memories (speed/size/cost per bit) to exploit locality
• Cache: copy of data in lower level of memory hierarchy
• Direct Mapped: find block in cache using Tag field, with Valid bit for Hit
• Cache Design Organization Choices:
– Fully Associative, Set-Associative, Direct-Mapped

Cache Organizations
• "Fully Associative": Block placed anywhere in cache
– First design last lecture
– Note: No Index field, but one comparator/block
• "Direct Mapped": Block goes only one place in cache
– Note: Only one comparator
– Number of sets = number of blocks
• "N-way Set Associative": N places for block in cache
– Number of sets = Number of Blocks / N
– N comparators
– Fully Associative: N = number of blocks
– Direct Mapped: N = 1

Memory Block vs. Word Addressing
[Figure: the same 32 bytes of memory (addresses 00000 through 11111) drawn three ways: with byte addresses, with word addresses (2 LSBs are 0), and with 8-byte block addresses (3 LSBs are 0). Each address splits into a block # (0-3) and a byte offset in block (0-7).]

Memory Block Number Aliasing
12-bit memory addresses, 16-Byte blocks

Address (binary)   Block #   Block # mod 8   Block # mod 2
010100100000       82        2               0
010100110000       83        3               1
010101000000       84        4               0
010101010000       85        5               1
010101100000       86        6               0
010101110000       87        7               1
010110000000       88        0               0
010110010000       89        1               1
010110100000       90        2               0
010110110000       91        3               1

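The aliasing table can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name block_number is made up for this example, and the 16-byte block size is taken from the slide.

```python
def block_number(byte_address, block_bytes=16):
    """Return the memory block number that contains byte_address."""
    return byte_address // block_bytes


# First and ninth addresses from the table (12-bit addresses, 16-byte blocks).
first = int("010100100000", 2)
ninth = int("010110100000", 2)

for address in (first, ninth):
    block = block_number(address)
    # Block # mod (number of cache blocks) is where the block would land.
    print(format(address, "012b"), block, block % 8, block % 2)
```

Blocks 82 and 90 differ by 8, so they alias to the same location in an 8-block cache (both give 2 for Block # mod 8).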
Processor Address Fields used by Cache Controller
• Block Offset: Byte address within block
• Set Index: Selects which set
• Tag: Remaining portion of processor address
• Size of Index = log2(number of sets)
• Size of Tag = Address size - Size of Index - log2(number of bytes/block)
Processor Address (32 bits total): | Tag | Set Index | Block offset |

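As a rough illustration of how a cache controller slices an address, the sketch below extracts the three fields with shifts and masks. The function name split_address and the example field widths (10 index bits, 2 offset bits) are assumptions chosen for the example, not part of the slide.

```python
def split_address(address, index_bits, offset_bits):
    """Split a processor address into (tag, set_index, block_offset)."""
    block_offset = address & ((1 << offset_bits) - 1)
    set_index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)  # remaining upper bits
    return tag, set_index, block_offset


# Hypothetical 32-bit address with 1024 sets and 4-byte blocks.
tag, set_index, block_offset = split_address(0x12345678, index_bits=10, offset_bits=2)
print(hex(tag), set_index, block_offset)
```

Concatenating the three fields again, tag first and offset last, gives back the original address, which is a handy sanity check when choosing field widths.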
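Direct-Mapped Cache Revisited
• One-word blocks, cache size = 1K words (or 4 KB)
• Address fields: 20-bit Tag (bits 31-12), 10-bit Index (bits 11-2), 2-bit Byte offset (bits 1-0)
[Figure: cache array with entries 0 through 1023, each holding a Valid bit, a 20-bit Tag, and 32 bits of Data; a comparator checks the stored tag against the address tag to produce Hit.]
• Valid bit ensures something useful in cache for this index
• Compare Tag with upper part of Address to see if a Hit
• Read data from cache instead of memory if a Hit
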
Four-Way Set-Associative Cache
• 2^8 = 256 sets, each with four ways (each with one block)
• Address fields: 22-bit Tag, 8-bit Set Index, 2-bit Byte offset
[Figure: Way 0 through Way 3, each an array of entries 0 through 255 holding V, Tag, and Data; the Set Index selects one entry per way, the tag comparisons produce Hit, and a 4x1 select delivers 32 bits of Data.]

Handling Stores with Write-Through
• Store instructions write to memory, changing values
• Need to make sure cache and memory have same values on writes: two policies
1) Write-Through Policy: write cache and write through the cache to memory
– Every write eventually gets to memory
– Too slow, so include Write Buffer to allow processor to continue once data in Buffer
– Buffer updates memory in parallel to processor

Write-Through Cache
• Write both values in cache and in memory
• Write buffer stops CPU from stalling if memory cannot keep up
• Write buffer may have multiple entries to absorb bursts of writes
• What if store misses in cache?
[Figure: Processor connected to Cache, and Cache to Memory, each over a 32-bit Address and 32-bit Data path; a Write Buffer of Addr/Data entries sits between Cache and Memory.]

Handling Stores with Write-Back
2) Write-Back Policy: write only to cache and then write cache block back to memory when evict block from cache
– Writes collected in cache, only single write to memory per block
– Include bit to see if wrote to block or not, and then only write back if bit is set
• Called "Dirty" bit (writing makes it "dirty")

Write-Back Cache
• Store/cache hit, write data in cache only and set dirty bit
– Memory has stale value
• Store/cache miss, read data from memory, then update and set dirty bit
– "Write-allocate" policy
• Load/cache hit, use value from cache
• On any miss, write back evicted block, only if dirty. Update cache with new block and clear dirty bit
[Figure: Processor connected to Cache, and Cache to Memory, each over a 32-bit Address and 32-bit Data path; each cache entry carries a Dirty bit (D).]

Write-Through vs. Write-Back
• Write-Through:
– Simpler control logic
– More predictable timing simplifies processor control logic
– Easier to make reliable, since memory always has copy of data (big idea: Redundancy!)
• Write-Back:
– More complex control logic
– More variable timing (0, 1, 2 memory accesses per cache access)
– Usually reduces write traffic
– Harder to make reliable, sometimes cache has only copy of data

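To see why write-back usually reduces write traffic, the sketch below counts how many writes reach memory under each policy for a sequence of stores. It is a simplified model, not the lecture's design: it assumes a direct-mapped cache with one-word blocks, write-allocate on a write-back miss, and a final flush of dirty blocks. The function name count_memory_writes is made up for this example.

```python
def count_memory_writes(store_blocks, num_blocks, policy):
    """Count writes that reach memory for a sequence of store instructions.

    store_blocks: block numbers written by successive stores.
    num_blocks:   size of the (assumed direct-mapped) cache, in blocks.
    policy:       "write-through" or "write-back".
    """
    if policy not in ("write-through", "write-back"):
        raise ValueError("unknown policy: " + policy)

    if policy == "write-through":
        # Every store is also sent to memory (through the write buffer).
        return len(store_blocks)

    tags = [None] * num_blocks    # None means the entry is not valid
    dirty = [False] * num_blocks
    memory_writes = 0
    for block in store_blocks:
        index = block % num_blocks
        tag = block // num_blocks
        if tags[index] != tag:
            # Miss: write the evicted block back only if it is dirty.
            if tags[index] is not None and dirty[index]:
                memory_writes += 1
            tags[index] = tag     # write-allocate the new block
        dirty[index] = True       # the store makes the cached block dirty

    # Assumption: dirty blocks still in the cache are flushed once at the end.
    return memory_writes + sum(dirty)


stores = [0, 0, 0, 1, 1, 4, 0]
print(count_memory_writes(stores, 4, "write-through"))  # 7
print(count_memory_writes(stores, 4, "write-back"))     # 4
```

Repeated stores to the same block are absorbed by the cache under write-back, so only evictions of dirty blocks (and the final flush assumed here) cost a memory write.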
Administrivia
• Midterm #2 2 weeks away! November 1!
– In class! 3:40-5 PM
– Synchronous digital design and Project 3 (processor design) included
– Pipelines and Caches
– ONE Double-sided Crib sheet
– Review Session, Sunday, 10/30, 1-3 PM, 10 Evans

iClicker Saga
iClicker and EPA
• No longer taking attendance in lecture, but we hope you will continue to come anyway
• Continue to use Clicker questions in lecture to help you test your understanding
• EPA will be based on a "holistic" assessment of lecture, piazza, guerrilla and tutoring sessions, office hours, discussion, and lab participation
• EPA will be calculated so as to only help your course grade, never hurt it

Cache (Performance) Terms
• Hit rate: fraction of accesses that hit in the cache
• Miss rate: 1 - Hit rate
• Miss penalty: time to replace a block from lower level in memory hierarchy to cache
• Hit time: time to access cache memory (including tag comparison)
• Abbreviation: "$" = cache (a Berkeley innovation!)

Average Memory Access Time (AMAT)
• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses in the cache
AMAT = Time for a hit + Miss rate × Miss penalty

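The AMAT formula is simple enough to evaluate directly. The sketch below is one way to write it in Python; the function name amat and the example numbers (1-cycle hit, 5% miss rate, 100-cycle penalty) are hypothetical values chosen for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical cache: 1-cycle hit time, 5% miss rate, 100-cycle miss penalty.
cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)
print(cycles, "cycles on average")
```

With these assumed numbers the average access costs about 6 cycles, even though 95% of accesses take only 1 cycle, which shows how strongly the miss penalty weighs on AMAT.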
Clickers/Peer Instruction
AMAT = Time for a hit + Miss rate × Miss penalty
• Given a 200 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache hit time of 1 clock cycle, what is AMAT?
A: ≤ 200 psec
B: 400 psec
C: 600 psec
D: 800 psec
Answer: 1 clock cycle + 0.02 × 50 clock cycles = 2 clock cycles, and 2 clock cycles × 200 psec = 400 psec (B)

Ping Pong Cache Example: Direct-Mapped Cache w/ 4 Single-Word Blocks, Worst-Case Reference String
• Consider the main memory address reference string of word numbers: 0 4 0 4 0 4 0 4
• Start with an empty cache: all blocks initially marked as not valid

Access    Result   Cache block 0 afterwards (tag, data)
0         miss     00 Mem(0)
4         miss     01 Mem(4)
0         miss     00 Mem(0)
4         miss     01 Mem(4)
0         miss     00 Mem(0)
4         miss     01 Mem(4)
0         miss     00 Mem(0)
4         miss     01 Mem(4)

• 8 requests, 8 misses
• Ping-pong effect due to conflict misses: two memory locations that map into the same cache block

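The ping-pong behavior can be checked with a small simulation. This is a minimal sketch assuming one-word blocks, so a word number is also a block number; the function name simulate_direct_mapped is invented for this example.

```python
def simulate_direct_mapped(word_addresses, num_blocks):
    """Return a list of "hit"/"miss" results for a direct-mapped cache."""
    valid = [False] * num_blocks   # empty cache: nothing is valid yet
    tags = [0] * num_blocks
    results = []
    for address in word_addresses:
        index = address % num_blocks   # the one place this block can go
        tag = address // num_blocks
        if valid[index] and tags[index] == tag:
            results.append("hit")
        else:
            results.append("miss")
            valid[index] = True        # replace whatever was there
            tags[index] = tag
    return results


results = simulate_direct_mapped([0, 4, 0, 4, 0, 4, 0, 4], num_blocks=4)
print(results.count("miss"), "misses out of", len(results))
```

Words 0 and 4 both map to index 0 in a 4-block cache, so each access evicts the block the next access needs.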
Alternative Block Placement Schemes
• DM placement: mem block 12 in 8-block cache: only one cache block where mem block 12 can be found: (12 modulo 8) = 4
• SA placement: four sets × 2 ways (8 cache blocks), memory block 12 in set (12 mod 4) = 0; either element of the set
• FA placement: mem block 12 can appear in any cache block

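The three placement schemes are the same rule with different associativity. The sketch below lists the cache slots where a memory block may be placed; the function name candidate_slots and the slot numbering (set s occupies slots s × ways through s × ways + ways - 1) are assumptions made for illustration.

```python
def candidate_slots(memory_block, num_cache_blocks, ways):
    """Return (set_number, list of cache slots) where memory_block may be placed."""
    num_sets = num_cache_blocks // ways
    set_number = memory_block % num_sets
    first_slot = set_number * ways          # assumed slot numbering
    return set_number, list(range(first_slot, first_slot + ways))


print(candidate_slots(12, 8, ways=1))  # direct mapped: one candidate slot
print(candidate_slots(12, 8, ways=2))  # 2-way set associative: two slots in set 0
print(candidate_slots(12, 8, ways=8))  # fully associative: any of the 8 slots
```

Direct mapped is the ways = 1 case and fully associative is the ways = number-of-blocks case, matching the DM, SA, and FA bullets above.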
Example: 2-Way Set Associative $ (4 words = 2 sets × 2 ways per set)
[Figure: Cache with Set 0 and Set 1, each with Way 0 and Way 1 holding V, Tag, and Data; Main Memory with sixteen one-word blocks at addresses 0000xx through 1111xx.]
• One-word blocks. Two low-order bits define the byte in the word (32b words)
Q: How do we find it?
Use next 1 low-order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)
Q: Is it there?
Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache

Ping Pong Cache Example: 4-Word 2-Way SA $, Same Reference String
• Consider the main memory word reference string 0 4 0 4 0 4 0 4
• Start with an empty cache: all blocks initially marked as not valid

Access    Result   Set 0 afterwards (tag, data)
0         miss     000 Mem(0)
4         miss     000 Mem(0), 010 Mem(4)
0         hit      000 Mem(0), 010 Mem(4)
4         hit      000 Mem(0), 010 Mem(4)
(the remaining four accesses also hit)

• 8 requests, 2 misses
• Solves the ping-pong effect in a direct-mapped cache due to conflict misses, since now two memory locations that map into the same cache set can co-exist!

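A set-associative version of the simulation shows the two misses. This sketch assumes one-word blocks and LRU replacement within each set; the function name simulate_set_associative is invented for this example, and each set is modeled as a Python list ordered from least to most recently used.

```python
def simulate_set_associative(word_addresses, num_sets, ways):
    """Return a list of "hit"/"miss" results using LRU replacement per set."""
    sets = [[] for _ in range(num_sets)]   # each list: tags, LRU first
    results = []
    for address in word_addresses:
        index = address % num_sets
        tag = address // num_sets
        lines = sets[index]
        if tag in lines:
            results.append("hit")
            lines.remove(tag)              # will be re-appended as most recent
        else:
            results.append("miss")
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used tag
        lines.append(tag)
    return results


results = simulate_set_associative([0, 4, 0, 4, 0, 4, 0, 4], num_sets=2, ways=2)
print(results.count("miss"), "misses out of", len(results))
```

Calling the same function with num_sets=4 and ways=1 models the direct-mapped cache of the same capacity and reproduces its 8 misses.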
Alternative Organizations of an Eight-Block Cache
• Total size of $ in blocks is equal to number of sets × associativity
• For fixed $ size and fixed block size, increasing associativity decreases number of sets while increasing number of elements per set
• With eight blocks, an 8-way set-associative $ is same as a fully associative $

Range of Set-Associative Caches
• For a fixed-size cache and fixed block size, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
Address fields: | Tag | Index | Word offset | Byte offset |
– Tag: used for tag compare
– Index: selects the set
– Word offset: selects the word in the block
• Decreasing associativity (fewer ways, more sets), down to direct mapped (only one way): smaller tags, only a single comparator
• Increasing associativity (more ways, fewer sets), up to fully associative (only one set): tag is all the bits except block and byte offset

Total Cache Capacity = Associativity × # of sets × block_size
Bytes = blocks/set × sets × Bytes/block
C = N × S × B
Address fields: | Tag | Index | Byte Offset |
address_size = tag_size + index_size + offset_size = tag_size + log2(S) + log2(B)
With capacity and block size held fixed:
• Double the Associativity: Number of sets? tag_size? index_size? # comparators?
– Halve the number of sets; tag_size + 1 while index_size - 1; 2× comparators
• Double the Sets: Associativity? tag_size? index_size? # comparators?
– Halve the associativity; tag_size - 1 while index_size + 1; ½× comparators

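The relations C = N × S × B and address_size = tag_size + log2(S) + log2(B) can be turned into a small calculator. This is a sketch under stated assumptions: the function name cache_fields is made up, sizes are powers of two, and addresses are 32 bits unless specified otherwise.

```python
def cache_fields(capacity_bytes, ways, block_bytes, address_bits=32):
    """Return (sets, tag_bits, index_bits, offset_bits) for a cache geometry."""
    sets = capacity_bytes // (ways * block_bytes)   # S = C / (N * B)
    index_bits = sets.bit_length() - 1              # log2(S) for a power of two
    offset_bits = block_bytes.bit_length() - 1      # log2(B) for a power of two
    tag_bits = address_bits - index_bits - offset_bits
    return sets, tag_bits, index_bits, offset_bits


# 4 KB cache with 4-byte blocks, doubling the associativity each time.
for ways in (1, 2, 4):
    print(ways, cache_fields(4096, ways, 4))
```

Each doubling of the associativity halves the number of sets, removes one index bit, and adds one tag bit, as the answers above state.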
Your Turn
• For a cache of 64 blocks, each block four bytes in size:
1. The capacity of the cache is: 256 bytes.
2. Given a 2-way Set Associative organization, there are 32 sets, each of 2 blocks, and 2 places a block from memory could be placed.
3. Given a 4-way Set Associative organization, there are 16 sets, each of 4 blocks, and 4 places a block from memory could be placed.
4. Given an 8-way Set Associative organization, there are 8 sets, each of 8 blocks, and 8 places a block from memory could be placed.

Clicker/Peer Instruction
• For S sets, N ways, B blocks, which statements hold?
(i) The cache has B tags
(ii) The cache needs N comparators
(iii) B = N × S
(iv) Size of Index = log2(S)
A: (i) only
B: (i) and (ii) only
C: (i), (ii), (iii) only
D: All four statements are true
E: None are true

Costs of Set-Associative Caches
• N-way set-associative cache costs
– N comparators (delay and area)
– MUX delay (set selection) before data is available
– Data available after set selection (and Hit/Miss decision). DM $: block is available before the Hit/Miss decision
  • In Set-Associative, not possible to just assume a hit and continue and recover later if it was a miss
• When miss occurs, which way's block selected for replacement?
– Least Recently Used (LRU): one that has been unused the longest (principle of temporal locality)
  • Must track when each way's block was used relative to other blocks in the set
  • For 2-way SA $, one bit per set → set to 1 when a block is referenced; reset the other way's bit (i.e., "last used")

Cache Replacement Policies
• Random Replacement
– Hardware randomly selects a cache entry to evict
• Least-Recently Used
– Hardware keeps track of access history
– Replace the entry that has not been used for the longest time
– For 2-way set-associative cache, need one bit for LRU replacement
• Example of a Simple "Pseudo" LRU Implementation
– Assume 64 Fully Associative entries
– Hardware replacement pointer points to one cache entry
– Whenever access is made to the entry the pointer points to: move the pointer to the next entry
– Otherwise: do not move the pointer
– (example of "not-most-recently used" replacement policy)
[Figure: Replacement Pointer pointing into a column of entries, Entry 0, Entry 1, ..., Entry 63.]

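The replacement-pointer scheme can be sketched in a few lines. The class name ReplacementPointer and its methods are hypothetical, and the wrap-around from the last entry back to entry 0 is an assumption; the slide only says the pointer moves to the next entry.

```python
class ReplacementPointer:
    """Not-most-recently-used replacement for a fully associative cache."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.pointer = 0               # entry that would be replaced next

    def access(self, entry):
        """Record an access; move the pointer only if it points at this entry."""
        if entry == self.pointer:
            # Assumed wrap-around after the last entry.
            self.pointer = (self.pointer + 1) % self.num_entries

    def victim(self):
        """Return the entry to replace on a miss."""
        return self.pointer


policy = ReplacementPointer(num_entries=64)
policy.access(5)          # pointer stays at 0
policy.access(0)          # pointer moves on to 1
print(policy.victim())
```

The victim is therefore never the entry that was accessed most recently, which is all this policy guarantees; it needs one pointer in total rather than per-entry history.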
Benefits of Set-Associative Caches
• Choice of DM $ versus SA $ depends on the cost of a miss versus the cost of implementation
• Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)

And in Conclusion…
• Name of the Game: Reduce AMAT
– Reduce Hit Time
– Reduce Miss Rate
– Reduce Miss Penalty
• Balance cache parameters (Capacity, associativity, block size)