CS 110 Computer Architecture
Caches Part 1
Instructor: Sören Schwertfeger
http://shtech.org/courses/ca/
School of Information Science and Technology (SIST)
ShanghaiTech University
Slides based on UC Berkeley's CS61C
New-School Machine Structures (It’s a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search “Katz”
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: software and hardware layers that harness parallelism and achieve high performance, from warehouse-scale computer and smartphone, to computer (cores, memory (cache), input/output), to core (instruction unit(s), functional unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), cache memory, and logic gates. Callout: “How do we know?”]
Components of a Computer
[Figure: Processor (Control; Datapath with PC, Registers, and Arithmetic & Logic Unit (ALU)) connected to Memory (Bytes; Program and Data) through the Processor-Memory Interface (Enable?, Read/Write, Address, Write Data, Read Data); Input and Output connect through the I/O-Memory Interfaces.]
Problem: Large memories slow? Library Analogy
• Finding a book in a large library takes time
– Takes time to search a large card catalog (mapping title/author to index number)
– Round-trip time to walk to the stacks and retrieve the desired book
• Larger libraries make both delays worse
• Electronic memories have the same issue, plus the technologies that we use to store an individual bit get slower as we increase density (SRAM versus DRAM versus magnetic disk)
However, what we want is a large yet fast memory!
Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU (µProc) improves 60%/year, DRAM improves 7%/year. Processor-Memory Performance Gap: growing 50%/yr.]
• A 1980 microprocessor executes ~one instruction in the same time as a DRAM access
• A 2015 microprocessor executes ~1000 instructions in the same time as a DRAM access
Slow DRAM access could have a disastrous impact on CPU performance!
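The 50%/yr figure follows from the two growth rates on the plot. A minimal sketch of that arithmetic, assuming both rates compound annually (the variable and function names are illustrative, not from the slides):

```python
# Annual improvement factors read off the plot (assumed to compound yearly).
cpu_growth = 1.60   # processor performance: +60% per year
dram_growth = 1.07  # DRAM performance: +7% per year

# The gap is the ratio of the two curves, so it grows by the ratio of the rates.
gap_growth = cpu_growth / dram_growth


def gap_after(years):
    """Relative processor/DRAM gap after `years`, starting from 1.0."""
    return gap_growth ** years


print(f"gap grows about {100 * (gap_growth - 1):.0f}% per year")
print(f"gap after 20 years: about {gap_after(20):.0f}x")
```

Because the gap compounds, even a modest yearly difference turns into orders of magnitude over two decades, which is what motivates the memory hierarchy.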
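Big Idea: Memory Hierarchy
[Figure: levels in the memory hierarchy drawn below the Processor: Level 1 (inner), Level 2, Level 3, ..., Level n (outer). The width of each level shows the size of memory at that level. Increasing distance from the processor means decreasing speed.]
As we move to outer levels, the latency goes up and the price per bit goes down. Why?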
What to do: Library Analogy
• Want to write a report using library books
• Go to library, look up relevant books, fetch from stacks, and place on desk in library
• If need more, check them out and keep on desk
– But don’t return earlier books since might need them
• You hope this collection of ~10 books on desk is enough to write the report, despite 10 being only a tiny fraction of the books available
Real Memory Reference Patterns
[Figure: memory address (one dot per access) versus time.]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Big Idea: Locality
• Temporal Locality (locality in time)
– Go back to same book on desktop multiple times
– If a memory location is referenced, then it will tend to be referenced again soon
• Spatial Locality (locality in space)
– When go to bookshelf, pick up multiple books on J.D. Salinger since library stores related books together
– If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
Memory Reference Patterns
[Figure: memory address (one dot per access) versus time, with regions annotated as Spatial Locality and Temporal Locality.]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Principle of Locality
• Principle of Locality: Programs access a small portion of the address space at any instant of time (spatial locality) and repeatedly access that portion (temporal locality)
• What program structures lead to temporal and spatial locality in instruction accesses?
• In data accesses?
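One way to make the two kinds of locality concrete is to write down the addresses a simple loop touches. The sketch below is an illustration only: the base addresses, the 4-byte word size, and the helper name `loop_trace` are assumptions, not values from the slides.

```python
WORD = 4  # bytes per word (assumption: 32-bit words)


def loop_trace(code_base, body_len, data_base, n):
    """Return (instruction_addrs, data_addrs) for n iterations of a loop whose
    body is `body_len` instructions and reads one array element per iteration."""
    instr, data = [], []
    for i in range(n):
        # Every iteration fetches the same body_len instructions again.
        instr.extend(code_base + k * WORD for k in range(body_len))
        # Every iteration reads the next consecutive array element.
        data.append(data_base + i * WORD)
    return instr, data


instr, data = loop_trace(code_base=0x400000, body_len=4,
                         data_base=0x10010000, n=100)

# Temporal locality: many fetches, very few distinct instruction addresses.
print(len(instr), "fetches,", len(set(instr)), "distinct instruction addresses")

# Spatial locality: each data access is exactly one word past the previous one.
print(all(b - a == WORD for a, b in zip(data, data[1:])))
```

In this toy trace the instruction stream shows temporal locality (the loop body is refetched) and the data stream shows spatial locality (a sequential array walk).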
Memory Reference Patterns
[Figure: address versus time for three access streams: instruction fetches, stack accesses, and data accesses. Annotations: n loop iterations, subroutine call, subroutine return, argument access, scalar accesses.]
Cache Philosophy
• Programmer-invisible hardware mechanism to give illusion of speed of fastest memory with size of largest memory
– Works fine even if programmer has no idea what a cache is
– However, performance-oriented programmers today sometimes “reverse engineer” cache design to design data structures to match cache
Memory Access without Cache
• Load word instruction: lw $t0,0($t1)
• $t1 contains 1022_ten, Memory[1022] = 99
1. Processor issues address 1022_ten to Memory
2. Memory reads word at address 1022_ten (99)
3. Memory sends 99 to Processor
4. Processor loads 99 into register $t0
Adding Cache to Computer
[Figure: Processor (Control; Datapath with PC, Registers, and ALU), Memory (Bytes; Program and Data), Input, and Output, connected by the Processor-Memory Interface (Enable?, Read/Write, Address, Write Data, Read Data) and the I/O-Memory Interfaces, now with a Cache added.]
Memory Access with Cache
• Load word instruction: lw $t0,0($t1)
• $t1 contains 1022_ten, Memory[1022] = 99
• With cache: Processor issues address 1022_ten to Cache
1. Cache checks to see if it has a copy of the data at address 1022_ten
2a. If it finds a match (Hit): cache reads 99, sends to processor
2b. No match (Miss): cache sends address 1022 to Memory
I. Memory reads 99 at address 1022_ten
II. Memory sends 99 to Cache
III. Cache replaces word with new 99
IV. Cache sends 99 to processor
3. Processor loads 99 into register $t0
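The hit and miss paths above can be sketched in a few lines. This is a toy model, not the hardware: it assumes an unbounded cache keyed by the full address (no tags, indices, or eviction yet), and the names `memory`, `cache`, and `load_word` are made up for illustration.

```python
memory = {1022: 99}  # Memory[1022] = 99, as in the example
cache = {}           # address -> cached copy of the word (capacity ignored here)


def load_word(address):
    """Return (value, 'hit' or 'miss') for a load that goes through the cache."""
    if address in cache:              # step 1: does the cache hold a copy?
        return cache[address], "hit"  # step 2a: serve it from the cache
    value = memory[address]           # step 2b: fetch the word from memory
    cache[address] = value            # keep a copy for next time
    return value, "miss"


print(load_word(1022))  # first access goes to memory
print(load_word(1022))  # repeated access is served by the cache
```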
Cache “Tags”
• Need way to tell if have copy of location in memory so that can decide on hit or miss
• On cache miss, put memory address of block in “tag address” of cache block
– 1022 placed in tag next to data from memory (99)
Tag    Data
252    12
1022   99
131    7
2041   20
(The other entries are from earlier instructions.)
Anatomy of a 16 Byte Cache, 4 Byte Block
• Operations:
1. Cache Hit
2. Cache Miss
3. Refill cache from memory
• Cache needs Address Tags to decide if Processor Address is a Cache Hit or Cache Miss
– Compares all 4 tags
[Figure: Processor connected to Cache, and Cache connected to Memory, each by a 32-bit Address and 32-bit Data path. Cache contents:]
Tag    Data
252    12
1022   99
131    7
2041   20
Cache Replacement
• Suppose processor now requests location 511, which contains 11?
• Doesn’t match any cache block, so must “evict” one resident block to make room
– Which block to evict?
• Replace “victim” with new memory block at address 511
Before:
Tag    Data
252    12
1022   99
131    7
2041   20
After (block 131 evicted and replaced by 511):
Tag    Data
252    12
1022   99
511    11
2041   20
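The slide leaves the choice of victim open. As one possible illustration, the sketch below uses a first-in-first-out (FIFO) policy, which is an assumption made here and not a policy named in the lecture; the slide's own figure evicts the block tagged 131, whereas FIFO in this toy setup evicts 252.

```python
from collections import OrderedDict

CAPACITY = 4  # four one-word blocks, matching the 16-byte cache example


def access(cache, memory, address):
    """Look up `address`; on a miss in a full cache, evict the oldest block.

    Returns the evicted address, or None if nothing was evicted.
    FIFO is an assumed policy, chosen only to keep the sketch short.
    """
    if address in cache:
        return None                            # hit: nothing changes
    victim = None
    if len(cache) >= CAPACITY:
        victim, _ = cache.popitem(last=False)  # drop the oldest insertion
    cache[address] = memory[address]           # refill from memory
    return victim


memory = {252: 12, 1022: 99, 131: 7, 2041: 20, 511: 11}
cache = OrderedDict((a, memory[a]) for a in (252, 1022, 131, 2041))

victim = access(cache, memory, 511)
print("evicted", victim, "-> cache now holds", list(cache))
```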
Block Must be Aligned in Memory
• Word blocks are aligned, so binary address of all words in cache always ends in 00_two
• How to take advantage of this to save hardware and energy?
• Don’t need to compare last 2 bits of 32-bit byte address (comparator can be narrower)
=> Don’t need to store last 2 bits of 32-bit byte address in Cache Tag (Tag can be narrower)
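A small sketch of the narrower tag, assuming 4-byte word blocks and 32-bit byte addresses; the helper names `word_tag` and `same_block` and the sample address are illustrative.

```python
def word_tag(byte_address):
    """Tag for a one-word (4-byte) block: the address without its low 2 bits."""
    return byte_address >> 2


def same_block(a, b):
    """Two byte addresses fall in the same word block iff their tags match."""
    return word_tag(a) == word_tag(b)


addr = 0x100003FC            # an aligned word address (hypothetical value)
print(addr & 0b11)           # low two bits of an aligned word address
print(bin(word_tag(addr)))   # what the cache actually needs to store

# A 32-bit byte address needs at most a 30-bit tag once the low bits are dropped.
print(word_tag(2**32 - 1).bit_length())
```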
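Anatomy of a 32B Cache, 8B Block
• Blocks must be aligned in pairs, otherwise could get same word twice in cache
– Tags only have even-numbered words
– Last 3 bits of address always 000_two
– Tags, comparators can be narrower
• Can get hit for either word in block
[Figure: Processor connected to Cache, and Cache connected to Memory, each by a 32-bit Address and 32-bit Data path. The cache holds four two-word blocks with tags 252, 1022, 130, and 2040; the data words shown are 12, -10, 99, 1000, 42, 1947, 7, and 20.]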
Hardware Cost of Cache
• Need to compare every tag to the Processor address
• Comparators are expensive
• Optimization: use 2 “sets” of data with a total of only 2 comparators
• 1 Address bit selects which set
• Compare only tags from selected set
• Generalize to more sets
[Figure: Processor, Cache, and Memory connected by 32-bit Address and 32-bit Data paths. The cache's four Tag/Data entries are divided into Set 0 and Set 1, two entries each.]
Processor Address Fields used by Cache Controller
• Block Offset: Byte address within block
• Set Index: Selects which set
• Tag: Remaining portion of processor address
• Size of Index = log2(number of sets)
• Size of Tag = Address size – Size of Index – log2(number of bytes/block)
Processor Address (32 bits total): | Tag | Set Index | Block offset |
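The field definitions above translate directly into shifts and masks. A minimal sketch, assuming the number of sets and the block size are powers of two; the function name `split_address` is illustrative.

```python
def split_address(address, num_sets, block_bytes):
    """Split a byte address into (tag, set_index, block_offset).

    Assumes num_sets and block_bytes are powers of two, so that
    (x - 1).bit_length() equals log2(x).
    """
    offset_bits = (block_bytes - 1).bit_length()  # log2(bytes per block)
    index_bits = (num_sets - 1).bit_length()      # log2(number of sets)
    block_offset = address & (block_bytes - 1)
    set_index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)   # the remaining upper bits
    return tag, set_index, block_offset


# Two sets of 8-byte blocks: 3 offset bits, 1 index bit, the rest is tag.
tag, set_index, block_offset = split_address(0xFFFFFFFF, num_sets=2, block_bytes=8)
print(hex(tag), set_index, block_offset)
```

With a single set (`num_sets=1`) the index field has zero bits and every upper bit goes to the tag.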
What is limit to number of sets?
• For a given total number of blocks, we can save more comparators if have more than 2 sets
• Limit: As Many Sets as Cache Blocks => only one block per set – only needs one comparator!
• Called “Direct-Mapped” Design
Address: | Tag | Index | Block offset |
Direct Mapped Cache Ex: Mapping a 6-bit Memory Address
• In example, block size is 4 bytes / 1 word
• Memory and cache blocks always the same size, unit of transfer between memory and cache
• # Memory blocks >> # Cache blocks
– 16 Memory blocks = 16 words = 64 bytes => 6 bits to address all bytes
– 4 Cache blocks, 4 bytes (1 word) per block
– 4 Memory blocks map to each cache block
• Memory block to cache block, aka index: middle two bits
• Which memory block is in a given cache block, aka tag: top two bits
Bits 5-4: Tag (Mem Block Within $ Block)
Bits 3-2: Index (Block Within $)
Bits 1-0: Byte Offset (Byte Within Block)
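To see that exactly four memory blocks share each cache block, the 6-bit example can be enumerated. A sketch with the bit positions hard-coded for this example; the names `fields` and `mapping` are illustrative.

```python
BLOCK_BYTES = 4        # one word per block
NUM_MEMORY_BLOCKS = 16


def fields(address):
    """Split a 6-bit byte address into (tag, index, byte_offset)."""
    tag = address >> 4             # bits 5-4
    index = (address >> 2) & 0b11  # bits 3-2
    byte_offset = address & 0b11   # bits 1-0
    return tag, index, byte_offset


# Group the 16 memory blocks by the cache block (index) they map to.
mapping = {}
for block in range(NUM_MEMORY_BLOCKS):
    _, index, _ = fields(block * BLOCK_BYTES)
    mapping.setdefault(index, []).append(block)

for index, blocks in sorted(mapping.items()):
    print(f"cache block {index:02b}: memory blocks {blocks}")
```

The tag is what tells the four candidates for a given cache block apart.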
One More Detail: Valid Bit
• When start a new program, cache does not have valid information for this program
• Need an indicator whether this tag entry is valid for this program
• Add a “valid bit” to the cache tag entry
– 0 => cache miss, even if by chance, address = tag
– 1 => cache hit, if processor address = tag
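The "even if by chance, address = tag" case is easy to reproduce in a toy model. The sketch below is a minimal direct-mapped cache that tracks only valid bits and tags (no data); the class name and default sizes are assumptions for illustration.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that records only valid bits and tags."""

    def __init__(self, num_blocks=4, block_bytes=4):
        self.num_blocks = num_blocks
        self.block_bytes = block_bytes
        self.valid = [False] * num_blocks  # all entries start invalid
        self.tags = [0] * num_blocks       # stale contents, here all zeros

    def access(self, address):
        """Return True on a hit; on a miss, install the new tag."""
        block = address // self.block_bytes
        index = block % self.num_blocks
        tag = block // self.num_blocks
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache()
print(cache.access(0))   # stored tag happens to equal 0, but valid bit is clear
print(cache.access(0))   # now the entry is valid and the tag matches
print(cache.access(16))  # same index, different tag
```

Without the valid bit, the very first access to address 0 would wrongly count as a hit, because the uninitialized tag happens to match.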
Caching: A Simple First Example
[Figure: a four-block cache (Index 00, 01, 10, 11; each entry has Valid, Tag, Data) next to a Main Memory of 16 one-word blocks at addresses 0000xx through 1111xx.]
• One word blocks. Two low order bits (xx) define the byte in the block (32b words)
Q: Where in the cache is the mem block?
Use next 2 low-order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache)
Q: Is the memory block in cache?
Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache (provided valid bit is set)
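Direct-Mapped Cache Example
• One word blocks, cache size = 1K words (or 4 KB)
[Figure: 32-bit address split into a 20-bit Tag (bits 31-12), a 10-bit Index (bits 11-2), and a 2-bit Byte offset (bits 1-0). The Index selects one of 1024 entries (0 to 1023), each with Valid, Tag (20 bits), and Data (32 bits). A Comparator checks the stored tag against the address tag to produce Hit.]
• Valid bit ensures something useful in cache for this index
• Compare Tag with upper part of Address to see if a Hit
• Read data from cache instead of memory if a Hit
What kind of locality are we taking advantage of?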
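The 20/10/2 split in this example can be checked from the cache parameters alone. A sketch assuming 32-bit addresses, 4-byte words, and power-of-two sizes; `field_widths` is a hypothetical helper name.

```python
def field_widths(cache_words, words_per_block, address_bits=32, word_bytes=4):
    """Return (tag_bits, index_bits, offset_bits) for a direct-mapped cache.

    Assumes all sizes are powers of two, so (x - 1).bit_length() == log2(x).
    """
    num_blocks = cache_words // words_per_block
    offset_bits = (words_per_block * word_bytes - 1).bit_length()
    index_bits = (num_blocks - 1).bit_length()
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 1K one-word blocks: 1024 entries, 4 bytes per block.
print(field_widths(cache_words=1024, words_per_block=1))
```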
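Multiword-Block Direct-Mapped Cache
• Four words/block, cache size = 1K words
[Figure: 32-bit address split into a 20-bit Tag (bits 31-12), an 8-bit Index (bits 11-4), a 2-bit Word offset (bits 3-2), and a 2-bit Byte offset (bits 1-0). The Index selects one of 256 entries (0 to 255), each with Valid, Tag (20 bits), and four data words; the Word offset selects the 32-bit Data word. Outputs: Hit and Data.]
What kind of locality are we taking advantage of?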
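A quick experiment hints at the answer. The sketch below counts hits in a direct-mapped cache (tags only, no data) for a single sequential pass over an array, once with one-word blocks and once with four-word blocks. The trace and the helper name `hit_rate` are assumptions for illustration.

```python
def hit_rate(addresses, cache_words=1024, words_per_block=1, word_bytes=4):
    """Fraction of accesses that hit in a direct-mapped cache (tags only)."""
    block_bytes = words_per_block * word_bytes
    num_blocks = cache_words // words_per_block
    tags = [None] * num_blocks  # None plays the role of valid = 0
    hits = 0
    for address in addresses:
        block = address // block_bytes
        index, tag = block % num_blocks, block // num_blocks
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag   # miss: bring the whole block in
    return hits / len(addresses)


scan = [4 * i for i in range(4096)]  # read 4096 consecutive words, once each

print(hit_rate(scan, words_per_block=1))  # 0.0: no word is ever reused
print(hit_rate(scan, words_per_block=4))  # 0.75: 3 of every 4 words ride along
```

In this trace nothing is accessed twice, so one-word blocks cannot help; the four-word blocks hit only because neighbouring words arrive with each miss.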
Cache Names for Each Organization
• “Fully Associative”: Block can go anywhere
– First design in lecture
– Note: No Index field, but 1 comparator/block
• “Direct Mapped”: Block goes one place
– Note: Only 1 comparator
– Number of sets = number of blocks
• “N-way Set Associative”: N places for a block
– Number of sets = number of blocks / N
– N comparators
– Fully Associative: N = number of blocks
– Direct Mapped: N = 1
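The three names are points on one scale, which a short table makes visible. A sketch for a hypothetical 8-block cache; the function name `organization` is illustrative.

```python
def organization(num_blocks, ways):
    """Sets and comparators for an N-way set-associative cache."""
    if num_blocks % ways != 0:
        raise ValueError("number of blocks must be a multiple of the ways")
    return {"sets": num_blocks // ways, "comparators": ways}


NUM_BLOCKS = 8  # assumed cache size in blocks, for illustration only
for ways in (1, 2, 4, 8):
    info = organization(NUM_BLOCKS, ways)
    if ways == 1:
        name = "direct mapped"
    elif ways == NUM_BLOCKS:
        name = "fully associative"
    else:
        name = f"{ways}-way set associative"
    print(f"{name}: {info['sets']} sets, {info['comparators']} comparators")
```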
Range of Set-Associative Caches
• For a fixed-size cache and a given block size, each increase by a factor of 2 in associativity doubles the number of blocks per set (i.e., the number of “ways”) and halves the number of sets
• This decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
Address: | Tag | Index | Block offset |
[Figure: with more associativity (more ways), the Tag grows and the Index shrinks.]
What if we can also change the block size?
Question
• For a cache with constant total capacity, if we increase the number of ways by a factor of 2, which statement is false:
A: The number of sets could be doubled
B: The tag width could decrease
C: The block size could stay the same
D: The block size could be halved
E: Tag width must increase
Total Cache Capacity =
Associativity * # of sets * block_size
Bytes = blocks/set * sets * Bytes/block
C = N * S * B
Address: | Tag | Index | Byte Offset |
address_size = tag_size + index_size + offset_size = tag_size + log2(S) + log2(B)
Clicker Question: C remains constant, S and/or B can change such that C = 2N * (SB)’ => (SB)’ = SB/2
tag_size = address_size – (log2(S) + log2(B)) = address_size – log2(SB)
After doubling N: tag_size’ = address_size – log2((SB)’) = address_size – (log2(SB) – 1) = tag_size + 1
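The same bookkeeping can be checked numerically. A sketch using C = N * S * B with an assumed 4 KB cache and 32-bit addresses; the function name `tag_bits` and the chosen sizes are illustrative.

```python
def tag_bits(capacity_bytes, ways, block_bytes, address_bits=32):
    """Tag width for a cache with C = N * S * B, all sizes powers of two."""
    sets = capacity_bytes // (ways * block_bytes)  # S = C / (N * B)
    index_bits = (sets - 1).bit_length()           # log2(S)
    offset_bits = (block_bytes - 1).bit_length()   # log2(B)
    return address_bits - index_bits - offset_bits


C = 4096  # assumed total capacity in bytes

print(tag_bits(C, ways=1, block_bytes=16))  # baseline
print(tag_bits(C, ways=2, block_bytes=16))  # double N, halve S, keep B
print(tag_bits(C, ways=2, block_bytes=8))   # double N, keep S, halve B
```

Whichever of S or B absorbs the factor of two, log2(SB) drops by one, so the tag grows by one bit relative to the baseline.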
Typical Memory Hierarchy
[Figure: On-Chip Components (Control; Datapath; RegFile; Instr Cache; Data Cache), then Second-Level Cache (SRAM), Third-Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk or Flash).]
Speed (cycles): ½’s, 1’s, 10’s, 100’s, 1,000,000’s
Size (bytes): 100’s, 10K’s, M’s, G’s, T’s
Cost/bit: highest to lowest
• Principle of locality + memory hierarchy presents programmer with ≈ as much memory as is available in the cheapest technology at the ≈ speed offered by the fastest technology
In the news: Intel 3D XPoint
• 375 GB (2nd half 2017: 1.5 TB)
• In 2015 announced as “1000 times faster than SSD”
• 500,000 IOPS (very good value compared to SSD)
• Very low latency (40 times faster than SSD)
• For desktops: 16 and 32 GB (44 and 80 USD)
• Transparently integrates into the memory subsystem and makes the SSD appear like DRAM to the OS and applications
• Up to 8x memory extension
• Low latency and ultra-high endurance