Lecture 10: Memory Hierarchy -- Memory Technology and Principle of Locality
CSCE513 Computer Architecture
Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
https://passlab.github.io/CSCE513
Topics for Memory Hierarchy
• Memory Technology and Principle of Locality
  – CAQA: 2.1, 2.2, B.1
  – COD: 5.1, 5.2
• Cache Organization and Performance
  – CAQA: B.1, B.2
  – COD: 5.2, 5.3
• Cache Optimization
  – 6 Basic Cache Optimization Techniques (CAQA: B.3)
  – 10 Advanced Optimization Techniques (CAQA: 2.3)
• Virtual Memory and Virtual Machine
  – CAQA: B.4, 2.4; COD: 5.6, 5.7
  – Skip for this course
The Big Picture: Where are We Now?
• Memory system
  – Supplying data on time for computation (speed)
  – Large enough to hold everything needed (capacity)
[Figure: Processor (Control, Datapath) connected to Memory, Input, and Output]
Overview
• Programmers want unlimited amounts of memory with low latency
• Fast memory technology is more expensive per bit than slower memory
• Solution: organize the memory system into a hierarchy
  – Entire addressable memory space available in the largest, slowest memory
  – Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
• Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  – Gives the illusion of a large, fast memory being presented to the processor
Memory Technology
• Random Access: access time is the same for all locations
• DRAM: Dynamic Random Access Memory
  – High density, low power, cheap, slow
  – Dynamic: needs to be "refreshed" regularly
  – 50 ns – 70 ns, $20 – $75 per GB
• SRAM: Static Random Access Memory
  – Low density, high power, expensive, fast
  – Static: content will last "forever" (until losing power)
  – 0.5 ns – 2.5 ns, $2000 – $5000 per GB
• Magnetic disk
  – 5 ms – 20 ms, $0.20 – $2 per GB
• Ideal memory:
  – Access time of SRAM
  – Capacity and cost/GB of disk
Static RAM (SRAM): 6-Transistor Cell – 1 Bit
• Write:
  1. Drive bit lines (bit = 1, bit' = 0)
  2. Select row
• Read:
  1. Precharge bit and bit' to Vdd or Vdd/2 => make sure equal!
  2. Select row
  3. Cell pulls one line low
  4. Sense amp on column detects difference between bit and bit'
[Figure: 6-transistor SRAM cell – bit and bit' column lines, word (row select) line; some transistors replaced with pullups to save area]
Dynamic RAM (DRAM): 1-Transistor Memory Cell
• Write:
  1. Drive bit line
  2. Select row
• Read:
  1. Precharge bit line to Vdd
  2. Select row
  3. Cell and bit line share charges
     • Very small voltage changes on the bit line
  4. Sense (fancy sense amp)
     • Can detect changes of ~1 million electrons
  5. Write: restore the value
• Refresh
  1. Just do a dummy read to every cell
[Figure: 1-transistor DRAM cell – row select line and bit line]
Performance: Latency and Bandwidth
• Performance of Main Memory:
  – Latency: Cache Miss Penalty
    • Access Time: time between request and word arrives
    • Cycle Time: time between requests
  – Bandwidth: I/O & Large Block Miss Penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  – Needs to be refreshed periodically (8 ms)
  – Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS or Row Access Strobe and CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor)
• Size: DRAM/SRAM is 4-8; Cost/Cycle time: SRAM/DRAM is 8-16
Flash Memory
• Type of EEPROM
• Types: NAND (denser) and NOR (faster)
• NAND Flash:
  – Reads are sequential, reads entire page (0.5 to 4 KiB)
  – 25 us for first byte, 40 MiB/s for subsequent bytes
  – SDRAM: 40 ns for first byte, 4.8 GB/s for subsequent bytes
  – 2 KiB transfer: 75 us vs. 500 ns for SDRAM, 150X slower
  – 300 to 500X faster than magnetic disk
NAND Flash Memory
• Must be erased (in blocks) before being overwritten
• Nonvolatile, can use as little as zero power
• Limited number of write cycles (~100,000)
• $2/GiB, compared to $20–40/GiB for SDRAM and $0.09/GiB for magnetic disk
• Phase-Change/Memristor Memory
  – Possibly 10X improvement in write performance and 2X improvement in read performance
CPU-Memory Performance Gap: Latency
• CPU-DRAM memory latency gap → Memory Wall
[Figure: processor-memory performance gap over time, growing 50% / year]
CPU-Memory Performance Gap: Bandwidth
• Memory hierarchy design becomes more crucial with recent multi-core processors
• Aggregate peak bandwidth grows with # cores:
  – Intel Core i7 can generate two references per core per clock
  – Four cores and 3.2 GHz clock
    • 25.6 billion 64-bit data references/second +
    • 12.8 billion 128-bit instruction references/second
    • = 409.6 GB/s!
  – DRAM bandwidth is only 8% of this (34.1 GB/s)
• Requires:
  – Multi-port, pipelined caches
  – Two levels of cache per core
  – Shared third-level cache on chip
Memory Hierarchy
• Keep the most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data
[Figure: memory hierarchy – Processor (Control, Datapath, Registers) → On-Chip Cache → 2nd/3rd Level Cache (SRAM) → Main Memory (DRAM) → Secondary Storage (Disk) → Tertiary Storage (Tape); speed from ~1 ns through 10s–100s ns up to 10s of ms and 10s of sec; size from 100s of bytes through Ks, Ms, Gs, up to Ts]
Why Hierarchy Works
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address space, 0 to 2^n - 1]
The Principle of Locality
• Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
• Spatial locality: Items with nearby addresses tend to be referenced close together in time
• Temporal locality: Recently referenced items are likely to be referenced in the near future
• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
Memory Hierarchy of a Computer System
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.
[Figure: the memory hierarchy diagram again – Processor (Control, Datapath, Registers), On-Chip Cache, 2nd/3rd Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Tape), with the same speeds and sizes]
Vector/Matrix and Array in C
• C has row-major storage for multi-dimensional arrays
  – A[2][2] is followed by A[2][3]
• 3-dimensional array
  – B[3][100][100]
int A[4][4]
• Stepping through columns in one row:
  for (i = 0; i < 4; i++) sum += A[0][i];
  accesses successive elements
• Stepping through rows in one column:
  for (i = 0; i < 4; i++) sum += A[i][0];
  stride-4 access
Locality Example
• Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer
• Question: Does this function have good locality?

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

✅
Locality Example
• Question: Does this function have good locality?

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

❌
Locality Example
• Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
Review: Memory Technology and Hierarchy
• Technology challenge: Memory Wall
• Program behavior: Principle of Locality
  [Figure: probability of reference vs. address space, 0 to 2^n - 1]
• Architecture approach: Memory Hierarchy
• Your code: exploit locality, and work with the memory storage type

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Memory Hierarchy
• Capacity: Register << SRAM << DRAM
• Latency: Register << SRAM << DRAM
• Bandwidth: on-chip >> off-chip

On a data access:
  if data ∈ fast memory ⇒ low-latency access (SRAM)
  if data ∉ fast memory ⇒ high-latency access (DRAM)
History: Further Back

    "Ideally one would desire an indefinitely large memory capacity such that any particular ... word would be immediately available. ... We are ... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
    A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, 1946
Next: Memory Hierarchy - Cache Organization
• Keep the most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data
[Figure: the memory hierarchy diagram again – Processor/Registers, On-Chip Cache, 2nd/3rd Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Tape), with the same speeds and sizes]
Sources of Locality
• Temporal locality
  – Code within a loop
  – Same instructions fetched repeatedly
• Spatial locality
  – Data arrays
  – Local variables in stack
  – Data allocated in chunks (contiguous bytes)

for (i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * a;
}
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25%

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100%
Writing Cache Friendly Code
• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples:
  – cold cache, 4-byte words, 4-word cache blocks
Matrix Multiplication Example
• Major cache effects to consider
  – Total cache size
    • Exploit temporal locality (and blocking)
  – Block size
    • Exploit spatial locality
• Description:
  – Multiply N x N matrices
  – O(N^3) total operations
  – Accesses
    • N reads per source element
    • N values summed per destination
      – but may be able to hold in register

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Variable sum held in register
Miss Rate Analysis for Matrix Multiply
• Assume:
  – Cache line size = 32 bytes (big enough for four 64-bit words)
  – Matrix dimension (N) is very large
    • Approximate 1/N as 0.0
  – Cache is not even big enough to hold multiple rows
• Analysis method:
  – Look at access pattern of inner loop
Summary of Matrix Multiplication

ijk (& jik):
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
• 2 loads, 0 stores
• misses/iter = 1.25

kij (& ikj):
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
• 2 loads, 1 store
• misses/iter = 0.5

jki (& kji):
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
• 2 loads, 1 store
• misses/iter = 2.0