Lecture 10: Memory Hierarchy -- Memory Technology and Principle of Locality
CSCE513 Computer Architecture
Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
https://passlab.github.io/CSCE513
Topics for Memory Hierarchy
• Memory Technology and Principle of Locality
  – CAQA: 2.1, 2.2, B.1
  – COD: 5.1, 5.2
• Cache Organization and Performance
  – CAQA: B.1, B.2
  – COD: 5.2, 5.3
• Cache Optimization
  – 6 Basic Cache Optimization Techniques (CAQA: B.3)
  – 10 Advanced Optimization Techniques (CAQA: 2.3)
• Virtual Memory and Virtual Machine
  – CAQA: B.4, 2.4; COD: 5.6, 5.7
  – Skip for this course
The Big Picture: Where are We Now?
• Memory system
  – Supplying data on time for computation (speed)
  – Large enough to hold everything needed (capacity)
[Figure: Processor (Control, Datapath) connected to Memory, Input, and Output]
Overview
• Programmers want unlimited amounts of memory with low latency
• Fast memory technology is more expensive per bit than slower memory
• Solution: organize the memory system into a hierarchy
  – Entire addressable memory space available in the largest, slowest memory
  – Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
• Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  – Gives the illusion of a large, fast memory being presented to the processor
Memory Technology
• Random Access: access time is the same for all locations
• DRAM: Dynamic Random Access Memory
  – High density, low power, cheap, slow
  – Dynamic: needs to be "refreshed" regularly
  – 50 ns – 70 ns, $20 – $75 per GB
• SRAM: Static Random Access Memory
  – Low density, high power, expensive, fast
  – Static: content will last "forever" (until losing power)
  – 0.5 ns – 2.5 ns, $2000 – $5000 per GB
• Magnetic disk
  – 5 ms – 20 ms, $0.20 – $2 per GB
• Ideal memory:
  – Access time of SRAM
  – Capacity and cost/GB of disk
Static RAM (SRAM): 6-Transistor Cell – 1 Bit
• Write:
  1. Drive bit lines (bit = 1, bit' = 0)
  2. Select row
• Read:
  1. Precharge bit and bit' to Vdd or Vdd/2 => make sure equal!
  2. Select row
  3. Cell pulls one line low
  4. Sense amp on column detects difference between bit and bit'
[Figure: 6-transistor SRAM cell – bit and bit' column lines, word (row select) line; some transistors replaced with pullups to save area]
Dynamic RAM (DRAM): 1-Transistor Memory Cell
• Write:
  1. Drive bit line
  2. Select row
• Read:
  1. Precharge bit line to Vdd
  2. Select row
  3. Cell and bit line share charges
     • Very small voltage changes on the bit line
  4. Sense (fancy sense amp)
     • Can detect changes of ~1 million electrons
  5. Write: restore the value
• Refresh
  1. Just do a dummy read to every cell
[Figure: 1-transistor DRAM cell – row select line and bit line]
Performance: Latency and Bandwidth
• Performance of Main Memory:
  – Latency: Cache Miss Penalty
    • Access Time: time between request and word arrives
    • Cycle Time: time between requests
  – Bandwidth: I/O & Large Block Miss Penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  – Needs to be refreshed periodically (8 ms)
  – Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS or Row Access Strobe and CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor)
• Size: DRAM/SRAM is 4-8; Cost/Cycle time: SRAM/DRAM is 8-16
Flash Memory
• Type of EEPROM
• Types: NAND (denser) and NOR (faster)
• NAND Flash:
  – Reads are sequential, reads entire page (0.5 to 4 KiB)
  – 25 us for first byte, 40 MiB/s for subsequent bytes
  – SDRAM: 40 ns for first byte, 4.8 GB/s for subsequent bytes
  – 2 KiB transfer: 75 us vs. 500 ns for SDRAM, 150X slower
  – 300 to 500X faster than magnetic disk
NAND Flash Memory
• Must be erased (in blocks) before being overwritten
• Nonvolatile, can use as little as zero power
• Limited number of write cycles (~100,000)
• $2/GiB, compared to $20–40/GiB for SDRAM and $0.09/GiB for magnetic disk
• Phase-Change/Memristor Memory
  – Possibly 10X improvement in write performance and 2X improvement in read performance
CPU-Memory Performance Gap: Latency
• CPU-DRAM memory latency gap → Memory Wall
[Figure: processor-memory performance gap over time, growing 50% / year]
CPU-Memory Performance Gap: Bandwidth
• Memory hierarchy design becomes more crucial with recent multi-core processors
• Aggregate peak bandwidth grows with # cores:
  – Intel Core i7 can generate two references per core per clock
  – Four cores and 3.2 GHz clock
    • 25.6 billion 64-bit data references/second +
    • 12.8 billion 128-bit instruction references/second
    • = 409.6 GB/s!
  – DRAM bandwidth is only 8% of this (34.1 GB/s)
• Requires:
  – Multi-port, pipelined caches
  – Two levels of cache per core
  – Shared third-level cache on chip
Memory Hierarchy
• Keep the most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data
[Figure: memory hierarchy – Processor (Control, Datapath, Registers) → On-Chip Cache → 2nd/3rd Level Cache (SRAM) → Main Memory (DRAM) → Secondary Storage (Disk) → Tertiary Storage (Tape); speed from ~1 ns through 10s–100s ns up to 10s of ms and 10s of sec; size from 100s of bytes through Ks, Ms, Gs, up to Ts]
Why Hierarchy Works
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address space, 0 to 2^n - 1]
The Principle of Locality
• Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
• Spatial locality: Items with nearby addresses tend to be referenced close together in time
• Temporal locality: Recently referenced items are likely to be referenced in the near future
• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
Memory Hierarchy of a Computer System
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.
[Figure: the memory hierarchy diagram again – Processor (Control, Datapath, Registers), On-Chip Cache, 2nd/3rd Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Tape), with the same speeds and sizes]
Vector/Matrix and Array in C
• C has row-major storage for multi-dimensional arrays
  – A[2][2] is followed by A[2][3]
• 3-dimensional array
  – B[3][100][100]
int A[4][4]
• Stepping through columns in one row:
  for (i = 0; i < 4; i++) sum += A[0][i];
  accesses successive elements
• Stepping through rows in one column:
  for (i = 0; i < 4; i++) sum += A[i][0];
  stride-4 access
Locality Example
• Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer
• Question: Does this function have good locality?

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

✅
Locality Example
• Question: Does this function have good locality?

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

❌
Locality Example
• Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
Review: Memory Technology and Hierarchy
• Technology challenge: Memory Wall
• Program behavior: Principle of Locality
  [Figure: probability of reference vs. address space, 0 to 2^n - 1]
• Architecture approach: Memory Hierarchy
• Your code: exploit locality, and work with the memory storage type

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Memory Hierarchy
• Capacity: Register << SRAM << DRAM
• Latency: Register << SRAM << DRAM
• Bandwidth: on-chip >> off-chip

On a data access:
  if data ∈ fast memory ⇒ low-latency access (SRAM)
  if data ∉ fast memory ⇒ high-latency access (DRAM)
History: Further Back

    "Ideally one would desire an indefinitely large memory capacity such that any particular ... word would be immediately available. ... We are ... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
    A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, 1946
Next: Memory Hierarchy - Cache Organization
• Keep the most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data
[Figure: the memory hierarchy diagram again – Processor/Registers, On-Chip Cache, 2nd/3rd Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Tape), with the same speeds and sizes]
Sources of Locality
• Temporal locality
  – Code within a loop
  – Same instructions fetched repeatedly
• Spatial locality
  – Data arrays
  – Local variables in stack
  – Data allocated in chunks (contiguous bytes)

for (i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * a;
}
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25%

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100%
Writing Cache Friendly Code
• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples:
  – cold cache, 4-byte words, 4-word cache blocks
Matrix Multiplication Example
• Major cache effects to consider
  – Total cache size
    • Exploit temporal locality (and blocking)
  – Block size
    • Exploit spatial locality
• Description:
  – Multiply N x N matrices
  – O(N^3) total operations
  – Accesses
    • N reads per source element
    • N values summed per destination
      – but may be able to hold in register

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Variable sum held in register
Miss Rate Analysis for Matrix Multiply
• Assume:
  – Cache line size = 32 bytes (big enough for four 64-bit words)
  – Matrix dimension (N) is very large
    • Approximate 1/N as 0.0
  – Cache is not even big enough to hold multiple rows
• Analysis method:
  – Look at access pattern of inner loop
Summary of Matrix Multiplication

ijk (& jik):
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
• 2 loads, 0 stores
• misses/iter = 1.25

kij (& ikj):
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
• 2 loads, 1 store
• misses/iter = 0.5

jki (& kji):
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
• 2 loads, 1 store
• misses/iter = 2.0