Goal for this course AVDARK on the web - Uppsala … computer science lab 1985-1988 ÎNetInsight, ... ≈30 APZ 212 @ 5MH30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer

Welcome to AVDARK

E ik H tErik HagerstenUppsala Universitypp y

Ericsson, CPU designer 1982-84: APZ212MIT 1984-85: Dataflow parallel architectureEricsson computer science lab 1985-1988 NetInsight, (Erlang)SICS, parallel architectures 1988-1993 COMA, (Virtutech)S h f h 993 999Sun Microsystems, chief architect servers 1993 – 1999

WildFire, E6000, E15000, E25000Professor Uppsala University 1999 – New modeling + Acumemo o Upp a a U y od g uStartup Acumem 2006 – 2010 ThreadSpotterChief scientist Rogue Wave Software 2011 –

AVDARK

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh2

AVDARK2011

2

Goal for this courseUnderstand how and why modern computer systems are designed the way the are:

pipelinesmemory organizationvirtual/physical memory ...

Understand how and why multiprocessors are builtCache coherenceMemory modelsSynchronization…

Understand how and why parallelism is created andInstruction-level parallelismMemory-level parallelismThread-level parallelism…

Understand how and why multiprocessors of combined SIMD/MIMD type are builtGPUVector processing…

Understand how computer systems are adopted to different usage areas

AVDARK

General-purpose processorsEmbedded/network processors…

Understand the physical limitation of modern computers


AVDARK2011

BandwidthEnergyCooling…

AVDARK on the webAVDARK on the web

www it uu se/edu/course/homepage/avdark/ht11www.it.uu.se/edu/course/homepage/avdark/ht11

Welcome!Welcome!NewsFAQQScheduleSlidesN PNew PapersAssignmentsReading instr 4:ed

AVDARK

Reading instr 4:edExam


AVDARK2011

AVDARK in a nutshellLiterature Computer Architecture A Quantitative Approach (4th edition)

John Hennesey & David Pattersson

Lecturer Erik Hagersten gives most lectures and is responsible for the courseAndreas Sandberg is responsible for the labs and the hand-ins.Jakob Carlström from Xelerated will teach network processorsJakob Carlström from Xelerated will teach network processors.Sverker Holmgren will teach parallel programming.David Black-Schaffer will teach about graphics processors.

Mandatory Assignment

There are four lab assignments that all participants have to complete before a hard deadline. Each can earn you a bonus point

Optional Assignment

There are four (optional) hand-in assignments. Each can earn you a bonus point

AVDARK

Examination Written exam at the end of the course. No books are allowed.

Bonus system 64p max/32p to pass For each bonus point there is a


AVDARK2011

Bonus system 64p max/32p to pass. For each bonus point, there is a corresponding question 4p bonus question. Full bonus Pass.

Schedule in a nutshellSchedule in a nutshell1. Memory Systems (~Appendix C in 4th Ed)

Caches, VM, DRAM, microbenchmarks, optimizing SW

2. Multiprocessors2. MultiprocessorsTLP: coherence, memory models, interconnects, scalability, clusters, …

3 Scalable Multiprocessors3. Scalable MultiprocessorsScalability, synchornization, clusters, …

4. CPUsILP: pipelines, scheduling, superscalars, VLIWs, Vecotor instructions…

AVDARK

5. Widening + Future (~Chapter 1 in 4th Ed)Technology impact, GPUs, Network processors, Multicores (!!)


AVDARK2011

Exam and bonus structureExam and bonus structure4 Mandatory labs4 Mandatory labs4 Hand-in (optional)Written Exam

How to get a bonus point:Complete extra bonus activity at lab occationComplete extra bonus activity at lab occationComplete optional bonus hand-in [with a reasonable accuracy] befor a hard dealine

AVDARK

reasonable accuracy] befor a hard dealine


AVDARK2011 32p/64p at the exam = PASS

Crash Course inCrash Course inComputer ArchitectureComputer Architecture

(covering the course in 45 min)

E ik H tErik HagerstenUppsala Universitypp y

≈30 APZ 212 @ 5MH≈30 years ago: APZ 212 @ 5MHz”the AXE supercomputer”

AVDARK


AVDARK2011

APZ 212APZ 212marketing brochure quotes:

”Very compact”6 times the performancep1/6:th the size1/5 the power consumption

”A b kth h i t i ””A breakthrough in computer science””Why more CPU power?””All the power needed for future development”All the power needed for future development”…800,000 BHCA, should that ever be needed””SPC computer science at its most elegance”

AVDARK

SPC computer science at its most elegance”Using 64 kbit memory chips””1500W power consumption


AVDARK2011

1500W power consumption

CPU ImprovementspRelative Performance[log scale]

1000

100

1010

AVDARK

1970 1980 1990 2000 Year

1


AVDARK2011

How to get efficient architectures…g

Higher clock rateHigher clock rateCreate and explore locality:

) S ti l l lit a) Spatial locality b) Temporal locality

G hi l l litc) Geographical locality

Create and explore parallelismp pa) Instruction level parallelism (ILP)b) Thread level parallelism (TLP)

AVDARK

) p ( )c) Memory level parallelism (MLP)


AVDARK2011

Load/Store architecture (e.g., ”RISC”)Load/Store architecture (e.g., RISC )ALU ops: Reg -->RegMem ops: Reg <--> MemMem ops: Reg < > Mem

ExplicitRegisters

Mem

RegistersLoad/Store

Three Regs:Source1

ALU

Source1Source2DestinationLoad R1, [A]

AVDARK

MemoryAccesses

Example: C = A + B, [ ]

Load R3, [B]Add R2, R1, R3S R2 [C]

Compiler


AVDARK2011

AccessesStore R2, [C]

Lifting the CPU hood (simplified…)Lifting the CPU hood (simplified…)

DInstructions:

DCBAA

CPU

Mem

AVDARK


AVDARK2011

PipelinePipeline

DInstructions:

DCBAA

I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

PipelinePipeline

AA

I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

PipelinePipeline

AA

I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

PipelinePipeline

AA

I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

Pipeline:Pipeline:

AA

I R X WRegs

I = Instruction fetchR = Read registerX = ExecuteW= Write register/mem

Mem

Regs W= Write register/mem

AVDARK


AVDARK2011

Example ALU operation:

R1 := R2 op R3(ADD R1, R2, R3)

AA OPe.g., +, -, *, /Ifetch

I R X WRegs

23 1

I = Instruction fetchR = Read registerX = ExecuteW= Write register/mem

Mem

Regs W= Write register/mem

AVDARK


AVDARK2011

InitiallyInitially

D IF RegC < 100 GOTO ADCBA LD RegA (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

A LD RegA, (100 + RegC)

PC

I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

Cycle 1Cycle 1



RegB := RegA + 1

RegC := RegC + 1

PC A LD RegA, (100 + RegC)PC

I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

Cycle 2Cycle 2



RegB := RegA + 1

RegC := RegC + 1

PCA LD RegA, (100 + RegC)

I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

Cycle 3Cycle 3



RegB := RegA + 1

RegC := RegC + 1 PC


+

I R X WRegs

+

Mem

Regs

AVDARK


AVDARK2011

Cycle 4Cycle 4

D IF RegC < 100 GOTO APC DCBA LD RegA (100 + RegC)


RegB := RegA + 1

RegC := RegC + 1 PC


+I R X WRegs

+

Mem

Regs

AVDARK


AVDARK2011

Cycle 5Cycle 5



RegB := RegA + 1

RegC := RegC + 1

PC A LD RegA, (100 + RegC)PC

I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

Data dependency Previous execution example wrong! Previous execution example wrong!



RegB := RegA + 1

RegC := RegC + 1


I R X WRegs

Mem

Regs

AVDARK


AVDARK2011

Data dependency fix: pipeline delays

D IF RegC < 100 GOTO A

”Stall”

CB R B R A 1

RegC := RegC + 1

”Stall”

PC B RegB := RegA + 1

”Stall”

”Stall”

PC

A

I R X W

LD RegA, (100 + RegC)

AVDARK

I R X WRegs


AVDARK2011 Mem

Branch Delay: Each Iteration takes 11c

”Stall”

”Stall”PC

D IF RegC < 100 GOTO A

”Stall”

”Stall”

CB R B R A 1

RegC := RegC + 1

”Stall”

B RegB := RegA + 1

”Stall”

”Stall”

A

I R X W

LD RegA, (100 + RegC)

Need to find Instruction-

AVDARK

I R X WRegs PC := next instruction address

Need to find InstructionLevel Parallelism (ILP) to avoid stalls!


AVDARK2011 Mem

It is actually a lot worse!yModern CPUs:”superscalars” with ~4 parallel pipelines

I R X W

I R X W

I R X W

Issue

I R X W

I R X Wlogic

Mem

Regs

N d t fi d 4 ILPMem

+Higher throughputMore complicated architecture

Need to find 4x more ILP

AVDARK

- More complicated architecture- Branch delay more expensive (more instr. missed)- Harder to find ”enough” independent instr. (need 8


AVDARK2011

Harder to find enough independent instr. (need 8 instr. between write and use)

It is actually a lot worse:Modern CPUs: ~10-20 pilepline stagesModern CPUs: ~10-20 pilepline stages

I R B M MWI

I R B M MW

I R B M MW

I

IIssuelogic

I R

I R B M MW

I

Ilogic

M

RegsMore pipeline stages can run at higher freq ☺

MemBut will need to find ~10x more ILP

AVDARK

+Shorter cycletime (higher GHz)- Branch delay even more expensive


AVDARK2011 - Even harder to find ”enough” independent instr.

Fix: Out-of order execution:Improving ILPImproving ILP

LD R1, M(100)ADD R3, R2, R1

The HW may execute instructions ina different order, but will make the ”side-effects” of the instructions appear in order.

SUB R5, R6, R7ST R5, M(100)

instructions appear in order.

Assume that LD takes a long time.The ADD is dependent on the LD Start the SUB and ST before the ADDUpdate R5 and M(100) after R3

>=0?YN

Update R5 and M(100) after R3

LD ...ADD ...SUB ...ST

AVDARK

ST ...


AVDARK2011

Fix: Branch prediction

LD R1, M(100)ADD R3, R2, R1

The HW can guess if the branch istaken or not and avoid branch stalls

SUB R5, R6, R7ST R5, M(100)

if the guess is correct,

Assume the guess is ”Y”.

>=0?YN

The HW can start to execute theseinstruction before the outcome the the branch is known, but cannot allow any ”side-effect” t t k l til th t i kto take place until the outcome is known

LD ...ADD ...SUB ...ST

AVDARK

ST ...


AVDARK2011

Fix: Scheduling Past BranchesImproving ILPImproving ILP

LDADD

P edicted pathSUBST

”Predict taken”All instructions along thepredicted path can be

Predicted path

>=0?LDADDSUBST

YPredict taken predicted path can be

executedout-of-order

>1?

Y

”Predict taken”

LDADDSUB

LDADDSUBST

AVDARK

SUSTST

Y”Predict taken”


AVDARK2011 <2?=0?

Fix: Scheduling Past BranchesImproving ILP

LDADD

Act al path!

Improving ILP

SUBST

Actual path!

>=0?LDADDSUBST

Y

>1?

YWrong Prediction!!!

LDADDSUB

LDADDSUBST

Throw awayi.e., no sideeffects!

AVDARK

SUSTST

Y

effects!


AVDARK2011 <2?=0?

It is actually a lot worse:Modern MEM: ~200 CPU cyclesModern MEM: ~200 CPU cycles

I R B M MWI

I R B M MW

I R B M MW

I

IIssuelogic

I R

I R B M MW

I

Ilogic

Regs

200 cycles Need more ILP again!

AVDARK


AVDARK2011 Mem

Fix: Use cache(s)Fix: Use cache(s)

I R B M MWI

I R B M MW

I R B M MW

I

IIssuelogic

I R B M MW

I R B M MW

I

Ilogic

RegsStill need some more ILP, but also

<< 1GB$1-30 cycles

,Memory-Level Parallelism (MLP)

AVDARK 200 l 1GB

$


AVDARK2011 Mem200cycles 1GB

Woops, using too much power 2007

Running at 2x the frequency will use h th 2 th much more than 2x the power

It is also really hard to find enough ILP

AVDARK


AVDARK2011

Fix: Multicore

MMem

Mem I/F

ExternalI/F

But now we also need to find Thread-Level Parallelism (TLP)

$1 $1 $1 $1

L2$

AVDARK

CPU CPU CPU CPU


AVDARK2011 t treads

Example: Intel i7 ”Nehalem”Example: Intel i7 Nehalem

Coherence Coherence

DRAM

AVDARK


AVDARK2011

How is the silicon used (i7-Ex)?

3 Mbyte

AVDARK

Tag


AVDARK2011

Data array

Source: JSSC Jan 2010, Rusu et. al

How is the silicon used?Quick Path Interconnect

L3Ex

Ord

Cache L1 + L2Transl. $Branch $

Quick Path Interconnect

Memory Interface

Ex

Cache L1 + L2Transl. $Branch $

AVDARK

Tag

I decodeRe order

Re-orderBranch $

L3 Cache


AVDARK2011

Data array

Source: JSSC Jan 2010, Rusu et. al

I-decodeRe-order

Example: Intel i7 ”Nehalem”Example: Intel i7 NehalemCoherence = NUMA + funky memory models

Coherence Coherence

DRAMI/F

100

I/F

AVDARK

DRAM


AVDARK2011

What is computer architecture?

“Bridging the gap between programs and t i t ”transistors”

“Finding the best model to execute the programs” programs”

AVDARK

best={fast, cheap, energy-efficient, reliable, predictable, …}


AVDARK2011 …

Caches and more cachesCaches and more cachesor

spam, spam, spam and spam

Erik HagerstenErik HagerstenUppsala University, Sweden

[email protected]

Fix: Use a cacheFix: Use a cache

I R B M MWI

I R B M MW

I R B M MW

I

IIssuelogic

I R B M MW

I R B M MW

I

Ilogic

Regs

~32kB$~1 cycles

AVDARK 200 l 1GB

$


AVDARK2011 Mem200cycles 1GB

Webster about “cache”

1. cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press] together fr L coactare to compel frcoacticare to press] together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT 1a: ahiding place esp. for concealing and preserving provisions or g p p g p g pimplements 1b: a secure place of storage 2: something hidden or stored in a cache

AVDARK


AVDARK2011

Cache knowledge useful when...

Designing a new computerWriting an optimized program

or compilerpor operating system …

Implementing software cachingImplementing software cachingWeb cachesP o ie

AVDARK

ProxiesFile systems


AVDARK2011

Memory/storage

SRAM DRAM disk

sram

AVDARK

2000: 1ns 1ns 3ns 10ns 150ns 5 000 000ns1kB 64k 4MB 1GB 1 TB


AVDARK2011

(1982: 200ns 200ns 200ns 10 000 000ns)

Address Book CacheLooking for Tommy’s Telephone Number

UT

TOMMY 12345 Indexing

ZYXVUTOMMY 12345

ZYXV

Indexingfunction

ÖÄÅZ

ÖÄÅZ

“Address Tag” “Data”

AVDARK

Address Tag

One entry per page =>

Data


AVDARK2011 Direct-mapped caches with 28 entries

Address Book CacheLooking for Tommy’s Number

TOMMY

UT

OMMY 12345

TOMMYindex

ZYXVUOMMY 12345

ÖÄÅZ

EQ?

AVDARK


AVDARK2011

Address Book CacheLooking for Tomas’ Number

TOMAS

UT

OMMY 12345

TOMASindex

ZYXVUOMMY 12345

ÖÄÅZ

EQ?

AVDARKMiss!


AVDARK2011 Lookup Tomas’ number in

the telephone directory

Address Book CacheLooking for Tomas’ Number

TOMAS

UT

OMMY 12345

TOMASindexOMAS 23457

ZYXVUOMMY 12345

Z

ÖÄÅ

Replace TOMMY’s datawith TOMAS’ data

AVDARK

with TOMAS’ data. There is no other choice(direct mapped)


AVDARK2011

(direct mapped)

CacheCacheaddress

haddress Memory

dd ess

CPUCache

hit

y

data (a word)

data

AVDARK


AVDARK2011

Cache Organizationg

CacheCache

OMAS 23457

TOMAS(1)

1 OMAS 23457index

(4) (4)

1

28 entries

(4) (4)Addrtag

=(1)Valid

AVDARK

&


AVDARK2011 Hit (1) Data (5 digits)

Cache Organization (really)4kB, direct mapped Ordinary

Memory

1

00100110000101001010011010100011

(?)0101001 0010011100101

ymsb lsb

index1

1k entries of 4 bytes each

(?)

0101001 0010011100101

32 bit address

”Byte”

What is a good

Addr

(?)(?)identifying

a byte in memory

a good index

function

=

Addrtag

(1)Valid

AVDARK ( )(32)&

(1)


AVDARK2011 Hit?(1) Data

Cache Organization4kB di d4kB, direct mapped

32 bit address Identifies the byte within a word00100110000101001010011010100011

(10)

32 bit address Identifies the byte within a word

msb lsb

index1

1k entries of 4 bytes each

(10)0101001 0010011100101

each(20) (20)Mem

O h d

=

Addrtag

(1)ValidOverhead:

21/32= 66%

AVDARK(32)&

(1)Latency =SRAM+CMP+AND


AVDARK2011

Hit?(1) Data

CacheCacheaddress

haddress Memory

dd ess

CPUCache

hit

y

data (a word)

data

Hi U h d id d f h h

AVDARK

Hit: Use the data provided from the cache~Hit: Use data from memory and also store it in

the cache


AVDARK2011

the cache

Cache performance parameters

Cache “hit rate” [%]Cache “miss rate” [%] (= 1 hit rate) Cache miss rate [%] (= 1 - hit_rate) Hit time [CPU cycles]Miss time [CPU cycles]Miss time [CPU cycles]Hit bandwidthMiss bandwidthMiss bandwidthWrite strategy

AVDARK

….


AVDARK2011

How to rate architecture f ?performance?

Marketing:Frequency / Numbe of coresFrequency / Numbe of cores…

Architecture “goodness”: CPI = Cycles Per InstructionIPC = Instructions Per Cycley

Benchmarking:SPEC f SPEC i t

AVDARK

SPEC-fp, SPEC-int, …TPC-C, TPC-D, …


AVDARK2011

Cache performance exampleCache performance exampleAssumption:

fi i b d id hInfinite bandwidthA perfect 1.0 CyclesPerInstruction (CPI) CPU100% instruction cache hit rate100% instruction cache hit rate

Total number of cycles =#Instr * ( (1 - mem ratio) * 1 +#Instr. ( (1 mem_ratio) 1 +

mem_ratio * avg_mem_accesstime) =

= #Instr * ( (1 mem ratio) += #Instr * ( (1- mem_ratio) + mem_ratio * (hit_rate * hit_time +

(1 - hit_rate) * miss_time)

AVDARK

CPI = 1 -mem_ratio + mem_ratio * (hit_rate * hit_time +


AVDARK2011 (1 - hit_rate) * miss_time)

Example NumbersExample Numbers

CPI = 1 - mem_ratio + mem_ratio * (hit_rate * hit_time) +mem_ratio * (1 - hit_rate) * miss_time)

mem ratio = 0 25mem_ratio = 0.25hit_rate = 0.85hit time = 3hit_time 3miss_time = 100

AVDARK

CPI = 0.75 + 0.25 * 0.85 * 3 + 0.25 * 0.15 * 100 =

0 75 + 0 64 + 3 75 = 5 14


AVDARK2011

0.75 + 0.64 + 3.75 = 5.14CPU HIT MISS

What if ...

CPI = 1 -mem_ratio + mem ratio * (hit rate * hit time) +mem_ratio (hit_rate hit_time) +mem_ratio * (1 - hit_rate) * miss_time)

mem_ratio = 0.25hit_rate = 0.85hit ti 3 CPU HIT MISShit_time = 3 CPU HIT MISSmiss_time = 100 == > 0.75 + 0.64 + 3.75 = 5.14

•Twice as fast CPU ==> 0.37 + 0.64 + 3.75 = 4.77

AVDARK

•Faster memory (70c) ==> 0.75 + 0.64 + 2.62 = 4.01

I hit t (0 95) > 0 75 + 0 71 + 1 25 2 71


AVDARK2011

•Improve hit_rate (0.95) => 0.75 + 0.71 + 1.25 = 2.71

H t t ff ti hHow to get more effective caches:

Larger cache (more capacity)Cache block size (larger cache lines)( g )More placement choice (more associativity)Innovative caches (victim, skewed, …)Innovative caches (victim, skewed, …)Cache hierarchies (L1, L2, L3, CMR)Latency-hiding (weaker memory models)Latency hiding (weaker memory models)Latency-avoiding (prefetching)Cache avoiding (cache bypass)

AVDARK

Cache avoiding (cache bypass)Optimized application/compiler


AVDARK2011 …

Wh d i i hWhy do you miss in a cache

Mark Hill’s three “Cs”Compulsory miss (touching data for the first time)Capacity miss (the cache is too small)Conflict misses (non-ideal cache implementation)(too many names starting with “H”)

(Multiprocessors)Communication (imposed by communication)

AVDARK

False sharing (side-effect from large cache blocks)


AVDARK2011

Avoiding Capacity Misses –a huge address booka huge address bookLots of pages. One entry per page.

ÖU

ÖT

LING 12345 New

Y

X

ÖV

ÖULING 12345

Ö

ÖY

ÖX

NewIndexingfunction

Ö

Ä

Å

Z

ÖÖ

ÖÄ

ÖÅ

ÖZfunction

“Address Tag” “Data”

AVDARK

Address Tag

One entry per page =>

Data


AVDARK2011 Direct-mapped caches with 282 entries 784 entries

Cache Organization1MB di d1MB, direct mapped

32 bit address Identifies the byte within a word00100110000101001010011010100011

(18)


msb lsb

index1

256k entries

(18)0101001 0010011100101

Mem (12) (12)

Mem Overhead:13/32= 40%

=

Addrtag

(1)Valid

13/32= 40%

Latency =

AVDARK(32)&

(1)Latency =SRAM+CMP+AND


AVDARK2011

Hit?(1) Data

Pros/Cons Large Caches++ The safest way to get improved hit rate

SRAM i !!-- SRAMs are very expensive!!-- Larger size ==> slower speed

more load on “signals”longer distanceslonger distances

-- (power consumption)( li bilit )

AVDARK

-- (reliability)


AVDARK2011

Why do you hit in a cache?Temporal locality

Lik l t th d t i Likely to access the same data again soon

Spatial localityLikely to access nearby data again soon

Typical access pattern:(inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, ...

AVDARK

, , , , , , , , ,

temporal spatial


AVDARK2011

temporal spatial

Fetch more than a word: cache blocks(a.k.a cache line)

Identifies the word within a cache lineIdentifies a byte within a word

cac e b oc s(a a cac e e)1MB, direct mapped, CacheLine=16B

Identifies a byte within a word00100110000101001010011010100011

(16)msb lsb

1

64k entries

0101001 0010011100101(16)index

0010011100101 0010011100101 0010011100101

Mem entries(12) (12)

(32) (32) (32) (32)Overhead:13/128= 10%

Multiplexer(2)= 128 bitsLatency =SRAM+CMP+AND

AVDARK (32)

Multiplexer(4:1 mux)

( )Select

&(1)

8 b sSRAM+CMP+AND


AVDARK2011

(32)DataHit?(1)

Example in ClasspDirect mapped cache:

Cache size = 64 kB C h li 16 BCache line = 16 BWord size = 4B32 bits address (byte addressable)

“There are 10 kinds of people in the world:Th h d t d bi b d

AVDARK

Those who understand binary number and those who do not.”


AVDARK2011

Pros/Cons Large Cache Lines/ g++ Explores spatial locality++ Fits well with modern DRAMs++ Fits well with modern DRAMs

* first DRAM access slow* subsequent accesses fast (“page mode”)

-- Poor usage of SRAM & BW for some patternsPoor usage of SRAM & BW for some patterns-- Higher miss penalty (fix: critical word first)

(False sharing in multiprocessors)-- (False sharing in multiprocessors)

P f

AVDARK

Perf


AVDARK2011

Cache line size

UART: StatCache Graph app=matrix multiplyapp matrix multiply

Small cache:Short cache lines are better

Large cachesLarge cachesLonger cache lines are better

Huge caches:Everything fits regardless of CL size

AVDARK

Everything fits regardless of CL size


AVDARK2011 Thanks: Dr. Erik Berg

Note: this is just a single example, butthe conclusion typically holds for mostapplications.

Cache ConflictsTypical access pattern:

(inner loop stepping through an array) (inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, …

temporal spatial

What if B and C index to the same cache location

AVDARK

What if B and C index to the same cache locationConflict misses -- big time!Potential performance loss 10-100x


AVDARK2011

p

Address Book CacheTwo names per page: index first, then search.

TOMMY

UT

OMAS

TOMMYindex

23457

ZYXVUOMAS

EQ?

23457

ÖÄÅZ

12345OMMY

EQ?EQ?

AVDARK


AVDARK2011

Avoiding conflict: More associativityg y1MB, 2-way set-associative, CL=4B

Identifies a byte within a word

1

00100110000101001010011010100011

(17)0101001 0010011100101 1 0101001 0010011100101

msb lsb

index1

128k “sets”

0101001 0010011100101 1 0101001 0010011100101

(13) (13) One “set”

(32) (32)= =

Multiplexer&(1)

&

AVDARK H h ld th (32)

(2:1 mux)

Hit?

(1)Select“logic”Latency =

SRAM+CMP+AND+


AVDARK2011

How should the select signal be

produced?

DataLOGIC+MUX

Pros/Cons Associativity

++ Avoids conflict misses-- Slower access time

More complex implementation-- More complex implementationcomparators, muxes, ...

-- Requires more pins (for external SRAM )

AVDARK

SRAM…)


AVDARK2011

Going all the way…!1MB f ll i i CL 16B1MB, fully associative, CL=16B

Identifies a byte within a wordIdentifies the word within a cache line

00100110000101001010011010100011One “set”

(0)

Identifies a byte within a word

(2)index

1(0)

(28)0101001 0010011100101

(13)

1 0101001 0010011100101 1 0101001 0010011100101 1

= 16B 16B= = = 16B...

& Multiplexer& & &64kcomparators

AVDARK Hit? 4B

u p e e(256k:1 mux)

Select (16)“logic”

p


AVDARK2011

Hit? 4BData

Fully Associative

Very expensiveOnly used for small caches (and sometimes TLBs)so et es s)

CAM = Contents-addressable memory~Fully-associative cache storing key+data

AVDARK

y g yProvide key to CAM and get the associated data


AVDARK2011

data

A combination thereof1MB, 2-way, CL=16B1MB, 2 way, CL=16B


001001100001010010100110101000110

(15)

de es by e w wo d

msb lsb

index1

32k “sets”

( )

(13)0101001 0010011100101

(13)

1 0101001 0010011100101

=sets

(128) (128)=

& Multiplexer&(2)

(256)

AVDARK Hit? (32)

(8:1 mux)(1)Select“logic”


AVDARK2011

Hit? ( )Data

Example in Class

Cache size = 2 MBCache line = 64 BWord size = 8B (64 bits)Word size = 8B (64 bits)4-way set associative32 bits address (byte addressable)

AVDARK


AVDARK2011

Who to replace?pPicking a “victim”

L t tl d ( k LRU)Least-recently used (aka LRU)Considered the “best” algorithm (which is not always true )always true…)Only practical up to limited number of ways

Not most recently usedNot most recently usedRemember who used it last: 8-way -> 3 bits/CL

Pseudo LRUPseudo-LRUE.g., based on course time stamps. Used in the VM system

AVDARK

Used in the VM systemRandom replacement

Can’t continuously to have “bad luck


AVDARK2011

Can t continuously to have bad luck...

Cache Model: Random vs. LRU

LRURandom

Random

LRU

AVDARK

art (SPEC 2000) equake (SPEC 2000)


AVDARK2011

4-way sub-blocked cache1MB, direct mapped, Block=64B, sub-block=16B1MB, direct mapped, Block=64B, sub block=16B


Sub block within a block

1

00100110000101001010011010100011

(14)0101001 0010011100101 00100111001010 1 0

msb lsb

index1

16k 0101001 0010011100101 00100111001010 1 0

Mem

=

(12) (12)(128) (128) (128) (128)Overhead:

16/512= 3% =

& 16:1 mux(2)

512 bits& & &

AVDARK

(4)

(32)

& 16:1 mux& & &

logic4:1 mux(2)


AVDARK2011 Hit? Data

Pros/Cons Sub-blocking

++ Lowers the memory overhead++ (Avoids problems with false sharing -- MP)++ Avoids problems with bandwidth waste ++ Avoids problems with bandwidth waste -- Will not explore as much spatial locality

Still poor utilization of SRAM -- Still poor utilization of SRAM -- Fewer sparse “things” allocated

AVDARK


AVDARK2011

Replacing dirty cache linesWrite-back

W it di t d t b k t ( t l l) t Write dirty data back to memory (next level) at replacementA “dirty bit” indicates an altered cache lineA dirty bit indicates an altered cache line

Write-through l i h h h l l ( ll)Always write through to the next level (as well)

data will never be dirty no write-backs

AVDARK


AVDARK2011

Write Buffer/Store Buffer/

Do not need the old value for a storeDo not need the old value for a store

cachecache

WB:==WB:

stores loads

==

CPU

AVDARKOne option: Write around (no write allocate


AVDARK2011

p (in caches) used for lower level smaller caches

Innovative cache: Victim cache

address

Cacheaddress

hit Memory

CPU datahit Memory

dd

data

VCaddresshit

Victim Cache (VC): a small fairly associative cache (~10s of entries)data (a word)

AVDARK

Victim Cache (VC): a small, fairly associative cache (~10s of entries)Lookup: search cache and VC in parallelCache replacement: move victim to the VC and replace in VC


AVDARK2011

p pVC hit: swap VC data with the corresponding data in Cache“A second life ☺”

Skewed Associative CacheSkewed Associative CacheA, B and C have a three-way conflict

2 way 4-way 2-way skewed2-wayAB

4 wayAB

2 way skewedAB

C C C

It has been shown that 2-way skewed performs roughly the same as 4-way caches

AVDARK

roughly the same as 4 way caches


AVDARK2011

Skewed-associative cache:Different indexing functionsDifferent indexing functions


1

00100110000101001010011010100011

(17)0101001 0010011100101f(>18)

msb lsb

index1

128k entries(13)

0101001 0010011100101

1 0101001 0010011100101

f1

f

( 18)

(>18) (17)1 0101001f2

= = (32) (32)

AVDARK

& 2:1mux&

function (32)


AVDARK2011

function (32)

UART: Elbow cacheUART: Elbow cacheIncrease “associativity” when needed

If severe conflict:make roomConflict!!

AB

AB

AB

CC

Performs roughly the same as an 8-way cacheSlightly faster

AVDARK

Slightly fasterUses much less power!!


AVDARK2011

Topology of caches: Harvard Arch

CPU needs a new instruction each cycleCPU needs a new instruction each cycle25% of instruction LD/ST D t d I t h diff t ttData and Instr. have different access patterns

==> Separate D and I first level cache==> Unified 2nd and 3rd level caches

AVDARK


AVDARK2011

Cache Hierarchy of TodayCache Hierarchy of Today

DRAM Memory

L3$ Off-chip SRAM (today more commonly on chip)

L2$

$

Use whatever transistors there

commonly on-chip)

L1 D$hi

$on-chip

L1 I$hi

i left for the Last-Level Cache (LLC)

AVDARKCPU

on-chip on-chip

Small enough to keep up with the CPU speed

Separate I cache to allow for instruction fetch and data fetch in


AVDARK2011

U p d and data fetch in parallel

HW prefetching ...a little green man that anticipates your next

memory access and prefetches the data to y pthe cache.

Improves MLP!Sequential prefetching: Sequential streams [to a page]. Some number of prefetch streams supported Often only for L2 and L3 streams supported. Often only for L2 and L3. PC-based prefetching: Detects strides from the same PC Often also for L1

AVDARK

the same PC. Often also for L1. Adjacent prefetching: On a miss, also bring in the “next” cache line. Often only for L2 and


AVDARK2011

yL3.

Hardware prefetchingHardware ”monitor” looking for patterns in memory accessesmemory accessesBrings data of anticipated future accesses into the cache prior to their usageinto the cache prior to their usageTwo major types:

Sequential prefetching (typically page-based 2nd Sequential prefetching (typically page-based, 2nd level cache and higher). Detects sequential cache lines missing in the cache.

AVDARK

PC-based prefetching, integrated with the pipeline. Finds per-PC strides. Can find more complicated patterns


AVDARK2011

patterns.

Cache Capacity/Latency/BW

AVDARK


AVDARK2011

Cache implementation Cacheline, here 64B: Cache implementationAT S Data = 64B

Generic Cache:

Addr [63..0]MSB LSB SRAM:

Caches at all level roughly work like this:

L3 € 24MB

index

L2 $256kB

==

index

256kB

D1 ¢

...

=======

I1 ¢

AVDARK

D1 ¢64kB

muxHit

Sel way ”6”

I1 ¢64kB


AVDARK2011

Data = 64B

Address Book AnalogygyTwo names per page: index first, then search.

TOMMY

UT

OMAS

TOMMYindex

23457

ZYXVUOMAS

EQ?

23457

ÖÄÅZ

12345OMMY

EQ?EQ?

AVDARK

Select the second entry!


AVDARK2011

Cache lingogCacheline: Data chunk move to/from a cache Cache set: Fraction of the cache identified by the indexyAssociativity: Number of alternative storage places for a cachelineReplacement policy: picking the victim to throw out Replacement policy: picking the victim to throw out from a set (LRU/Random/Nehalem) Temporal locality: Likelihood to access the same data p yagain soonSpatial locality: Likelihood to access nearby data again soonsoon

Typical access pattern:(inner loop stepping through an array)

AVDARK

(inner loop stepping through an array) A, B, C, A+4, B, C, A+8, B, C, ...


AVDARK2011

temporal spatial

Cache Lingo Picture Cacheline, here 64B: Cache Lingo PictureAT S Data = 64B

Generic Cache:

Addr [63..0]MSB LSB SRAM:

Cache Set:

tag index

CAssociativity= 8-way

...

AVDARK

muxHit

Sel way ”6”


AVDARK2011

Data = 64B

E l N h l i7 ( l )Exempel Nehalem i7 (one example)

AVDARK

32kB8-way pLRU

256kB8-way pLRUnon-incl

8MB16-way Neh.-replnon-incl


AVDARK2011

non incl non incl

Take-away message: CachesTake away message: Caches

Cache are fast but smallCache space in cache-line chunks (~64bytes) LSB part of the address is used to find the LSB part of the address is used to find the ”set” (aka, indexing)Th i li it d b f h li There is a limited number of cache lines per set (associativity)Typically, several levels of caches

AVDARK

The most important target for optimizations


AVDARK2011

How are we doing?g

Create and explore locality:Create and explore locality:a) Spatial locality b) Temporal localityb) Temporal localityc) Geographical locality

Create and explore parallelismCreate and explore parallelisma) Instruction level parallelism (ILP)b) Thread level parallelism (TLP)c) Memory level parallelism (MLP)

AVDARK


AVDARK2011

Documents

Goal for this course AVDARK on the web - Uppsala … computer science lab 1985-1988 ÎNetInsight, ... ≈30 APZ 212 @ 5MH30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer