
Welcome to AVDARK

Erik Hagersten
Uppsala University

  Ericsson, CPU designer 1982-84: APZ 212
  MIT 1984-85: Dataflow parallel architecture
  Ericsson computer science lab 1985-1988: NetInsight, (Erlang)
  SICS, parallel architectures 1988-1993: COMA, (Virtutech)
  Sun Microsystems, chief architect servers 1993-1999: WildFire, E6000, E15000, E25000
  Professor Uppsala University 1999-: New modeling + Acumem
  Startup Acumem 2006-2010: ThreadSpotter
  Chief scientist Rogue Wave Software 2011-

Dept of Information Technology | www.it.uu.se | © Erik Hagersten | http://user.it.uu.se/~eh

AVDARK 2011

Goal for this course

Understand how and why modern computer systems are designed the way they are:
  pipelines
  memory organization
  virtual/physical memory ...

Understand how and why multiprocessors are built
  Cache coherence
  Memory models
  Synchronization…

Understand how and why parallelism is created and explored
  Instruction-level parallelism
  Memory-level parallelism
  Thread-level parallelism…

Understand how and why multiprocessors of combined SIMD/MIMD type are built
  GPU
  Vector processing…

Understand how computer systems are adapted to different usage areas
  General-purpose processors
  Embedded/network processors…

Understand the physical limitations of modern computers
  Bandwidth
  Energy
  Cooling…

AVDARK on the web

www.it.uu.se/edu/course/homepage/avdark/ht11

  Welcome!
  News
  FAQ
  Schedule
  Slides
  New Papers
  Assignments
  Reading instr 4:ed
  Exam

AVDARK in a nutshell

Literature
  Computer Architecture: A Quantitative Approach (4th edition)
  John Hennessy & David Patterson

Lecturer
  Erik Hagersten gives most lectures and is responsible for the course.
  Andreas Sandberg is responsible for the labs and the hand-ins.
  Jakob Carlström from Xelerated will teach network processors.
  Sverker Holmgren will teach parallel programming.
  David Black-Schaffer will teach about graphics processors.

Mandatory Assignment
  There are four lab assignments that all participants have to complete before a hard deadline. Each can earn you a bonus point.

Optional Assignment
  There are four (optional) hand-in assignments. Each can earn you a bonus point.

Examination
  Written exam at the end of the course. No books are allowed.

Bonus system
  64p max/32p to pass. For each bonus point, there is a corresponding 4p bonus question. Full bonus => Pass.

Schedule in a nutshell

1. Memory Systems (~Appendix C in 4th Ed)
   Caches, VM, DRAM, microbenchmarks, optimizing SW

2. Multiprocessors
   TLP: coherence, memory models, interconnects, scalability, clusters, …

3. Scalable Multiprocessors
   Scalability, synchronization, clusters, …

4. CPUs
   ILP: pipelines, scheduling, superscalars, VLIWs, Vector instructions…

5. Widening + Future (~Chapter 1 in 4th Ed)
   Technology impact, GPUs, Network processors, Multicores (!!)

Exam and bonus structure

  4 mandatory labs
  4 hand-ins (optional)
  Written exam

How to get a bonus point:
  Complete the extra bonus activity at the lab occasion
  Complete the optional bonus hand-in [with reasonable accuracy] before a hard deadline

32p/64p at the exam = PASS

Crash Course in Computer Architecture
(covering the course in 45 min)

Erik Hagersten
Uppsala University

≈30 years ago: APZ 212 @ 5 MHz
”the AXE supercomputer”

APZ 212 marketing brochure quotes:

  ”Very compact”
    6 times the performance
    1/6th the size
    1/5 the power consumption
  ”A breakthrough in computer science”
  ”Why more CPU power?”
  ”All the power needed for future development”
  ”…800,000 BHCA, should that ever be needed”
  ”SPC computer science at its most elegance”
  ”Using 64 kbit memory chips”
  ”1500W power consumption”

CPU Improvements

[Figure: relative performance (log scale, 1 to 1000) versus year (1970 to 2000).]

How to get efficient architectures…

  Higher clock rate
  Create and explore locality:
    a) Spatial locality
    b) Temporal locality
    c) Geographical locality
  Create and explore parallelism:
    a) Instruction-level parallelism (ILP)
    b) Thread-level parallelism (TLP)
    c) Memory-level parallelism (MLP)

Load/Store architecture (e.g., ”RISC”)

  ALU ops: Reg --> Reg
  Mem ops: Reg <--> Mem

[Figure: the compiler manages explicit registers; Load/Store instructions are the only memory accesses and move data between Mem and the registers; the ALU operates on registers only.]

Three regs per ALU op: Source1, Source2, Destination

Example: C = A + B

  Load  R1, [A]
  Load  R3, [B]
  Add   R2, R1, R3
  Store R2, [C]
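The C = A + B example can be traced with a few lines of Python. This is only a sketch: the tuple encoding of the instructions, the `run` function and the memory contents (A = 2, B = 3) are assumptions made for illustration and do not come from the slides.

```python
# Minimal sketch of a load/store machine (hypothetical instruction encoding).
# Only Load and Store touch memory; Add works on registers only.
def run(program, memory):
    regs = {}
    for op, *args in program:
        if op == "Load":            # Reg <-- Mem
            dst, addr = args
            regs[dst] = memory[addr]
        elif op == "Store":         # Reg --> Mem
            src, addr = args
            memory[addr] = regs[src]
        elif op == "Add":           # Destination, Source1, Source2
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        else:
            raise ValueError(f"unknown op {op}")
    return memory

program = [
    ("Load", "R1", "A"),
    ("Load", "R3", "B"),
    ("Add", "R2", "R1", "R3"),
    ("Store", "R2", "C"),
]
memory = run(program, {"A": 2, "B": 3, "C": 0})
print(memory["C"])
```

Note how the Add never names a memory location: every operand first has to be brought into a register by an explicit Load.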

Lifting the CPU hood (simplified…)

[Figure: the instructions A, B, C, D are stored in Mem and fed to the CPU.]

Pipeline

[Figure sequence: the instructions A to D wait in Mem; instruction A moves one stage per cycle through the four-stage pipeline I R X W, which is connected to the register file (Regs) and to Mem.]

  I = Instruction fetch
  R = Read register
  X = Execute
  W = Write register/mem

Example ALU operation:

  R1 := R2 op R3     (ADD R1, R2, R3), where op is e.g. +, -, *, /

[Figure: the pipeline I R X W; after the instruction fetch, registers 2 and 3 are read in stage R, OP is applied in stage X, and register 1 is written in stage W.]

Initially

  A: LD RegA, (100 + RegC)
  B: RegB := RegA + 1
  C: RegC := RegC + 1
  D: IF RegC < 100 GOTO A

[Figure: the program above in Mem with the PC pointing at A, next to an empty pipeline I R X W with Regs.]

Cycle 1 to Cycle 5

[Figure sequence: the PC steps through the program and the instructions enter the pipeline I R X W one after another; the adder (+) in stage X is active in cycles 3 and 4; in cycle 5 the PC points at A again.]

Data dependency

Previous execution example wrong!
Instruction B reads RegA before instruction A has written it.

[Figure: the same program (A to D) and pipeline I R X W as before.]

Data dependency fix: pipeline delays

[Figure: the program A to D flowing through the pipeline I R X W, with ”Stall” cycles inserted between the instructions so that a register is written before it is read.]

Branch delay: each iteration takes 11 cycles

[Figure: as before, with additional ”Stall” cycles after the branch D; PC := next instruction address is only known once the branch has executed.]

Need to find Instruction-Level Parallelism (ILP) to avoid stalls!

It is actually a lot worse!
Modern CPUs: ”superscalars” with ~4 parallel pipelines

[Figure: issue logic feeding four parallel I R X W pipelines that share Regs and Mem.]

Need to find 4x more ILP

  + Higher throughput
  - More complicated architecture
  - Branch delay more expensive (more instr. missed)
  - Harder to find ”enough” independent instr. (need 8 instr. between write and use)

It is actually a lot worse:
Modern CPUs: ~10-20 pipeline stages

[Figure: issue logic feeding four parallel, deeper pipelines (I R B M M W) that share Regs and Mem.]

More pipeline stages can run at higher freq ☺
But will need to find ~10x more ILP

  + Shorter cycle time (higher GHz)
  - Branch delay even more expensive
  - Even harder to find ”enough” independent instr.

Fix: Out-of-order execution
Improving ILP

  LD  R1, M(100)
  ADD R3, R2, R1
  SUB R5, R6, R7
  ST  R5, M(100)
  >=0? (Y/N)
  LD ...
  ADD ...
  SUB ...
  ST ...

The HW may execute instructions in a different order, but will make the ”side-effects” of the instructions appear in order.

  Assume that LD takes a long time.
  The ADD is dependent on the LD
  Start the SUB and ST before the ADD
  Update R5 and M(100) after R3

Fix: Branch prediction

  LD  R1, M(100)
  ADD R3, R2, R1
  SUB R5, R6, R7
  ST  R5, M(100)
  >=0? (Y/N)
  LD ...
  ADD ...
  SUB ...
  ST ...

The HW can guess if the branch is taken or not and avoid branch stalls if the guess is correct.

Assume the guess is ”Y”.

The HW can start to execute these instructions before the outcome of the branch is known, but cannot allow any ”side-effect” to take place until the outcome is known.

Fix: Scheduling Past Branches
Improving ILP

[Figure: a chain of instruction blocks (LD, ADD, SUB, ST) separated by branches (>=0?, >1?, <2?, =0?); each branch is marked ”Predict taken” (Y), and the predicted path runs through the blocks.]

All instructions along the predicted path can be executed out-of-order

Fix: Scheduling Past Branches
Improving ILP

[Figure: the same chain of instruction blocks and branches; the actual path diverges from the predicted path. Wrong prediction!!!]

Throw away, i.e., no side effects!

It is actually a lot worse:
Modern MEM: ~200 CPU cycles

[Figure: the four parallel pipelines with issue logic and Regs; Mem is 200 cycles away.]

Need more ILP again!

Fix: Use cache(s)

[Figure: a cache ($, << 1GB, 1-30 cycles) sits between the pipelines and Mem (1GB, 200 cycles).]

Still need some more ILP, but also Memory-Level Parallelism (MLP)

Whoops, using too much power (2007)

Running at 2x the frequency will use much more than 2x the power

It is also really hard to find enough ILP

Fix: Multicore

[Figure: four CPUs (t threads), each with its own L1 cache ($1), share an L2$, a Mem I/F to Mem and an External I/F.]

But now we also need to find Thread-Level Parallelism (TLP)

Example: Intel i7 ”Nehalem”

[Figure labels: Coherence, Coherence, DRAM.]

How is the silicon used (i7-Ex)?

[Figure: die photo; labels: 3 Mbyte, Tag, Data array.]

Source: JSSC Jan 2010, Rusu et al.

How is the silicon used?

[Figure: annotated die photo. Labels: Quick Path Interconnect, Memory Interface, L3 Cache (Tag, Data array) and, per core, Ex, Ord, Cache L1 + L2, Transl. $, Branch $, I-decode, Re-order.]

Source: JSSC Jan 2010, Rusu et al.

Example: Intel i7 ”Nehalem”
Coherence = NUMA + funky memory models

[Figure labels: Coherence, Coherence, DRAM, I/F, 100.]

What is computer architecture?

“Bridging the gap between programs and transistors”

“Finding the best model to execute the programs”

best = {fast, cheap, energy-efficient, reliable, predictable, …}

Caches and more caches
or
spam, spam, spam and spam

Erik Hagersten
Uppsala University, Sweden

Fix: Use a cache

[Figure: a cache ($, ~32kB, ~1 cycle) sits between the pipelines and Mem (1GB, 200 cycles).]

Webster about “cache”

1. cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT] 1a: a hiding place esp. for concealing and preserving provisions or implements 1b: a secure place of storage 2: something hidden or stored in a cache

Cache knowledge useful when...

  Designing a new computer
  Writing an optimized program
    or compiler
    or operating system …
  Implementing software caching
    Web caches
    Proxies
    File systems

Memory/storage

[Figure: the memory hierarchy, from SRAM through DRAM to disk.]

  Access time, 2000:   1ns  1ns  3ns  10ns  150ns  5 000 000ns
  Size:                1kB  64k  4MB  1GB  1 TB
  (Access time, 1982:  200ns  200ns  200ns  10 000 000ns)

Address Book Cache
Looking for Tommy’s Telephone Number

[Figure: an address book with one page per letter (…, T, U, V, X, Y, Z, Å, Ä, Ö). The indexing function selects page T, which holds the “Address Tag” OMMY and the “Data” 12345.]

One entry per page => direct-mapped cache with 28 entries

Address Book Cache
Looking for Tommy’s Number

[Figure: TOMMY is split into the index T and the tag OMMY; the index selects page T, which holds (OMMY, 12345); the stored tag is compared (EQ?) with the tag of the name.]

Address Book Cache
Looking for Tomas’ Number

[Figure: TOMAS is split into the index T and the tag OMAS; page T holds (OMMY, 12345), so the comparison (EQ?) fails.]

Miss!
Look up Tomas’ number in the telephone directory

Address Book Cache
Looking for Tomas’ Number

[Figure: page T now holds (OMAS, 23457) instead of (OMMY, 12345).]

Replace TOMMY’s data with TOMAS’ data. There is no other choice (direct mapped)
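The address book walk-through can be mimicked in a few lines. This is a hedged sketch, not part of the slides: the class name `AddressBookCache`, the `DIRECTORY` dictionary standing in for the telephone directory, and the choice of the first letter as the indexing function are illustrative assumptions.

```python
# Sketch of the address book as a direct-mapped cache (names are hypothetical).
# The "telephone directory" plays the role of slow main memory.
DIRECTORY = {"TOMMY": "12345", "TOMAS": "23457"}

class AddressBookCache:
    def __init__(self):
        self.pages = {}  # index letter -> (address tag, data); one entry per page

    def lookup(self, name):
        index, tag = name[0], name[1:]       # indexing function: first letter
        entry = self.pages.get(index)
        if entry is not None and entry[0] == tag:
            return entry[1], True            # hit: tag matches (EQ?)
        data = DIRECTORY[name]               # miss: consult the directory
        self.pages[index] = (tag, data)      # replace whatever was on the page
        return data, False

book = AddressBookCache()
results = [book.lookup(n) for n in ["TOMMY", "TOMMY", "TOMAS", "TOMMY"]]
for name, (number, hit) in zip(["TOMMY", "TOMMY", "TOMAS", "TOMMY"], results):
    print(name, number, "hit" if hit else "miss")
```

The last lookup misses again because TOMAS evicted TOMMY from page T; with one entry per page there is nowhere else to put it.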

Cache

[Figure: the CPU sends an address to the Cache, which forwards it to Memory; the Cache returns hit and data (a word) to the CPU; Memory returns data to the Cache.]

Cache Organization

[Figure: TOMAS is split into an index (1 letter) and an address tag (4 letters, OMAS); the index selects one of 28 entries, each holding a valid bit (1), an address tag (4) and data (5 digits, 23457); Hit (1) = valid & (stored tag = tag of the name).]

Cache Organization (really)
4kB, direct mapped

[Figure: a 32 bit address (msb … lsb, 00100110000101001010011010100011), identifying a byte in ordinary memory, is split into an index (?) and an address tag (?); the index selects one of 1k entries of 4 bytes each, with a valid bit (1), an address tag and data (32) per entry; Hit? (1) = valid & (stored tag = address tag).]

What is a good index function?

Cache Organization
4kB, direct mapped

[Figure: the 32 bit address 00100110000101001010011010100011 (msb … lsb) is split into an address tag (20 bits), an index (10 bits) and 2 bits that identify the byte within a word; the index selects one of 1k entries of 4 bytes each; Hit? (1) = valid (1) & (stored tag = address tag); Data is 32 bits.]

Overhead: 21/32 = 66%
Latency = SRAM + CMP + AND
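The address split for the 4kB direct-mapped cache can be written out with shifts and masks. The sketch below assumes the straightforward index function (the 10 bits just above the 2 byte-offset bits); the function name `split_address` is made up for illustration.

```python
# Sketch: split a 32-bit address for a 4kB direct-mapped cache with
# 1k entries of 4 bytes (2 offset bits, 10 index bits, 20 tag bits).
def split_address(addr, index_bits=10, offset_bits=2):
    offset = addr & ((1 << offset_bits) - 1)                 # byte within the word
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # selects the entry
    tag = addr >> (offset_bits + index_bits)                 # stored and compared
    return tag, index, offset

# The example address used on the slide.
addr = int("00100110000101001010011010100011", 2)
tag, index, offset = split_address(addr)
print(f"tag={tag:020b} index={index:010b} byte={offset:02b}")
```

Using the low-order bits as index means that consecutive words map to consecutive entries, which suits the array-stepping access patterns discussed later.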

Cache

[Figure: CPU, Cache and Memory connected by address and data paths, as before.]

  Hit: Use the data provided from the cache
  ~Hit: Use data from memory and also store it in the cache

Cache performance parameters

  Cache “hit rate” [%]
  Cache “miss rate” [%] (= 1 - hit_rate)
  Hit time [CPU cycles]
  Miss time [CPU cycles]
  Hit bandwidth
  Miss bandwidth
  Write strategy
  ….

How to rate architecture performance?

  Marketing:
    Frequency / Number of cores…
  Architecture “goodness”:
    CPI = Cycles Per Instruction
    IPC = Instructions Per Cycle
  Benchmarking:
    SPEC-fp, SPEC-int, …
    TPC-C, TPC-D, …

Cache performance example

Assumption:
  Infinite bandwidth
  A perfect 1.0 Cycles Per Instruction (CPI) CPU
  100% instruction cache hit rate

Total number of cycles =
  #Instr * ((1 - mem_ratio) * 1 + mem_ratio * avg_mem_accesstime) =
  #Instr * ((1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time))

CPI = 1 - mem_ratio + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time)

Example Numbers

CPI = 1 - mem_ratio + mem_ratio * (hit_rate * hit_time) + mem_ratio * (1 - hit_rate) * miss_time

  mem_ratio = 0.25
  hit_rate = 0.85
  hit_time = 3
  miss_time = 100

CPI = 0.75 + 0.25 * 0.85 * 3 + 0.25 * 0.15 * 100
    = 0.75 (CPU) + 0.64 (HIT) + 3.75 (MISS) = 5.14

What if ...

CPI = 1 - mem_ratio + mem_ratio * (hit_rate * hit_time) + mem_ratio * (1 - hit_rate) * miss_time

  mem_ratio = 0.25
  hit_rate = 0.85
  hit_time = 3
  miss_time = 100

                              CPU    HIT    MISS
  Baseline                ==> 0.75 + 0.64 + 3.75 = 5.14
  Twice as fast CPU       ==> 0.37 + 0.64 + 3.75 = 4.76
  Faster memory (70c)     ==> 0.75 + 0.64 + 2.62 = 4.01
  Improve hit_rate (0.95) ==> 0.75 + 0.71 + 1.25 = 2.71
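The what-if cases are easy to recompute. The sketch below implements the CPI formula as a function; the parameter `base_cpi` is an assumption added here to model the "twice as fast CPU" case (it scales only the non-memory term), and the function name `cpi` is illustrative.

```python
# Sketch of the CPI model from the slides. base_cpi is an added, hypothetical
# parameter: cycles per non-memory instruction (1.0 for the "perfect" CPU).
def cpi(mem_ratio, hit_rate, hit_time, miss_time, base_cpi=1.0):
    cpu = (1 - mem_ratio) * base_cpi                # non-memory instructions
    hit = mem_ratio * hit_rate * hit_time           # memory accesses that hit
    miss = mem_ratio * (1 - hit_rate) * miss_time   # memory accesses that miss
    return cpu + hit + miss

baseline = cpi(0.25, 0.85, 3, 100)
fast_cpu = cpi(0.25, 0.85, 3, 100, base_cpi=0.5)
fast_mem = cpi(0.25, 0.85, 3, 70)
better_hit = cpi(0.25, 0.95, 3, 100)
for label, value in [("baseline", baseline), ("2x CPU", fast_cpu),
                     ("70c memory", fast_mem), ("hit_rate 0.95", better_hit)]:
    print(f"{label:14s} CPI = {value:.4f}")
```

The results agree with the slide's numbers up to rounding, and they make the point of the comparison explicit: the miss term dominates, so raising the hit rate helps far more than a faster CPU.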

How to get more effective caches:

  Larger cache (more capacity)
  Cache block size (larger cache lines)
  More placement choice (more associativity)
  Innovative caches (victim, skewed, …)
  Cache hierarchies (L1, L2, L3, CMR)
  Latency-hiding (weaker memory models)
  Latency-avoiding (prefetching)
  Cache avoiding (cache bypass)
  Optimized application/compiler
  …

Why do you miss in a cache?

  Mark Hill’s three “Cs”
    Compulsory miss (touching data for the first time)
    Capacity miss (the cache is too small)
    Conflict misses (non-ideal cache implementation)
      (too many names starting with “H”)

  (Multiprocessors)
    Communication (imposed by communication)
    False sharing (side-effect from large cache blocks)

Avoiding Capacity Misses – a huge address book
Lots of pages. One entry per page.

[Figure: an address book indexed by the first two letters (…, ÖT, ÖU, ÖV, ÖX, ÖY, ÖZ, ÖÅ, ÖÄ, ÖÖ). The new indexing function selects page ÖU, which holds the “Address Tag” LING and the “Data” 12345.]

One entry per page => direct-mapped cache with 28² = 784 entries

Cache Organization
1MB, direct mapped

[Figure: the 32 bit address 00100110000101001010011010100011 (msb … lsb) is split into an address tag (12 bits), an index (18 bits) and 2 bits that identify the byte within a word; the index selects one of 256k entries; Hit? (1) = valid (1) & (stored tag = address tag); Data is 32 bits.]

Overhead: 13/32 = 40%
Latency = SRAM + CMP + AND

Pros/Cons Large Caches

  ++ The safest way to get improved hit rate
  -- SRAMs are very expensive!!
  -- Larger size ==> slower speed
       more load on “signals”
       longer distances
  -- (power consumption)
  -- (reliability)

Why do you hit in a cache?

  Temporal locality
    Likely to access the same data again soon
  Spatial locality
    Likely to access nearby data again soon

Typical access pattern:
  (inner loop stepping through an array)
  A, B, C, A+1, B, C, A+2, B, C, ...

  temporal: B and C are accessed again and again
  spatial: A, A+1, A+2 are neighbours

Fetch more than a word: cache blocks (a.k.a. cache line)
1MB, direct mapped, CacheLine = 16B

[Figure: the 32 bit address 00100110000101001010011010100011 (msb … lsb) is split into an address tag (12 bits), an index (16 bits), 2 bits that identify the word within a cache line and 2 bits that identify a byte within a word; the index selects one of 64k entries, each with a valid bit (1), a tag (12) and four 32-bit words (= 128 bits); a 4:1 multiplexer, steered by the 2 select bits, picks the 32-bit Data word; Hit? (1) = valid & (stored tag = address tag).]

Overhead: 13/128 = 10%
Latency = SRAM + CMP + AND

Example in Class
Direct mapped cache:
  Cache size = 64 kB
  Cache line = 16 B
  Word size = 4B
  32 bits address (byte addressable)

“There are 10 kinds of people in the world: Those who understand binary numbers and those who do not.”

Pros/Cons Large Cache Lines
++ Exploits spatial locality
++ Fits well with modern DRAMs
     * first DRAM access slow
     * subsequent accesses fast (“page mode”)
-- Poor usage of SRAM & BW for some patterns
-- Higher miss penalty (fix: critical word first)
-- (False sharing in multiprocessors)

[Figure: Perf as a function of cache line size]


UART: StatCache Graph
app=matrix multiply

Small cache: Short cache lines are better
Large caches: Longer cache lines are better
Huge caches: Everything fits regardless of CL size

Note: this is just a single example, but the conclusion typically holds for most applications.

Thanks: Dr. Erik Berg

Cache Conflicts
Typical access pattern: (inner loop stepping through an array)
  A, B, C, A+1, B, C, A+2, B, C, …
  (B, C: temporal; A, A+1, A+2: spatial)

What if B and C index to the same cache location?
  Conflict misses -- big time!
  Potential performance loss 10-100x
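The effect of B and C indexing to the same location can be reproduced with a toy simulator. The sketch below is an illustration, not the course's tool: `count_misses`, the line numbers `B` and `C`, and the tiny 8-line cache are all assumptions chosen so that the result is easy to check by hand.

```python
from collections import OrderedDict

def count_misses(trace, num_sets, ways):
    """Count misses for a trace of cache-line numbers in a set-associative
    cache with LRU replacement (ways=1 gives a direct-mapped cache)."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for line in trace:
        s = sets[line % num_sets]       # index = low bits of the line number
        if line in s:
            s.move_to_end(line)         # hit: mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)   # evict the least recently used line
            s[line] = True
    return misses

# Hypothetical line numbers: B and C both map to set 0.
B, C = 64, 128
# A steps through an array, four words per line, touching lines 1, 2 and 3.
trace = [x for i in range(12) for x in (1 + i // 4, B, C)]

print(count_misses(trace, num_sets=8, ways=1))  # direct mapped, 8 lines
print(count_misses(trace, num_sets=4, ways=2))  # 2-way, same capacity
```

With the same capacity, the direct-mapped configuration misses on every access to B and C (27 misses in total), while the 2-way configuration only takes the 5 cold misses.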

Address Book Cache
Two names per page: index first, then search.

[Figure: address book with letter tabs. Looking up TOMMY: the first letter T is the index that selects a page; the rest of the name, OMMY, is compared (EQ?) with the two names stored on that page, OMAS (23457) and OMMY (12345)]

Avoiding conflict: More associativity
1MB, 2-way set-associative, CL=4B

32 bit address (msb ... lsb): 00100110000101001010011010100011
  Addr tag (13) | index (17) | identifies a byte within a word (2)

128k “sets”; one “set” holds two entries, each with Valid (1), Addr tag (13), Data (32)
Both tags are compared (=) in parallel and ANDed (&) with their valid bits
Select “logic” drives the multiplexer (2:1 mux): Hit? (1), Data (32)

Latency = SRAM+CMP+AND+LOGIC+MUX
How should the select signal be produced?

Pros/Cons Associativity
++ Avoids conflict misses
-- Slower access time
-- More complex implementation
     comparators, muxes, ...
-- Requires more pins (for external SRAM…)

Going all the way…!
1MB, fully associative, CL=16B

32 bit address (msb ... lsb): 00100110000101001010011010100011
  Addr tag (28) | index (0) | identifies the word within a cache line (2) | identifies a byte within a word (2)

One “set”: 64k entries, each with Valid (1), Addr tag, Data 16B
64k comparators (=), each ANDed (&) with its valid bit
Select “logic” (16) drives the multiplexer (256k:1 mux): Hit?, Data 4B


Fully Associative
Very expensive
Only used for small caches (and sometimes TLBs)

CAM = Contents-addressable memory
  ~Fully-associative cache storing key+data
  Provide key to CAM and get the associated data

A combination thereof
1MB, 2-way, CL=16B

32 bit address (msb ... lsb): 00100110000101001010011010100011
  Addr tag (13) | index (15) | identifies the word within a cache line (2) | identifies a byte within a word (2)

32k “sets”; each set holds two entries, each with Valid (1), Addr tag (13), Data (128)
Select “logic” and the word-select bits (2) drive the multiplexer (8:1 mux) over (256) bits: Hit? (1), Data (32)

Example in Class
  Cache size = 2 MB
  Cache line = 64 B
  Word size = 8B (64 bits)
  4-way set associative
  32 bits address (byte addressable)
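The field widths for in-class examples like this one follow directly from the cache parameters. A minimal sketch, assuming all sizes are powers of two; the helper name `cache_fields` is made up for illustration.

```python
def cache_fields(cache_bytes, line_bytes, ways, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache.
    Assumes cache size, line size and associativity are powers of two."""
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 2 MB, 64 B lines, 4-way, 32-bit addresses.
print(cache_fields(2 * 1024 * 1024, 64, 4))
# 64 kB, 16 B lines, direct mapped (the earlier in-class example).
print(cache_fields(64 * 1024, 16, 1))
```

Here `offset_bits` covers both the word-within-line and byte-within-word fields; with 8B words and 64B lines, that is 3 + 3 bits. The same function reproduces the widths on the earlier slides, for example `cache_fields(2**20, 16, 2)` gives a 13-bit tag and a 15-bit index.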

Who to replace?
Picking a “victim”

Least-recently used (aka LRU)
  Considered the “best” algorithm (which is not always true…)
  Only practical up to limited number of ways
Not most recently used
  Remember who used it last: 8-way -> 3 bits/CL
Pseudo-LRU
  E.g., based on coarse time stamps.
  Used in the VM system
Random replacement
  Can’t continuously have “bad luck”...
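One case where LRU is not the “best” choice is a cyclic sweep over slightly more lines than a set can hold. The sketch below models a single 4-way set; `hits_in_one_set`, the trace and the seed are assumptions made for illustration only.

```python
import random

def hits_in_one_set(trace, ways, policy, rng):
    """Count hits in a single cache set under 'lru' or 'random' replacement."""
    lines = []                      # resident lines, least recently used first
    hits = 0
    for line in trace:
        if line in lines:
            hits += 1
            lines.remove(line)
            lines.append(line)      # move to the most-recently-used end
        else:
            if len(lines) == ways:
                victim = 0 if policy == "lru" else rng.randrange(ways)
                lines.pop(victim)
            lines.append(line)
    return hits

# Cyclic sweep over 5 lines that all map to one 4-way set.
trace = [i % 5 for i in range(1000)]
rng = random.Random(0)              # fixed seed, chosen arbitrarily
lru_hits = hits_in_one_set(trace, 4, "lru", rng)
random_hits = hits_in_one_set(trace, 4, "random", rng)
print("LRU hits:", lru_hits, "Random hits:", random_hits)
```

LRU always evicts exactly the line that is needed next, so it never hits on this pattern, while random replacement keeps some of the lines around by chance.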

Cache Model: Random vs. LRU

[Figure: LRU and Random replacement compared in two panels: art (SPEC 2000) and equake (SPEC 2000)]

4-way sub-blocked cache
1MB, direct mapped, Block=64B, sub-block=16B

32 bit address (msb ... lsb): 00100110000101001010011010100011
  Addr tag (12) | index (14) | sub block within a block (2) | identifies the word within a cache line (2) | identifies a byte within a word (2)

16k entries, each holding: Addr tag (12), one valid bit per sub-block (4), Data 4 x (128) = 512 bits
Hit? = tag match (=) & valid bit of the selected sub-block
16:1 mux selects the word: Data (32)

Mem Overhead: 16/512 = 3%
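The overhead figures quoted on the cache organization slides can be recomputed as bookkeeping bits divided by data bits per entry. A small sketch; the helper `overhead` is illustrative, and the split of the sub-blocked entry into 12 tag bits plus 4 valid bits is an assumption consistent with the 16/512 on the slide.

```python
def overhead(tag_bits, valid_bits, data_bits):
    """Bookkeeping bits (tag + valid) per entry, relative to its data bits."""
    return (tag_bits + valid_bits) / data_bits

# (tag bits, valid bits, data bits) per entry, taken from the slides.
configs = {
    "direct mapped, CL=4B": (12, 1, 32),
    "direct mapped, CL=16B": (12, 1, 128),
    "sub-blocked, block=64B": (12, 4, 512),   # assumed: one valid bit per sub-block
}
for name, (tag, valid, data) in configs.items():
    print(f"{name}: {overhead(tag, valid, data):.1%}")
```

The ratios come out at roughly 40%, 10% and 3%, matching the slides: longer lines and sub-blocking amortize one tag over more data.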


Pros/Cons Sub-blocking
++ Lowers the memory overhead
++ (Avoids problems with false sharing -- MP)
++ Avoids problems with bandwidth waste
-- Will not exploit as much spatial locality
-- Still poor utilization of SRAM
-- Fewer sparse “things” allocated

Replacing dirty cache lines
Write-back
  Write dirty data back to memory (next level) at replacement
  A “dirty bit” indicates an altered cache line
Write-through
  Always write through to the next level (as well)
  ==> data will never be dirty ==> no write-backs
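The traffic difference between the two policies can be sketched with a one-line cache. This is a toy model under stated assumptions (write-allocate, a single resident line, made-up store trace), and `next_level_writes` is a hypothetical helper.

```python
def next_level_writes(stores, policy):
    """Count writes sent to the next level for a cache holding one line.
    `stores` is a list of line numbers being stored to (write-allocate assumed)."""
    resident, dirty, writes = None, False, 0
    for line in stores:
        if line != resident:
            if policy == "write-back" and dirty:
                writes += 1          # write the dirty victim back at replacement
            resident, dirty = line, False
        if policy == "write-through":
            writes += 1              # every store also goes to the next level
        else:
            dirty = True             # write-back: only set the dirty bit
    return writes

stores = [7] * 10 + [8] * 10 + [7] * 10   # three bursts of stores
print(next_level_writes(stores, "write-through"))
print(next_level_writes(stores, "write-back"))
```

Write-through sends all 30 stores onward; write-back sends only the two dirty lines that get replaced (the last line is still dirty in the cache when the trace ends).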

Write Buffer/Store Buffer

Do not need the old value for a store

[Figure: CPU stores enter the write buffer (WB) in front of the cache; load addresses are compared (==) against the WB entries]

One option: Write around (no write allocate in caches), used for lower level smaller caches

Innovative cache: Victim cache

[Figure: the CPU sends the address to both the Cache and the VC; each returns hit and data (a word); misses go to Memory]

Victim Cache (VC): a small, fairly associative cache (~10s of entries)
Lookup: search cache and VC in parallel
Cache replacement: move victim to the VC and replace in VC
VC hit: swap VC data with the corresponding data in Cache
“A second life ☺”
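The lookup, replacement and swap rules above can be sketched as a small simulator. This is an illustration under assumptions not stated on the slide: the main cache is direct mapped, the VC is fully associative with FIFO replacement, and the class and attribute names are invented.

```python
from collections import OrderedDict

class VictimCachedDM:
    """Direct-mapped cache backed by a small fully associative victim cache."""
    def __init__(self, num_sets, vc_entries):
        self.sets = [None] * num_sets
        self.vc = OrderedDict()          # oldest victim first (FIFO, assumed)
        self.vc_entries = vc_entries
        self.hits = self.vc_hits = self.misses = 0

    def access(self, line):
        idx = line % len(self.sets)
        resident = self.sets[idx]
        if resident == line:
            self.hits += 1
            return
        if line in self.vc:
            self.vc_hits += 1
            del self.vc[line]            # swap: the line moves into the main cache
        else:
            self.misses += 1             # fetch from memory
        self.sets[idx] = line
        if resident is not None:
            self.vc[resident] = True     # the victim gets "a second life"
            if len(self.vc) > self.vc_entries:
                self.vc.popitem(last=False)   # replace the oldest victim

# B and C (hypothetical line numbers) conflict in an 8-set direct-mapped cache.
B, C = 64, 128
plain, with_vc = VictimCachedDM(8, 0), VictimCachedDM(8, 4)
for line in [B, C] * 12:
    plain.access(line)
    with_vc.access(line)
print(plain.misses, with_vc.misses, with_vc.vc_hits)
```

Without a VC every access misses (24 misses); with a 4-entry VC only the two cold misses go to memory and the remaining 22 accesses are served by swapping with the VC.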

Skewed Associative Cache
A, B and C have a three-way conflict

[Figure: placement of A, B and C in a 2-way, a 4-way and a 2-way skewed cache]

It has been shown that 2-way skewed performs roughly the same as 4-way caches

Skewed-associative cache: Different indexing functions

32 bit address (msb ... lsb): 00100110000101001010011010100011
  identifies the byte within a word (2)

Two ways of 128k entries each; each entry has Valid (1), Addr tag (13), Data (32)
Way 1 index (17) = f1 applied to (>18) address bits
Way 2 index (17) = f2 applied to (>18) address bits
Tags are compared (=) and ANDed (&) with the valid bits; a 2:1 mux selects Data (32)
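The idea of giving each way its own index function can be sketched as follows. The slides do not specify f1 and f2, so the XOR-based `f2` below is purely a hypothetical choice for illustration, as are the three colliding line addresses.

```python
INDEX_BITS = 17                      # 128k entries per way, as on the slide
MASK = (1 << INDEX_BITS) - 1

def f1(line):
    """Way 1: conventional indexing with the low line-address bits."""
    return line & MASK

def f2(line):
    """Way 2: XOR the low bits with the next-higher bits (hypothetical hash)."""
    return (line ^ (line >> INDEX_BITS)) & MASK

# Three line addresses that collide under conventional indexing.
A, B, C = (1 << INDEX_BITS) + 5, (2 << INDEX_BITS) + 5, (3 << INDEX_BITS) + 5
print([f1(x) for x in (A, B, C)])    # the same index three times
print([f2(x) for x in (A, B, C)])    # three different indices
```

A, B and C fight over one location in way 1, but land in three different locations in way 2, so a 2-way skewed cache can hold more of them at once than a conventional 2-way cache, which offers only two places for the three lines.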


UART: Elbow cache
Increase “associativity” when needed

If severe conflict: make room

[Figure: A, B and C in conflict; room is made for C]

Performs roughly the same as an 8-way cache
Slightly faster
Uses much less power!!

Topology of caches: Harvard Arch

CPU needs a new instruction each cycle
25% of instructions are LD/ST
Data and Instr. have different access patterns

==> Separate D and I first level cache
==> Unified 2nd and 3rd level caches

Cache Hierarchy of Today

DRAM Memory
L3$    Off-chip SRAM (today more commonly on-chip)
L2$    on-chip
L1 D$  on-chip        L1 I$  on-chip
CPU

Use whatever transistors there are left for the Last-Level Cache (LLC)
L1: Small enough to keep up with the CPU speed
Separate I cache to allow for instruction fetch and data fetch in parallel

HW prefetching
...a little green man that anticipates your next memory access and prefetches the data to the cache.

Improves MLP!
Sequential prefetching: Sequential streams [to a page]. Some number of prefetch streams supported. Often only for L2 and L3.
PC-based prefetching: Detects strides from the same PC. Often also for L1.
Adjacent prefetching: On a miss, also bring in the “next” cache line. Often only for L2 and L3.

Hardware prefetching
Hardware “monitor” looking for patterns in memory accesses
Brings data of anticipated future accesses into the cache prior to their usage
Two major types:
  Sequential prefetching (typically page-based, 2nd level cache and higher). Detects sequential cache lines missing in the cache.
  PC-based prefetching, integrated with the pipeline. Finds per-PC strides. Can find more complicated patterns.
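A per-PC stride detector of the kind described above can be sketched in a few lines. This is a simplified illustration, not a description of any real prefetcher: the table format, the "same stride twice" rule and the name `StridePrefetcher` are all assumptions.

```python
class StridePrefetcher:
    """Per-PC stride detector: predicts the next address once the same
    stride has been seen twice in a row from the same load instruction."""
    def __init__(self):
        self.table = {}                  # pc -> (last_addr, last_stride)

    def observe(self, pc, addr):
        """Record an access; return an address to prefetch, or None."""
        prediction = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prediction = addr + stride   # stride confirmed: run ahead
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)       # first time this PC is seen
        return prediction

pf = StridePrefetcher()
# A hypothetical load at PC 0x400 walking an array with a 64-byte stride.
predictions = [pf.observe(0x400, a) for a in (1000, 1064, 1128, 1192)]
print(predictions)
```

The first two accesses only train the table; from the third access on, the detector proposes the address one stride ahead. Keying the table on the PC keeps interleaved streams from different load instructions from disturbing each other.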

Cache Capacity/Latency/BW


Cache implementation

Cacheline, here 64B: AT | S | Data = 64B

Generic Cache: Addr [63..0] (MSB ... LSB) supplies the index into the SRAM;
the tags of all ways in the indexed set are compared (=), producing Hit and Sel way “6” for the mux, which delivers Data = 64B

Caches at all levels roughly work like this:
  L3  24MB
  L2  256kB
  D1  64kB
  I1  64kB

Address Book Analogy
Two names per page: index first, then search.

[Figure: address book with letter tabs. Looking up TOMMY: the first letter T is the index that selects a page; the rest of the name, OMMY, is compared (EQ?) with the two names stored on that page, OMAS (23457) and OMMY (12345)]

Select the second entry!

Cache lingo
Cacheline: Data chunk moved to/from a cache
Cache set: Fraction of the cache identified by the index
Associativity: Number of alternative storage places for a cacheline
Replacement policy: picking the victim to throw out from a set (LRU/Random/Nehalem)
Temporal locality: Likelihood to access the same data again soon
Spatial locality: Likelihood to access nearby data again soon

Typical access pattern: (inner loop stepping through an array)
  A, B, C, A+4, B, C, A+8, B, C, ...
  (B, C: temporal; A, A+4, A+8: spatial)

Cache Lingo Picture

Cacheline, here 64B: AT | S | Data = 64B

Generic Cache: Addr [63..0] (MSB ... LSB) is split into tag and index
Cache Set: the entries of the SRAM selected by the index
Associativity = 8-way
The tag comparison produces Hit and Sel way “6” for the mux, which delivers Data = 64B

Example: Nehalem i7 (one example)

32kB    8-way   pLRU
256kB   8-way   pLRU       non-incl
8MB     16-way  Neh.-repl  non-incl

Take-away message: Caches

Caches are fast but small
Cache space is handed out in cache-line chunks (~64 bytes)
LSB part of the address is used to find the “set” (aka indexing)
There is a limited number of cache lines per set (associativity)
Typically, several levels of caches
The most important target for optimizations


How are we doing?

Create and exploit locality:
  a) Spatial locality
  b) Temporal locality
  c) Geographical locality

Create and exploit parallelism:
  a) Instruction level parallelism (ILP)
  b) Thread level parallelism (TLP)
  c) Memory level parallelism (MLP)