Upload
doannhi
View
216
Download
3
Embed Size (px)
Citation preview
Welcome to AVDARK
E ik H tErik HagerstenUppsala Universitypp y
Ericsson, CPU designer 1982-84: APZ212MIT 1984-85: Dataflow parallel architectureEricsson computer science lab 1985-1988 NetInsight, (Erlang)SICS, parallel architectures 1988-1993 COMA, (Virtutech)S h f h 993 999Sun Microsystems, chief architect servers 1993 – 1999
WildFire, E6000, E15000, E25000Professor Uppsala University 1999 – New modeling + Acumemo o Upp a a U y od g uStartup Acumem 2006 – 2010 ThreadSpotterChief scientist Rogue Wave Software 2011 –
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh2
AVDARK2011
2
Goal for this courseUnderstand how and why modern computer systems are designed the way the are:
pipelinesmemory organizationvirtual/physical memory ...
Understand how and why multiprocessors are builtCache coherenceMemory modelsSynchronization…
Understand how and why parallelism is created andInstruction-level parallelismMemory-level parallelismThread-level parallelism…
Understand how and why multiprocessors of combined SIMD/MIMD type are builtGPUVector processing…
Understand how computer systems are adopted to different usage areas
AVDARK
General-purpose processorsEmbedded/network processors…
Understand the physical limitation of modern computers
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh3
AVDARK2011
BandwidthEnergyCooling…
AVDARK on the webAVDARK on the web
www it uu se/edu/course/homepage/avdark/ht11www.it.uu.se/edu/course/homepage/avdark/ht11
Welcome!Welcome!NewsFAQQScheduleSlidesN PNew PapersAssignmentsReading instr 4:ed
AVDARK
Reading instr 4:edExam
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh4
AVDARK2011
AVDARK in a nutshellLiterature Computer Architecture A Quantitative Approach (4th edition)
John Hennesey & David Pattersson
Lecturer Erik Hagersten gives most lectures and is responsible for the courseAndreas Sandberg is responsible for the labs and the hand-ins.Jakob Carlström from Xelerated will teach network processorsJakob Carlström from Xelerated will teach network processors.Sverker Holmgren will teach parallel programming.David Black-Schaffer will teach about graphics processors.
Mandatory Assignment
There are four lab assignments that all participants have to complete before a hard deadline. Each can earn you a bonus point
Optional Assignment
There are four (optional) hand-in assignments. Each can earn you a bonus point
AVDARK
Examination Written exam at the end of the course. No books are allowed.
Bonus system 64p max/32p to pass For each bonus point there is a
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh5
AVDARK2011
Bonus system 64p max/32p to pass. For each bonus point, there is a corresponding question 4p bonus question. Full bonus Pass.
Schedule in a nutshellSchedule in a nutshell1. Memory Systems (~Appendix C in 4th Ed)
Caches, VM, DRAM, microbenchmarks, optimizing SW
2. Multiprocessors2. MultiprocessorsTLP: coherence, memory models, interconnects, scalability, clusters, …
3 Scalable Multiprocessors3. Scalable MultiprocessorsScalability, synchornization, clusters, …
4. CPUsILP: pipelines, scheduling, superscalars, VLIWs, Vecotor instructions…
AVDARK
5. Widening + Future (~Chapter 1 in 4th Ed)Technology impact, GPUs, Network processors, Multicores (!!)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh6
AVDARK2011
Exam and bonus structureExam and bonus structure4 Mandatory labs4 Mandatory labs4 Hand-in (optional)Written Exam
How to get a bonus point:Complete extra bonus activity at lab occationComplete extra bonus activity at lab occationComplete optional bonus hand-in [with a reasonable accuracy] befor a hard dealine
AVDARK
reasonable accuracy] befor a hard dealine
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh7
AVDARK2011 32p/64p at the exam = PASS
Crash Course inCrash Course inComputer ArchitectureComputer Architecture
(covering the course in 45 min)
E ik H tErik HagerstenUppsala Universitypp y
≈30 APZ 212 @ 5MH≈30 years ago: APZ 212 @ 5MHz”the AXE supercomputer”
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh9
AVDARK2011
APZ 212APZ 212marketing brochure quotes:
”Very compact”6 times the performancep1/6:th the size1/5 the power consumption
”A b kth h i t i ””A breakthrough in computer science””Why more CPU power?””All the power needed for future development”All the power needed for future development”…800,000 BHCA, should that ever be needed””SPC computer science at its most elegance”
AVDARK
SPC computer science at its most elegance”Using 64 kbit memory chips””1500W power consumption
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh10
AVDARK2011
1500W power consumption
CPU ImprovementspRelative Performance[log scale]
1000
100
1010
AVDARK
1970 1980 1990 2000 Year
1
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh11
AVDARK2011
How to get efficient architectures…g
Higher clock rateHigher clock rateCreate and explore locality:
) S ti l l lit a) Spatial locality b) Temporal locality
G hi l l litc) Geographical locality
Create and explore parallelismp pa) Instruction level parallelism (ILP)b) Thread level parallelism (TLP)
AVDARK
) p ( )c) Memory level parallelism (MLP)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh12
AVDARK2011
Load/Store architecture (e.g., ”RISC”)Load/Store architecture (e.g., RISC )ALU ops: Reg -->RegMem ops: Reg <--> MemMem ops: Reg < > Mem
ExplicitRegisters
Mem
RegistersLoad/Store
Three Regs:Source1
ALU
Source1Source2DestinationLoad R1, [A]
AVDARK
MemoryAccesses
Example: C = A + B, [ ]
Load R3, [B]Add R2, R1, R3S R2 [C]
Compiler
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh13
AVDARK2011
AccessesStore R2, [C]
Lifting the CPU hood (simplified…)Lifting the CPU hood (simplified…)
DInstructions:
DCBAA
CPU
Mem
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh14
AVDARK2011
PipelinePipeline
DInstructions:
DCBAA
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh15
AVDARK2011
PipelinePipeline
AA
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh16
AVDARK2011
PipelinePipeline
AA
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh17
AVDARK2011
PipelinePipeline
AA
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh18
AVDARK2011
Pipeline:Pipeline:
AA
I R X WRegs
I = Instruction fetchR = Read registerX = ExecuteW= Write register/mem
Mem
Regs W= Write register/mem
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh19
AVDARK2011
Example ALU operation:
R1 := R2 op R3(ADD R1, R2, R3)
AA OPe.g., +, -, *, /Ifetch
I R X WRegs
23 1
I = Instruction fetchR = Read registerX = ExecuteW= Write register/mem
Mem
Regs W= Write register/mem
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh20
AVDARK2011
InitiallyInitially
D IF RegC < 100 GOTO ADCBA LD RegA (100 + RegC)
IF RegC < 100 GOTO A
RegB := RegA + 1
RegC := RegC + 1
A LD RegA, (100 + RegC)
PC
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh21
AVDARK2011
Cycle 1Cycle 1
D IF RegC < 100 GOTO ADCBA LD RegA (100 + RegC)
IF RegC < 100 GOTO A
RegB := RegA + 1
RegC := RegC + 1
PC A LD RegA, (100 + RegC)PC
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh22
AVDARK2011
Cycle 2Cycle 2
D IF RegC < 100 GOTO ADCBA LD RegA (100 + RegC)
IF RegC < 100 GOTO A
RegB := RegA + 1
RegC := RegC + 1
PCA LD RegA, (100 + RegC)
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh23
AVDARK2011
Cycle 3Cycle 3
D IF RegC < 100 GOTO ADCBA LD RegA (100 + RegC)
IF RegC < 100 GOTO A
RegB := RegA + 1
RegC := RegC + 1 PC
A LD RegA, (100 + RegC)
+
I R X WRegs
+
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh24
AVDARK2011
Cycle 4Cycle 4
D IF RegC < 100 GOTO APC DCBA LD RegA (100 + RegC)
IF RegC < 100 GOTO A
RegB := RegA + 1
RegC := RegC + 1 PC
A LD RegA, (100 + RegC)
+I R X WRegs
+
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh25
AVDARK2011
Cycle 5Cycle 5
D IF RegC < 100 GOTO ADCBA LD RegA (100 + RegC)
IF RegC < 100 GOTO A
RegB := RegA + 1
RegC := RegC + 1
PC A LD RegA, (100 + RegC)PC
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh26
AVDARK2011
Data dependency Previous execution example wrong! Previous execution example wrong!
D IF RegC < 100 GOTO ADCBA LD RegA (100 + RegC)
IF RegC < 100 GOTO A
RegB := RegA + 1
RegC := RegC + 1
A LD RegA, (100 + RegC)
I R X WRegs
Mem
Regs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh27
AVDARK2011
Data dependency fix: pipeline delays
D IF RegC < 100 GOTO A
”Stall”
CB R B R A 1
RegC := RegC + 1
”Stall”
PC B RegB := RegA + 1
”Stall”
”Stall”
PC
A
I R X W
LD RegA, (100 + RegC)
AVDARK
I R X WRegs
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh28
AVDARK2011 Mem
Branch Delay: Each Iteration takes 11c
”Stall”
”Stall”PC
D IF RegC < 100 GOTO A
”Stall”
”Stall”
CB R B R A 1
RegC := RegC + 1
”Stall”
B RegB := RegA + 1
”Stall”
”Stall”
A
I R X W
LD RegA, (100 + RegC)
Need to find Instruction-
AVDARK
I R X WRegs PC := next instruction address
Need to find InstructionLevel Parallelism (ILP) to avoid stalls!
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh29
AVDARK2011 Mem
It is actually a lot worse!yModern CPUs:”superscalars” with ~4 parallel pipelines
I R X W
I R X W
I R X W
Issue
I R X W
I R X Wlogic
Mem
Regs
N d t fi d 4 ILPMem
+Higher throughputMore complicated architecture
Need to find 4x more ILP
AVDARK
- More complicated architecture- Branch delay more expensive (more instr. missed)- Harder to find ”enough” independent instr. (need 8
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh30
AVDARK2011
Harder to find enough independent instr. (need 8 instr. between write and use)
It is actually a lot worse:Modern CPUs: ~10-20 pilepline stagesModern CPUs: ~10-20 pilepline stages
I R B M MWI
I R B M MW
I R B M MW
I
IIssuelogic
I R
I R B M MW
I
Ilogic
M
RegsMore pipeline stages can run at higher freq ☺
MemBut will need to find ~10x more ILP
AVDARK
+Shorter cycletime (higher GHz)- Branch delay even more expensive
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh31
AVDARK2011 - Even harder to find ”enough” independent instr.
Fix: Out-of order execution:Improving ILPImproving ILP
LD R1, M(100)ADD R3, R2, R1
The HW may execute instructions ina different order, but will make the ”side-effects” of the instructions appear in order.
SUB R5, R6, R7ST R5, M(100)
instructions appear in order.
Assume that LD takes a long time.The ADD is dependent on the LD Start the SUB and ST before the ADDUpdate R5 and M(100) after R3
>=0?YN
Update R5 and M(100) after R3
LD ...ADD ...SUB ...ST
AVDARK
ST ...
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh32
AVDARK2011
Fix: Branch prediction
LD R1, M(100)ADD R3, R2, R1
The HW can guess if the branch istaken or not and avoid branch stalls
SUB R5, R6, R7ST R5, M(100)
if the guess is correct,
Assume the guess is ”Y”.
>=0?YN
The HW can start to execute theseinstruction before the outcome the the branch is known, but cannot allow any ”side-effect” t t k l til th t i kto take place until the outcome is known
LD ...ADD ...SUB ...ST
AVDARK
ST ...
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh33
AVDARK2011
Fix: Scheduling Past BranchesImproving ILPImproving ILP
LDADD
P edicted pathSUBST
”Predict taken”All instructions along thepredicted path can be
Predicted path
>=0?LDADDSUBST
YPredict taken predicted path can be
executedout-of-order
>1?
Y
”Predict taken”
LDADDSUB
LDADDSUBST
AVDARK
SUSTST
Y”Predict taken”
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh34
AVDARK2011 <2?=0?
Fix: Scheduling Past BranchesImproving ILP
LDADD
Act al path!
Improving ILP
SUBST
Actual path!
>=0?LDADDSUBST
Y
>1?
YWrong Prediction!!!
LDADDSUB
LDADDSUBST
Throw awayi.e., no sideeffects!
AVDARK
SUSTST
Y
effects!
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh35
AVDARK2011 <2?=0?
It is actually a lot worse:Modern MEM: ~200 CPU cyclesModern MEM: ~200 CPU cycles
I R B M MWI
I R B M MW
I R B M MW
I
IIssuelogic
I R
I R B M MW
I
Ilogic
Regs
200 cycles Need more ILP again!
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh36
AVDARK2011 Mem
Fix: Use cache(s)Fix: Use cache(s)
I R B M MWI
I R B M MW
I R B M MW
I
IIssuelogic
I R B M MW
I R B M MW
I
Ilogic
RegsStill need some more ILP, but also
<< 1GB$1-30 cycles
,Memory-Level Parallelism (MLP)
AVDARK 200 l 1GB
$
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh37
AVDARK2011 Mem200cycles 1GB
Woops, using too much power 2007
Running at 2x the frequency will use h th 2 th much more than 2x the power
It is also really hard to find enough ILP
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh38
AVDARK2011
Fix: Multicore
MMem
Mem I/F
ExternalI/F
But now we also need to find Thread-Level Parallelism (TLP)
$1 $1 $1 $1
L2$
AVDARK
CPU CPU CPU CPU
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh39
AVDARK2011 t treads
Example: Intel i7 ”Nehalem”Example: Intel i7 Nehalem
Coherence Coherence
DRAM
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh40
AVDARK2011
How is the silicon used (i7-Ex)?
3 Mbyte
AVDARK
Tag
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh41
AVDARK2011
Data array
Source: JSSC Jan 2010, Rusu et. al
How is the silicon used?Quick Path Interconnect
L3Ex
Ord
Cache L1 + L2Transl. $Branch $
Quick Path Interconnect
Memory Interface
Ex
Cache L1 + L2Transl. $Branch $
AVDARK
Tag
I decodeRe order
Re-orderBranch $
L3 Cache
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh42
AVDARK2011
Data array
Source: JSSC Jan 2010, Rusu et. al
I-decodeRe-order
Example: Intel i7 ”Nehalem”Example: Intel i7 NehalemCoherence = NUMA + funky memory models
Coherence Coherence
DRAMI/F
100
I/F
AVDARK
DRAM
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh43
AVDARK2011
What is computer architecture?
“Bridging the gap between programs and t i t ”transistors”
“Finding the best model to execute the programs” programs”
AVDARK
best={fast, cheap, energy-efficient, reliable, predictable, …}
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh44
AVDARK2011 …
Caches and more cachesCaches and more cachesor
spam, spam, spam and spam
Erik HagerstenErik HagerstenUppsala University, Sweden
Fix: Use a cacheFix: Use a cache
I R B M MWI
I R B M MW
I R B M MW
I
IIssuelogic
I R B M MW
I R B M MW
I
Ilogic
Regs
~32kB$~1 cycles
AVDARK 200 l 1GB
$
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh46
AVDARK2011 Mem200cycles 1GB
Webster about “cache”
1. cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press] together fr L coactare to compel frcoacticare to press] together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT 1a: ahiding place esp. for concealing and preserving provisions or g p p g p g pimplements 1b: a secure place of storage 2: something hidden or stored in a cache
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh47
AVDARK2011
Cache knowledge useful when...
Designing a new computerWriting an optimized program
or compilerpor operating system …
Implementing software cachingImplementing software cachingWeb cachesP o ie
AVDARK
ProxiesFile systems
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh48
AVDARK2011
Memory/storage
SRAM DRAM disk
sram
AVDARK
2000: 1ns 1ns 3ns 10ns 150ns 5 000 000ns1kB 64k 4MB 1GB 1 TB
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh49
AVDARK2011
(1982: 200ns 200ns 200ns 10 000 000ns)
Address Book CacheLooking for Tommy’s Telephone Number
UT
TOMMY 12345 Indexing
ZYXVUTOMMY 12345
ZYXV
Indexingfunction
ÖÄÅZ
ÖÄÅZ
“Address Tag” “Data”
AVDARK
Address Tag
One entry per page =>
Data
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh50
AVDARK2011 Direct-mapped caches with 28 entries
Address Book CacheLooking for Tommy’s Number
TOMMY
UT
OMMY 12345
TOMMYindex
ZYXVUOMMY 12345
ÖÄÅZ
EQ?
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh51
AVDARK2011
Address Book CacheLooking for Tomas’ Number
TOMAS
UT
OMMY 12345
TOMASindex
ZYXVUOMMY 12345
ÖÄÅZ
EQ?
AVDARKMiss!
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh52
AVDARK2011 Lookup Tomas’ number in
the telephone directory
Address Book CacheLooking for Tomas’ Number
TOMAS
UT
OMMY 12345
TOMASindexOMAS 23457
ZYXVUOMMY 12345
Z
ÖÄÅ
Replace TOMMY’s datawith TOMAS’ data
AVDARK
with TOMAS’ data. There is no other choice(direct mapped)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh53
AVDARK2011
(direct mapped)
CacheCacheaddress
haddress Memory
dd ess
CPUCache
hit
y
data (a word)
data
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh54
AVDARK2011
Cache Organizationg
CacheCache
OMAS 23457
TOMAS(1)
1 OMAS 23457index
(4) (4)
1
28 entries
(4) (4)Addrtag
=(1)Valid
AVDARK
&
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh55
AVDARK2011 Hit (1) Data (5 digits)
Cache Organization (really)4kB, direct mapped Ordinary
Memory
1
00100110000101001010011010100011
(?)0101001 0010011100101
ymsb lsb
index1
1k entries of 4 bytes each
(?)
0101001 0010011100101
32 bit address
”Byte”
What is a good
Addr
(?)(?)identifying
a byte in memory
a good index
function
=
Addrtag
(1)Valid
AVDARK ( )(32)&
(1)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh56
AVDARK2011 Hit?(1) Data
Cache Organization4kB di d4kB, direct mapped
32 bit address Identifies the byte within a word00100110000101001010011010100011
(10)
32 bit address Identifies the byte within a word
msb lsb
index1
1k entries of 4 bytes each
(10)0101001 0010011100101
each(20) (20)Mem
O h d
=
Addrtag
(1)ValidOverhead:
21/32= 66%
AVDARK(32)&
(1)Latency =SRAM+CMP+AND
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh57
AVDARK2011
Hit?(1) Data
CacheCacheaddress
haddress Memory
dd ess
CPUCache
hit
y
data (a word)
data
Hi U h d id d f h h
AVDARK
Hit: Use the data provided from the cache~Hit: Use data from memory and also store it in
the cache
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh58
AVDARK2011
the cache
Cache performance parameters
Cache “hit rate” [%]Cache “miss rate” [%] (= 1 hit rate) Cache miss rate [%] (= 1 - hit_rate) Hit time [CPU cycles]Miss time [CPU cycles]Miss time [CPU cycles]Hit bandwidthMiss bandwidthMiss bandwidthWrite strategy
AVDARK
….
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh59
AVDARK2011
How to rate architecture f ?performance?
Marketing:Frequency / Numbe of coresFrequency / Numbe of cores…
Architecture “goodness”: CPI = Cycles Per InstructionIPC = Instructions Per Cycley
Benchmarking:SPEC f SPEC i t
AVDARK
SPEC-fp, SPEC-int, …TPC-C, TPC-D, …
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh60
AVDARK2011
Cache performance exampleCache performance exampleAssumption:
fi i b d id hInfinite bandwidthA perfect 1.0 CyclesPerInstruction (CPI) CPU100% instruction cache hit rate100% instruction cache hit rate
Total number of cycles =#Instr * ( (1 - mem ratio) * 1 +#Instr. ( (1 mem_ratio) 1 +
mem_ratio * avg_mem_accesstime) =
= #Instr * ( (1 mem ratio) += #Instr * ( (1- mem_ratio) + mem_ratio * (hit_rate * hit_time +
(1 - hit_rate) * miss_time)
AVDARK
CPI = 1 -mem_ratio + mem_ratio * (hit_rate * hit_time +
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh61
AVDARK2011 (1 - hit_rate) * miss_time)
Example NumbersExample Numbers
CPI = 1 - mem_ratio + mem_ratio * (hit_rate * hit_time) +mem_ratio * (1 - hit_rate) * miss_time)
mem ratio = 0 25mem_ratio = 0.25hit_rate = 0.85hit time = 3hit_time 3miss_time = 100
AVDARK
CPI = 0.75 + 0.25 * 0.85 * 3 + 0.25 * 0.15 * 100 =
0 75 + 0 64 + 3 75 = 5 14
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh62
AVDARK2011
0.75 + 0.64 + 3.75 = 5.14CPU HIT MISS
What if ...
CPI = 1 -mem_ratio + mem ratio * (hit rate * hit time) +mem_ratio (hit_rate hit_time) +mem_ratio * (1 - hit_rate) * miss_time)
mem_ratio = 0.25hit_rate = 0.85hit ti 3 CPU HIT MISShit_time = 3 CPU HIT MISSmiss_time = 100 == > 0.75 + 0.64 + 3.75 = 5.14
•Twice as fast CPU ==> 0.37 + 0.64 + 3.75 = 4.77
AVDARK
•Faster memory (70c) ==> 0.75 + 0.64 + 2.62 = 4.01
I hit t (0 95) > 0 75 + 0 71 + 1 25 2 71
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh63
AVDARK2011
•Improve hit_rate (0.95) => 0.75 + 0.71 + 1.25 = 2.71
H t t ff ti hHow to get more effective caches:
Larger cache (more capacity)Cache block size (larger cache lines)( g )More placement choice (more associativity)Innovative caches (victim, skewed, …)Innovative caches (victim, skewed, …)Cache hierarchies (L1, L2, L3, CMR)Latency-hiding (weaker memory models)Latency hiding (weaker memory models)Latency-avoiding (prefetching)Cache avoiding (cache bypass)
AVDARK
Cache avoiding (cache bypass)Optimized application/compiler
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh64
AVDARK2011 …
Wh d i i hWhy do you miss in a cache
Mark Hill’s three “Cs”Compulsory miss (touching data for the first time)Capacity miss (the cache is too small)Conflict misses (non-ideal cache implementation)(too many names starting with “H”)
(Multiprocessors)Communication (imposed by communication)
AVDARK
False sharing (side-effect from large cache blocks)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh65
AVDARK2011
Avoiding Capacity Misses –a huge address booka huge address bookLots of pages. One entry per page.
ÖU
ÖT
LING 12345 New
Y
X
ÖV
ÖULING 12345
Ö
ÖY
ÖX
NewIndexingfunction
Ö
Ä
Å
Z
ÖÖ
ÖÄ
ÖÅ
ÖZfunction
“Address Tag” “Data”
AVDARK
Address Tag
One entry per page =>
Data
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh66
AVDARK2011 Direct-mapped caches with 282 entries 784 entries
Cache Organization1MB di d1MB, direct mapped
32 bit address Identifies the byte within a word00100110000101001010011010100011
(18)
32 bit address Identifies the byte within a word
msb lsb
index1
256k entries
(18)0101001 0010011100101
Mem (12) (12)
Mem Overhead:13/32= 40%
=
Addrtag
(1)Valid
13/32= 40%
Latency =
AVDARK(32)&
(1)Latency =SRAM+CMP+AND
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh67
AVDARK2011
Hit?(1) Data
Pros/Cons Large Caches++ The safest way to get improved hit rate
SRAM i !!-- SRAMs are very expensive!!-- Larger size ==> slower speed
more load on “signals”longer distanceslonger distances
-- (power consumption)( li bilit )
AVDARK
-- (reliability)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh68
AVDARK2011
Why do you hit in a cache?Temporal locality
Lik l t th d t i Likely to access the same data again soon
Spatial localityLikely to access nearby data again soon
Typical access pattern:(inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, ...
AVDARK
, , , , , , , , ,
temporal spatial
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh69
AVDARK2011
temporal spatial
Fetch more than a word: cache blocks(a.k.a cache line)
Identifies the word within a cache lineIdentifies a byte within a word
cac e b oc s(a a cac e e)1MB, direct mapped, CacheLine=16B
Identifies a byte within a word00100110000101001010011010100011
(16)msb lsb
1
64k entries
0101001 0010011100101(16)index
0010011100101 0010011100101 0010011100101
Mem entries(12) (12)
(32) (32) (32) (32)Overhead:13/128= 10%
Multiplexer(2)= 128 bitsLatency =SRAM+CMP+AND
AVDARK (32)
Multiplexer(4:1 mux)
( )Select
&(1)
8 b sSRAM+CMP+AND
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh70
AVDARK2011
(32)DataHit?(1)
Example in ClasspDirect mapped cache:
Cache size = 64 kB C h li 16 BCache line = 16 BWord size = 4B32 bits address (byte addressable)
“There are 10 kinds of people in the world:Th h d t d bi b d
AVDARK
Those who understand binary number and those who do not.”
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh71
AVDARK2011
Pros/Cons Large Cache Lines/ g++ Explores spatial locality++ Fits well with modern DRAMs++ Fits well with modern DRAMs
* first DRAM access slow* subsequent accesses fast (“page mode”)
-- Poor usage of SRAM & BW for some patternsPoor usage of SRAM & BW for some patterns-- Higher miss penalty (fix: critical word first)
(False sharing in multiprocessors)-- (False sharing in multiprocessors)
P f
AVDARK
Perf
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh72
AVDARK2011
Cache line size
UART: StatCache Graph app=matrix multiplyapp matrix multiply
Small cache:Short cache lines are better
Large cachesLarge cachesLonger cache lines are better
Huge caches:Everything fits regardless of CL size
AVDARK
Everything fits regardless of CL size
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh73
AVDARK2011 Thanks: Dr. Erik Berg
Note: this is just a single example, butthe conclusion typically holds for mostapplications.
Cache ConflictsTypical access pattern:
(inner loop stepping through an array) (inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, …
temporal spatial
What if B and C index to the same cache location
AVDARK
What if B and C index to the same cache locationConflict misses -- big time!Potential performance loss 10-100x
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh74
AVDARK2011
p
Address Book CacheTwo names per page: index first, then search.
TOMMY
UT
OMAS
TOMMYindex
23457
ZYXVUOMAS
EQ?
23457
ÖÄÅZ
12345OMMY
EQ?EQ?
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh75
AVDARK2011
Avoiding conflict: More associativityg y1MB, 2-way set-associative, CL=4B
Identifies a byte within a word
1
00100110000101001010011010100011
(17)0101001 0010011100101 1 0101001 0010011100101
msb lsb
index1
128k “sets”
0101001 0010011100101 1 0101001 0010011100101
(13) (13) One “set”
(32) (32)= =
Multiplexer&(1)
&
AVDARK H h ld th (32)
(2:1 mux)
Hit?
(1)Select“logic”Latency =
SRAM+CMP+AND+
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh76
AVDARK2011
How should the select signal be
produced?
DataLOGIC+MUX
Pros/Cons Associativity
++ Avoids conflict misses-- Slower access time
More complex implementation-- More complex implementationcomparators, muxes, ...
-- Requires more pins (for external SRAM )
AVDARK
SRAM…)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh77
AVDARK2011
Going all the way…!1MB f ll i i CL 16B1MB, fully associative, CL=16B
Identifies a byte within a wordIdentifies the word within a cache line
00100110000101001010011010100011One “set”
(0)
Identifies a byte within a word
(2)index
1(0)
(28)0101001 0010011100101
(13)
1 0101001 0010011100101 1 0101001 0010011100101 1
= 16B 16B= = = 16B...
& Multiplexer& & &64kcomparators
AVDARK Hit? 4B
u p e e(256k:1 mux)
Select (16)“logic”
p
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh78
AVDARK2011
Hit? 4BData
Fully Associative
Very expensiveOnly used for small caches (and sometimes TLBs)so et es s)
CAM = Contents-addressable memory~Fully-associative cache storing key+data
AVDARK
y g yProvide key to CAM and get the associated data
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh79
AVDARK2011
data
A combination thereof1MB, 2-way, CL=16B1MB, 2 way, CL=16B
Identifies a byte within a wordIdentifies the word within a cache line
001001100001010010100110101000110
(15)
de es by e w wo d
msb lsb
index1
32k “sets”
( )
(13)0101001 0010011100101
(13)
1 0101001 0010011100101
=sets
(128) (128)=
& Multiplexer&(2)
(256)
AVDARK Hit? (32)
(8:1 mux)(1)Select“logic”
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh80
AVDARK2011
Hit? ( )Data
Example in Class
Cache size = 2 MBCache line = 64 BWord size = 8B (64 bits)Word size = 8B (64 bits)4-way set associative32 bits address (byte addressable)
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh81
AVDARK2011
Who to replace?pPicking a “victim”
L t tl d ( k LRU)Least-recently used (aka LRU)Considered the “best” algorithm (which is not always true )always true…)Only practical up to limited number of ways
Not most recently usedNot most recently usedRemember who used it last: 8-way -> 3 bits/CL
Pseudo LRUPseudo-LRUE.g., based on course time stamps. Used in the VM system
AVDARK
Used in the VM systemRandom replacement
Can’t continuously to have “bad luck
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh82
AVDARK2011
Can t continuously to have bad luck...
Cache Model: Random vs. LRU
LRURandom
Random
LRU
AVDARK
art (SPEC 2000) equake (SPEC 2000)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh83
AVDARK2011
4-way sub-blocked cache1MB, direct mapped, Block=64B, sub-block=16B1MB, direct mapped, Block=64B, sub block=16B
Identifies a byte within a wordIdentifies the word within a cache line
Sub block within a block
1
00100110000101001010011010100011
(14)0101001 0010011100101 00100111001010 1 0
msb lsb
index1
16k 0101001 0010011100101 00100111001010 1 0
Mem
=
(12) (12)(128) (128) (128) (128)Overhead:
16/512= 3% =
& 16:1 mux(2)
512 bits& & &
AVDARK
(4)
(32)
& 16:1 mux& & &
logic4:1 mux(2)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh84
AVDARK2011 Hit? Data
Pros/Cons Sub-blocking
++ Lowers the memory overhead++ (Avoids problems with false sharing -- MP)++ Avoids problems with bandwidth waste ++ Avoids problems with bandwidth waste -- Will not explore as much spatial locality
Still poor utilization of SRAM -- Still poor utilization of SRAM -- Fewer sparse “things” allocated
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh85
AVDARK2011
Replacing dirty cache linesWrite-back
W it di t d t b k t ( t l l) t Write dirty data back to memory (next level) at replacementA “dirty bit” indicates an altered cache lineA dirty bit indicates an altered cache line
Write-through l i h h h l l ( ll)Always write through to the next level (as well)
data will never be dirty no write-backs
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh86
AVDARK2011
Write Buffer/Store Buffer/
Do not need the old value for a storeDo not need the old value for a store
cachecache
WB:==WB:
stores loads
==
CPU
AVDARKOne option: Write around (no write allocate
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh87
AVDARK2011
p (in caches) used for lower level smaller caches
Innovative cache: Victim cache
address
Cacheaddress
hit Memory
CPU datahit Memory
dd
data
VCaddresshit
Victim Cache (VC): a small fairly associative cache (~10s of entries)data (a word)
AVDARK
Victim Cache (VC): a small, fairly associative cache (~10s of entries)Lookup: search cache and VC in parallelCache replacement: move victim to the VC and replace in VC
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh88
AVDARK2011
p pVC hit: swap VC data with the corresponding data in Cache“A second life ☺”
Skewed Associative CacheSkewed Associative CacheA, B and C have a three-way conflict
2 way 4-way 2-way skewed2-wayAB
4 wayAB
2 way skewedAB
C C C
It has been shown that 2-way skewed performs roughly the same as 4-way caches
AVDARK
roughly the same as 4 way caches
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh89
AVDARK2011
Skewed-associative cache:Different indexing functionsDifferent indexing functions
32 bit address Identifies the byte within a word
1
00100110000101001010011010100011
(17)0101001 0010011100101f(>18)
msb lsb
index1
128k entries(13)
0101001 0010011100101
1 0101001 0010011100101
f1
f
( 18)
(>18) (17)1 0101001f2
= = (32) (32)
AVDARK
& 2:1mux&
function (32)
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh90
AVDARK2011
function (32)
UART: Elbow cacheUART: Elbow cacheIncrease “associativity” when needed
If severe conflict:make roomConflict!!
AB
AB
AB
CC
Performs roughly the same as an 8-way cacheSlightly faster
AVDARK
Slightly fasterUses much less power!!
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh91
AVDARK2011
Topology of caches: Harvard Arch
CPU needs a new instruction each cycleCPU needs a new instruction each cycle25% of instruction LD/ST D t d I t h diff t ttData and Instr. have different access patterns
==> Separate D and I first level cache==> Unified 2nd and 3rd level caches
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh92
AVDARK2011
Cache Hierarchy of TodayCache Hierarchy of Today
DRAM Memory
L3$ Off-chip SRAM (today more commonly on chip)
L2$
$
Use whatever transistors there
commonly on-chip)
L1 D$hi
$on-chip
L1 I$hi
i left for the Last-Level Cache (LLC)
AVDARKCPU
on-chip on-chip
Small enough to keep up with the CPU speed
Separate I cache to allow for instruction fetch and data fetch in
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh93
AVDARK2011
U p d and data fetch in parallel
HW prefetching ...a little green man that anticipates your next
memory access and prefetches the data to y pthe cache.
Improves MLP!Sequential prefetching: Sequential streams [to a page]. Some number of prefetch streams supported Often only for L2 and L3 streams supported. Often only for L2 and L3. PC-based prefetching: Detects strides from the same PC Often also for L1
AVDARK
the same PC. Often also for L1. Adjacent prefetching: On a miss, also bring in the “next” cache line. Often only for L2 and
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh94
AVDARK2011
yL3.
Hardware prefetchingHardware ”monitor” looking for patterns in memory accessesmemory accessesBrings data of anticipated future accesses into the cache prior to their usageinto the cache prior to their usageTwo major types:
Sequential prefetching (typically page-based 2nd Sequential prefetching (typically page-based, 2nd level cache and higher). Detects sequential cache lines missing in the cache.
AVDARK
PC-based prefetching, integrated with the pipeline. Finds per-PC strides. Can find more complicated patterns
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh95
AVDARK2011
patterns.
Cache Capacity/Latency/BW
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh96
AVDARK2011
Cache implementation Cacheline, here 64B: Cache implementationAT S Data = 64B
Generic Cache:
Addr [63..0]MSB LSB SRAM:
Caches at all level roughly work like this:
L3 € 24MB
index
L2 $256kB
==
index
256kB
D1 ¢
...
=======
I1 ¢
AVDARK
D1 ¢64kB
muxHit
Sel way ”6”
I1 ¢64kB
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh97
AVDARK2011
Data = 64B
Address Book AnalogygyTwo names per page: index first, then search.
TOMMY
UT
OMAS
TOMMYindex
23457
ZYXVUOMAS
EQ?
23457
ÖÄÅZ
12345OMMY
EQ?EQ?
AVDARK
Select the second entry!
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh98
AVDARK2011
Cache lingogCacheline: Data chunk move to/from a cache Cache set: Fraction of the cache identified by the indexyAssociativity: Number of alternative storage places for a cachelineReplacement policy: picking the victim to throw out Replacement policy: picking the victim to throw out from a set (LRU/Random/Nehalem) Temporal locality: Likelihood to access the same data p yagain soonSpatial locality: Likelihood to access nearby data again soonsoon
Typical access pattern:(inner loop stepping through an array)
AVDARK
(inner loop stepping through an array) A, B, C, A+4, B, C, A+8, B, C, ...
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh99
AVDARK2011
temporal spatial
Cache Lingo Picture Cacheline, here 64B: Cache Lingo PictureAT S Data = 64B
Generic Cache:
Addr [63..0]MSB LSB SRAM:
Cache Set:
tag index
CAssociativity= 8-way
...
AVDARK
muxHit
Sel way ”6”
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh100
AVDARK2011
Data = 64B
E l N h l i7 ( l )Exempel Nehalem i7 (one example)
AVDARK
32kB8-way pLRU
256kB8-way pLRUnon-incl
8MB16-way Neh.-replnon-incl
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh101
AVDARK2011
non incl non incl
Take-away message: CachesTake away message: Caches
Cache are fast but smallCache space in cache-line chunks (~64bytes) LSB part of the address is used to find the LSB part of the address is used to find the ”set” (aka, indexing)Th i li it d b f h li There is a limited number of cache lines per set (associativity)Typically, several levels of caches
AVDARK
The most important target for optimizations
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh102
AVDARK2011
How are we doing?g
Create and explore locality:Create and explore locality:a) Spatial locality b) Temporal localityb) Temporal localityc) Geographical locality
Create and explore parallelismCreate and explore parallelisma) Instruction level parallelism (ILP)b) Thread level parallelism (TLP)c) Memory level parallelism (MLP)
AVDARK
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~eh103
AVDARK2011