CSE502: Computer Architecture
Review
Course Overview (1/2)
• Caveat 1: I’m (kind of) new here.
• Caveat 2: This is a (somewhat) new course.
• Computer Architecture is… the science and art of selecting and interconnecting hardware and software components to create computers…
Course Overview (2/2)
• This course is hard, roughly like CSE 506
– In CSE 506, you learn what’s inside an OS
– In CSE 502, you learn what’s inside a CPU
• This is a project course
– Learn why things are the way they are, first hand
– We will “build” emulators of CPU components
Policy and Projects
• Probably different from other classes
– Much more open, but much more strict
• Most people followed the policy
• Some did not
– Resembles the “real world”
• You’re here because you want to learn and to be here
• If you managed to get your partner(s) to do the work
– You’re probably good enough to do it at your job too
» The good: You might make a good manager
» The bad: You didn’t learn much
• Time mgmt. often more important than tech. skill
– If you started early, you probably have an A already
Amdahl’s Law
Speedup = time_without enhancement / time_with enhancement
An enhancement speeds up fraction f of a task by factor S:
time_new = time_orig · ((1 - f) + f/S)
S_overall = 1 / ((1 - f) + f/S)
[Figure: time_orig divided into (1 - f) and f; time_new divided into (1 - f) and f/S]
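The law above is easy to sanity-check numerically. A minimal sketch (function name is mine):

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of a task is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up half the task by 2x gives only ~1.33x overall;
# even an infinite speedup of half the task caps out near 2x.
```

Note how the serial fraction (1 - f) bounds the achievable speedup no matter how large S gets.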
The Iron Law of Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
• Instructions/Program: total work in program — set by algorithms, compilers, ISA extensions
• Cycles/Instruction (CPI, or 1/IPC): set by microarchitecture
• Time/Cycle (1/f, frequency): set by microarchitecture and process tech
Architects target CPI, but must understand the others
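The Iron Law as a one-line calculation (function and parameter names are mine):

```python
def exec_time(insns, cpi, cycle_time_s):
    # Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle
    return insns * cpi * cycle_time_s

# 1 billion instructions at CPI 1.5 on a 1 GHz clock (1 ns cycle) -> 1.5 s
```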
Averaging Performance Numbers (2/2)
• Arithmetic: for quantities proportional to time (e.g., latency)
AM = (1/n) · Σ_i Time_i
• Harmonic: for rates, inversely proportional to time (e.g., throughput)
HM = n / Σ_i (1/Rate_i)
• Geometric: for unit-less ratios (e.g., speedups)
GM = (Π_i Ratio_i)^(1/n)
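The three means above, as small helpers (names are mine):

```python
import math

def arithmetic_mean(times):
    # right for latencies (proportional to time)
    return sum(times) / len(times)

def harmonic_mean(rates):
    # right for throughputs (inversely proportional to time)
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(ratios):
    # right for unit-less ratios such as speedups
    return math.prod(ratios) ** (1.0 / len(ratios))
```

Using the wrong mean (e.g., arithmetic on rates) systematically overstates performance.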
Power vs. Energy
• Power: instantaneous rate of energy transfer
– Expressed in Watts
– In architecture, implies conversion of electricity to heat
– Power(Comp1+Comp2) = Power(Comp1) + Power(Comp2)
• Energy: measure of using power for some time
– Expressed in Joules
– power × time (Joules = Watts × seconds)
– Energy(Op1+Op2) = Energy(Op1) + Energy(Op2)
What uses power in a chip?
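The power/energy distinction above is just units; a tiny sketch (names are mine):

```python
def energy_joules(power_watts, seconds):
    # Energy = power x time (Joules = Watts x seconds)
    return power_watts * seconds

# Power adds across components running at the same time;
# energy adds across operations performed one after another.
def total_power(component_watts):
    return sum(component_watts)
```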
ISA: A Contract between HW and SW
• ISA: Instruction Set Architecture
– A well-defined hardware/software interface
• The “contract” between software and hardware
– Functional definition of operations supported by hardware
– Precise description of how to invoke all features
• No guarantees regarding
– How operations are implemented
– Which operations are fast and which are slow (and when)
– Which operations take more energy (and which take less)
Components of an ISA
• Programmer-visible state
– Program counter, general-purpose registers, memory, control registers
• Programmer-visible behaviors
– What to do, when to do it
• A binary encoding

Example “register-transfer-level” description of an instruction:
if imem[pc] == “add rd, rs, rt”
then pc ← pc+1; gpr[rd] = gpr[rs] + gpr[rt]

ISAs last forever, don’t add stuff you don’t need
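The register-transfer description above can be run directly as a toy interpreter (this encoding of `imem` as decoded tuples is my simplification, not a real binary encoding):

```python
def step(imem, gpr, pc):
    """Execute one decoded instruction; returns the next pc."""
    op, rd, rs, rt = imem[pc]
    if op == "add":
        gpr[rd] = gpr[rs] + gpr[rt]   # gpr[rd] = gpr[rs] + gpr[rt]
    return pc + 1                     # pc <- pc + 1
```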
Locality Principle
• Recent past is a good indication of near future
– Spatial Locality: if you looked something up, it is very likely you will look up something nearby soon
– Temporal Locality: if you looked something up, it is very likely that you will look it up again soon
Caches
• An automatically managed hierarchy
• Break memory into blocks (several bytes) and transfer data to/from cache in blocks
– spatial locality
• Keep recently accessed blocks
– temporal locality
[Figure: Core ↔ $ ↔ Memory]
Fully-Associative Cache
• Keep blocks in cache frames
– data
– state (e.g., valid)
– address tag
Address: tag[63:6] | block offset[5:0]
[Figure: every frame’s tag and state are checked in parallel (=, hit?); a multiplexor selects the matching frame’s data]
What happens when the cache runs out of space?
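The tag/offset split above for 64-byte blocks is pure bit arithmetic (function name is mine):

```python
def split_fully_assoc(addr):
    """Split a 64-bit address for a fully-associative cache with 64B blocks."""
    offset = addr & 0x3F   # block offset[5:0]
    tag = addr >> 6        # tag[63:6]
    return tag, offset
```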
The 3 C’s of Cache Misses
• Compulsory: never accessed before
• Capacity: accessed long ago and already replaced
• Conflict: neither compulsory nor capacity
• Coherence: (in multi-cores, become owner to write)
Cache Size
• Cache size is data capacity (don’t count tag and state)
– Bigger can exploit temporal locality better
– Not always better
• Too large a cache
– Smaller is faster ⇒ bigger is slower
– Access time may hurt critical path
• Too small a cache
– Limited temporal locality
– Useful data constantly replaced
[Figure: hit rate rises with working set coverage, then flattens once capacity exceeds the working set size]
Block Size
• Block size is the data that is
– Associated with an address tag
– Not necessarily the unit of transfer between hierarchies
• Too small a block
– Don’t exploit spatial locality well
– Excessive tag overhead
• Too large a block
– Useless data transferred
– Too few total blocks
• Useful data frequently replaced
[Figure: hit rate vs. block size — rises, peaks, then falls]
Direct-Mapped Cache
• Use middle bits as index
• Only one tag comparison
Address: tag[63:16] | index[15:6] | block offset[5:0]
[Figure: a decoder selects one frame by index; its tag and state are compared (=, tag match/hit?); a multiplexor selects the block data]
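The direct-mapped address split above, for 64B blocks and 1024 sets (function name is mine):

```python
def split_direct_mapped(addr):
    """Split a 64-bit address: tag[63:16] | index[15:6] | offset[5:0]."""
    offset = addr & 0x3F          # block offset[5:0]
    index = (addr >> 6) & 0x3FF   # index[15:6]: 10 bits -> 1024 frames
    tag = addr >> 16              # tag[63:16]
    return tag, index, offset
```

Two addresses with the same index but different tags conflict for the same frame — the source of conflict misses.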
N-Way Set-Associative Cache
Address: tag[63:15] | index[14:6] | block offset[5:0]
[Figure: the index selects a set via a decoder; all N ways’ tags and state are compared in parallel (=, hit?); a multiplexor picks the data from the matching way]
Note the additional bit(s) moved from index to tag
Associativity
• Larger associativity
– lower miss rate (fewer conflicts)
– higher power consumption
• Smaller associativity
– lower cost
– faster hit time
[Figure: hit rate vs. associativity — diminishing returns, ~5 for L1-D]
Parallel vs Serial Caches
• Tag and Data usually separate (tag is smaller & faster)
– State bits stored along with tags
• Valid bit, “LRU” bit(s), …
[Figure: left — tag and data arrays accessed in parallel, with valid? and tag compares (=) selecting the hit data; right — tag array accessed first, its hit? signal enables the data array]
Parallel access to Tag and Data reduces latency (good for L1)
Serial access to Tag and Data reduces power (good for L2+)
Physically-Indexed Caches
• Core requests are VAs
• Cache index is PA[15:6]
– VA passes through TLB
– D-TLB on critical path
• Cache tag is PA[63:16]
• If index size < page size
– Can use VA for index
[Figure: Virtual Address — tag[63:14] | index[13:6] | block offset[5:0]; virtual page[63:13] | page offset[12:0]. The page-offset bits supply index[6:0] directly; the D-TLB supplies the physical tag and the remaining physical index bit(s) before the tag compares (=)]
Virtually-Indexed Caches
• Core requests are VAs
• Cache index is VA[15:6]
• Cache tag is PA[63:16]
• Why not tag with VA?
– Cache flush on ctx switch
• Virtual aliases
– Ensure they don’t exist
– … or check all on miss
[Figure: Virtual Address — tag[63:14] | index[13:6] | block offset[5:0]; virtual page[63:13] | page offset[12:0]. The virtual index[7:0] accesses the cache while the D-TLB produces the physical tag for the compares (=); one index bit overlaps the virtual page number]
Inclusion
• Core often accesses blocks not present on chip
– Should block be allocated in L3, L2, and L1?
• Called Inclusive caches
• Waste of space
• Requires forced evict (e.g., force evict from L1 on evict from L2+)
– Only allocate blocks in L1
• Called Non-inclusive caches (why not “exclusive”?)
• Must write back clean lines
• Some processors combine both
– L3 is inclusive of L1 and L2
– L2 is non-inclusive of L1 (like a large victim cache)
Parity & ECC
• Cosmic radiation can strike at any time
– Especially at high altitude
– Or during solar flares
• What can be done?
– Parity
• 1 bit to indicate if sum is odd/even (detects single-bit errors)
– Error Correcting Codes (ECC)
• 8-bit code per 64-bit word
• Generally SECDED (Single-Error-Correct, Double-Error-Detect)
• Detecting errors on clean cache lines is harmless
– Pretend it’s a cache miss
Example word: 0 1 0 1 1 0 0 1 0 1 1 0 0 1 0 1
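Even parity over the example word above, as a sketch (function name is mine):

```python
def parity_bit(bits):
    """Even parity: the stored bit is 1 iff the number of 1s is odd,
    so data + parity always has an even number of 1s."""
    return sum(bits) & 1

word = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
```

Flipping any single bit of `word` changes the parity, which is exactly why parity detects (but cannot locate or correct) single-bit errors.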
SRAM vs. DRAM
• SRAM = Static RAM
– As long as power is present, data is retained
• DRAM = Dynamic RAM
– If you don’t do anything, you lose the data
• SRAM: 6T per bit
– Built with normal high-speed CMOS technology
• DRAM: 1T per bit (+1 capacitor)
– Built with special DRAM process optimized for density
DRAM Chip Organization
• Low-level organization is very similar to SRAM
• Cells are only single-ended
– Reads are destructive: contents are erased by reading
• Row buffer holds read data
– Data in row buffer is called a DRAM row
• Often called “page” — not necessarily same as OS page
– Read gets entire row into the buffer
– Block reads always performed out of the row buffer
• Reading a whole row, but accessing one block
• Similar to reading a cache line, but accessing one word
DRAM Organization
[Figure: a dual-rank x8 (2Rx8) DIMM — two ranks of nine x8 DRAM chips each; every chip contains multiple banks]
• All banks within the rank share all address and control pins
• All banks are independent, but a rank can only talk to one bank at a time
• x8 means each DRAM outputs 8 bits; need 8 chips for DDRx (64-bit)
• Why 9 chips per rank? 64 bits data, 8 bits ECC
AMAT with MLP
• If …
cache hit is 10 cycles (core to L1 and back)
memory access is 100 cycles (core to mem and back)
• Then …
at 50% miss ratio, avg. access: 0.5×10 + 0.5×100 = 55
• Unless MLP is >1.0, then…
at 50% mr, 1.5 MLP, avg. access: (0.5×10 + 0.5×100)/1.5 ≈ 37
at 50% mr, 4.0 MLP, avg. access: (0.5×10 + 0.5×100)/4.0 ≈ 14
In many cases, MLP dictates performance
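The AMAT-with-MLP numbers above can be reproduced directly (function name is mine; dividing the whole AMAT by MLP follows the slide’s simplification):

```python
def amat(hit_time, miss_time, miss_ratio, mlp=1.0):
    """Average memory access time, divided by memory-level parallelism."""
    serial = miss_ratio * miss_time + (1 - miss_ratio) * hit_time
    return serial / mlp

# amat(10, 100, 0.5)      -> 55 cycles
# amat(10, 100, 0.5, 1.5) -> ~37 cycles
# amat(10, 100, 0.5, 4.0) -> ~14 cycles
```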
Memory Controller (1/2)
[Figure: requests to/from the CPU enter the memory controller’s read queue, write queue, and response queue; a scheduler buffer issues commands and data to Channel 0 and Channel 1]
Memory Controller (2/2)
• Memory controller connects CPU and DRAM
• Receives requests after cache misses in LLC
– Possibly originating from multiple cores
• Complicated piece of hardware, handles:
– DRAM Refresh
– Row-Buffer Management Policies
– Address Mapping Schemes
– Request Scheduling
Address Mapping Schemes
• Example open-page mapping schemes:
– High parallelism: [row rank bank column channel offset]
– Easy expandability: [channel rank row bank column offset]
• Example close-page mapping schemes:
– High parallelism: [row column rank bank channel offset]
– Easy expandability: [channel rank row column bank offset]
Memory Request Scheduling
• Write buffering
– Writes can wait until reads are done
• Queue DRAM commands
– Usually into per-bank queues
– Allows easy reordering of ops. meant for the same bank
• Common policies:
– First-Come-First-Served (FCFS)
– First-Ready—First-Come-First-Served (FR-FCFS)
Prefetching (1/2)
• Fetch block ahead of demand
• Target compulsory, capacity, (& coherence) misses
– Not conflict: prefetched block would conflict
• Big challenges:
– Knowing “what” to fetch
• Fetching useless blocks wastes resources
– Knowing “when” to fetch
• Too early clutters storage (or gets thrown out before use)
• Fetching too late defeats purpose of “pre”-fetching
Prefetching (2/2)
[Figure: timelines of Load → L1 → L2 → DRAM → Data]
• Without prefetching: the load pays the total Load-to-Use latency
• With prefetching: prefetch issued early enough — much improved Load-to-Use latency
• Or: prefetch issued late — somewhat improved latency
Prefetching must be accurate and timely
Next-Line (or Adjacent-Line) Prefetching
• On request for line X, prefetch X+1 (or X^0x1)
– Assumes spatial locality
• Often a good assumption
– Should stop at physical (OS) page boundaries
• Can often be done efficiently
– Adjacent-line is convenient when next-level block is bigger
– Prefetch from DRAM can use bursts and row-buffer hits
• Works for I$ and D$
– Instructions execute sequentially
– Large data structures often span multiple blocks
Simple, but usually not timely
Next-N-Line Prefetching
• On request for line X, prefetch X+1, X+2, …, X+N
– N is called “prefetch depth” or “prefetch degree”
• Must carefully tune depth N. Large N is …
– More likely to be useful (correct and timely)
– More aggressive ⇒ more likely to make a mistake
• Might evict something useful
– More expensive ⇒ need storage for prefetched lines
• Might delay useful request on interconnect or port
Still simple, but more timely than Next-Line
Stride Prefetching
• Access patterns often follow a stride
– Accessing column of elements in a matrix
– Accessing elements in array of structs
• Detect stride S, prefetch depth N
– Prefetch X+1∙S, X+2∙S, …, X+N∙S
[Figure: column in matrix; elements in array of structs]
“Localized” Stride Prefetchers
• Store PC, last address, last stride, and count in RPT
• On access, check RPT (Reference Prediction Table)
– Same stride? count++ if yes, count-- or count=0 if no
– If count is high, prefetch (last address + stride∙N)
Example accesses:
PCa: 0x409A34  Load  R1 = [R2]
PCb: 0x409A38  Load  R3 = [R4]
PCc: 0x409A40  Store [R6] = R5
RPT (tag | last addr | stride | count):
0x409… | A+3N | N | 2
0x409… | X+3N | N | 2
0x409… | Y+2N | N | 1
If confident about the stride (count > Cmin), prefetch (A+4∙N)
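A minimal sketch of the RPT mechanism above (class and parameter names are mine; real hardware uses a small tagged table, not a dict):

```python
class StridePrefetcher:
    def __init__(self, cmin=2, depth=1):
        self.rpt = {}              # pc -> [last_addr, stride, count]
        self.cmin = cmin           # confidence threshold
        self.depth = depth         # prefetch depth N

    def access(self, pc, addr):
        """Update the RPT for this (pc, addr); return addresses to prefetch."""
        prefetches = []
        if pc in self.rpt:
            last_addr, stride, count = self.rpt[pc]
            new_stride = addr - last_addr
            count = count + 1 if new_stride == stride else 0
            if count >= self.cmin:           # confident about the stride
                prefetches = [addr + new_stride * i
                              for i in range(1, self.depth + 1)]
            self.rpt[pc] = [addr, new_stride, count]
        else:
            self.rpt[pc] = [addr, 0, 0]      # first sighting: no prediction
        return prefetches
```

After three accesses with a constant stride, the entry becomes confident and the fourth access triggers a prefetch one stride ahead.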
Evaluating Prefetchers
• Compare against larger caches
– Complex prefetcher vs. simple prefetcher with larger cache
• Primary metrics
– Coverage: prefetched hits / base misses
– Accuracy: prefetched hits / total prefetches
– Timeliness: latency of prefetched blocks / hit latency
• Secondary metrics
– Pollution: misses / (prefetched hits + base misses)
– Bandwidth: (total prefetches + misses) / base misses
– Power, Energy, Area, …
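The two primary ratios above, made concrete (function names are mine):

```python
def coverage(prefetched_hits, base_misses):
    # fraction of the baseline's misses that the prefetcher eliminated
    return prefetched_hits / base_misses

def accuracy(prefetched_hits, total_prefetches):
    # fraction of issued prefetches that were actually used
    return prefetched_hits / total_prefetches
```

A prefetcher can have high coverage with poor accuracy (it fetches everything), so both must be reported together.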
Before there was pipelining…
• Single-cycle control: hardwired
– Low CPI (1)
– Long clock period (to accommodate slowest instruction)
• Multi-cycle control: micro-programmed
– Short clock period
– High CPI
• Can we have both low CPI and short clock period?
[Figure (time →): Single-cycle — insn0.(fetch,decode,exec) then insn1.(fetch,decode,exec); Multi-cycle — insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec]
Pipelining
• Start with multi-cycle design
• When insn0 goes from stage 1 to stage 2
… insn1 starts stage 1
• Each instruction passes through all stages
… but instructions enter and leave at faster rate
[Figure (time →): Multi-cycle — insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec; Pipelined — insn1.fetch overlaps insn0.dec, insn2.fetch overlaps insn1.dec and insn0.exec]
Can have as many insns in flight as there are stages
Instruction Dependencies
• Data Dependence
– Read-After-Write (RAW) (only true dependence)
• Read must wait until earlier write finishes
– Anti-Dependence (WAR)
• Write must wait until earlier read finishes (avoid clobbering)
– Output Dependence (WAW)
• Earlier write can’t overwrite later write
• Control Dependence (a.k.a. Procedural Dependence)
– Branch condition must execute before branch target
– Instructions after branch cannot run before branch
Pipeline Terminology
• Pipeline Hazards
– Potential violations of program dependencies
– Must ensure program dependencies are not violated
• Hazard Resolution
– Static method: performed at compile time in software
– Dynamic method: performed at runtime using hardware
– Two options: stall (costs perf.) or forward (costs hw.)
• Pipeline Interlock
– Hardware mechanism for dynamic hazard resolution
– Must detect and enforce dependencies at runtime
Simple 5-stage Pipeline
[Figure: PC → Inst Cache → IF/ID latch → register file (R0–R7, read ports regA/regB) → ID/EX latch (op, dest, valA, valB, PC+1) → MUXes + ALU (produces ALU result, branch target, eq?) → EX/Mem latch (op, dest, ALU result, valB) → Data Cache → Mem/WB latch (op, dest, ALU result, mdata) → MUX → writeback (data, dest) into the register file]
Balancing Pipeline Stages
Stage delays: T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units
• Coarser-grained machine cycle (IF&ID, OF, EX, OS): # stages = 4, Tcyc = 9 units ⇒ 4 machine cyc / instruction
• Finer-grained machine cycle (each stage split into sub-stages of ≤ 3 units): # stages = 11, Tcyc = 3 units ⇒ 11 machine cyc / instruction
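The trade-off above follows from one fact: the clock period is set by the slowest stage. A quick check (names are mine; the 11-way split into ≤ 3-unit sub-stages follows the slide’s numbers):

```python
def machine_cycle(stage_delays):
    # the clock period must accommodate the slowest stage
    return max(stage_delays)

coarse = [8, 9, 5, 9]                      # IF&ID, OF, EX, OS
fine = [3, 3, 2] + [3, 3, 3] + [3, 2] + [3, 3, 3]   # 11 sub-stages, each <= 3

# per-instruction latency = #stages x Tcyc:
# coarse: 4 x 9 = 36 units; fine: 11 x 3 = 33 units
# but throughput improves 3x (one insn per 3 units instead of per 9)
```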
IPC vs. Frequency
• Losing 10-15% IPC is not bad if frequency can double
– 1000ps stage: 2.0 IPC @ 1GHz ⇒ 2 BIPS
– split into 500ps + 500ps: 1.7 IPC @ 2GHz ⇒ 3.4 BIPS
• Frequency doesn’t actually double
– Latch/pipeline overhead: 900ps of logic ⇒ 450ps + 450ps stages
– Stage imbalance: 900ps split as 350ps + 550ps ⇒ only 1.5GHz
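The BIPS arithmetic above is simply IPC times frequency (function name is mine):

```python
def bips(ipc, freq_ghz):
    """Billions of instructions per second = IPC x clock frequency (GHz)."""
    return ipc * freq_ghz

# 2.0 IPC @ 1 GHz -> 2.0 BIPS; 1.7 IPC @ 2 GHz -> 3.4 BIPS,
# a net win despite the 15% IPC loss -- if frequency really doubles.
```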
Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
– Instruction/overlap parallelism = D
– Operation Latency = 1
– Peak IPC = 1.0
[Figure: successive instructions vs. time in cycles (1–12), pipeline depth D; D different instructions overlapped]
Superscalar Machine
• Superscalar (pipelined) Execution
– Instruction parallelism = D x N
– Operation Latency = 1
– Peak IPC = N per cycle
[Figure: successive instructions vs. time in cycles (1–12), N instructions issued per cycle; D x N different instructions overlapped]
RISC ISA Format
• Fixed-length
– MIPS: all insts are 32 bits / 4 bytes
• Few formats
– MIPS has 3: R (reg, reg, reg), I (reg, reg, imm), J (addr)
– Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP
• Regularity across formats (when possible/practical)
– MIPS & Alpha opcode in same bit-position for all formats
– MIPS rs & rt fields in same bit-position for R and I formats
– Alpha ra/fa field in same bit-position for all 5 formats
Superscalar Decode for RISC ISAs
• Decode X insns. per cycle (e.g., 4-wide)
– Just duplicate the hardware
– Instructions aligned at 32-bit boundaries
[Figure: scalar — one 32-bit inst feeds one decoder, producing one decoded inst; superscalar — a 4-wide fetch feeds four 32-bit insts to four parallel decoders, producing four decoded insts]
CISC ISA
• RISC focus on fast access to information
– Easy decode, I$, large RFs, D$
• CISC focus on max expressiveness per min space
– Designed in era with fewer transistors, chips
– Each memory access very expensive
• Pack as much work into as few bytes as possible
• More “expressive” instructions
– Better potential code generation in theory
– More complex code generation in practice
ADD in RISC ISA
Mode      Example          Meaning
Register  ADD R4, R3, R2   R4 = R3 + R2
ADD in CISC ISA
Mode               Example            Meaning
Register           ADD R4, R3         R4 = R4 + R3
Immediate          ADD R4, #3         R4 = R4 + 3
Displacement       ADD R4, 100(R1)    R4 = R4 + Mem[100+R1]
Register Indirect  ADD R4, (R1)       R4 = R4 + Mem[R1]
Indexed/Base       ADD R3, (R1+R2)    R3 = R3 + Mem[R1+R2]
Direct/Absolute    ADD R1, (1234)     R1 = R1 + Mem[1234]
Memory Indirect    ADD R1, @(R3)      R1 = R1 + Mem[Mem[R3]]
Auto-Increment     ADD R1, (R2)+      R1 = R1 + Mem[R2]; R2++
Auto-Decrement     ADD R1, -(R2)      R2--; R1 = R1 + Mem[R2]
RISC (MIPS) vs CISC (x86)
lui  R1, Disp[31:16]
ori  R1, R1, Disp[15:0]
add  R1, R1, R2
shli R3, R3, 3
add  R3, R3, R1
lui  R1, Imm[31:16]
ori  R1, R1, Imm[15:0]
st   [R3], R1
vs.
MOV [EBX+EAX*8+Disp], Imm
8 insns. at 32 bits each vs 1 insn. at 88 bits: 2.9x!
x86 Encoding
• Basic x86 Instruction:
Prefixes (0-4 bytes) | Opcode (1-2 bytes) | Mod R/M (0-1 bytes) | SIB (0-1 bytes) | Displacement (0/1/2/4 bytes) | Immediate (0/1/2/4 bytes)
Shortest inst: 1 byte; longest inst: 15 bytes
• Opcode has flag indicating Mod R/M is present
– Most instructions use the Mod R/M byte
– Mod R/M specifies if optional SIB byte is used
– Mod R/M and SIB may specify additional constants
Instruction length not known until after decode
Instruction Cache Organization
• To fetch N instructions per cycle...
– L1-I line must be wide enough for N instructions
• PC register selects L1-I line
• A fetch group is the set of insns. starting at PC
– For N-wide machine, [PC, PC+N-1]
[Figure: PC feeds a decoder that selects one cache line of Tag + 4 insts]
Fetch Misalignment
• Now takes two cycles to fetch N instructions
[Figure: Cycle 1 — PC xxx01001 selects line 010; only the insts at offsets 01–11 are fetched. Cycle 2 — PC xxx01100 selects line 011; the remaining inst at offset 00 is fetched]
Fragmentation due to Branches
• Fetch group is aligned, cache line size > fetch group
– Taken branches still limit fetch width
[Figure: a fetch group containing a taken branch — the insts after the branch in the group (X’d out) are discarded]
Types of Branches
• Direction:
– Conditional vs. Unconditional
• Target:
– PC-encoded
• PC-relative
• Absolute offset
– Computed (target derived from register)
Need direction and target to find next fetch group
Branch Prediction Overview
• Use two hardware predictors
– Direction predictor guesses if branch is taken or not-taken
– Target predictor guesses the destination PC
• Predictions are based on history
– Use previous behavior as indication of future behavior
– Use historical context to disambiguate predictions
Direction vs. Target Prediction
• Direction: 0 or 1
• Target: 32- or 64-bit value
• Turns out targets are generally easier to predict
– Don’t need to predict N-t target
– T target doesn’t usually change
• Only need to predict taken-branch targets
• Prediction is really just a “cache”
– Branch Target Buffer (BTB)
[Figure: PC indexes the target predictor; the not-taken path is simply PC + sizeof(inst)]
Branch Target Buffer (BTB)
[Figure: the branch PC indexes the BTB; each entry holds a valid bit (V), the branch instruction address (BIA) as the tag, and the branch target address (BTA); on a tag match (=) with V set ⇒ hit, and the BTA becomes the next fetch PC]
BTB w/ Partial Tags
[Figure: full tags store the whole branch PC (e.g., 00000000cfff9810 → target 00000000cfff9704); partial tags keep only low-order bits (f981 → 00000000cfff9704), so a different branch such as 00001111beef9810 aliases to the same entry]
Fewer bits to compare, but prediction may alias
BTB w/ PC-offset Encoding
[Figure: instead of the full target (e.g., 00000000cfff9900), store only its low-order bits (ff9900) and concatenate them with the upper bits of the fetch PC (00000000cf) to reconstruct the target]
If target too far or PC rolls over, will mispredict
Branches Have Locality
• If a branch was previously taken…
– There’s a good chance it’ll be taken again

for(i=0; i < 100000; i++)
{
    /* do stuff */
}

This branch will be taken 99,999 times in a row.
Last Outcome Predictor
• Do what you did last time

0xDC08: for(i=0; i < 100000; i++)
        {
0xDC44:     if( (i % 100) == 0 )
                tick( );
0xDC50:     if( (i & 1) == 1)
                odd( );
        }
Saturating Two-Bit Counter
[Figure: last-outcome prediction is a 2-state FSM (0 = predict N-t, 1 = predict T); the 2-bit counter (2bC) FSM has states 0–3, with 0/1 predicting N-t and 2/3 predicting T; each T outcome transitions toward 3, each N-t outcome toward 0]
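The 2bC FSM above in a few lines (class name is mine):

```python
class TwoBitCounter:
    def __init__(self, state=1):
        self.state = state            # 0..3: 0/1 predict not-taken, 2/3 taken

    def predict(self):
        return self.state >= 2        # True = predict taken

    def update(self, taken):
        # saturate at the ends instead of wrapping
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

The hysteresis is the point: one anomalous outcome (e.g., a loop exit) nudges the counter but does not immediately flip the prediction.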
Typical Organization of 2bC Predictor
[Figure: the PC (32 or 64 bits) is hashed down to log2(n) bits to index a table of n counters; the indexed counter supplies the prediction, and FSM update logic rewrites the entry using the actual outcome]
Track the History of Branches
[Figure: each PC entry holds the previous outcome plus two counters — one used when prev=0 and one when prev=1; e.g., with prev=1 and that counter at 3, the prediction is T, and the history and counters update as outcomes arrive]
Deeper History Covers More Patterns
• Counters learn “pattern” of prediction
[Figure: each PC entry holds the previous 3 outcomes plus one counter per 3-bit history (prev=000 … prev=111); learning 001→1, 011→0, 110→0, 100→1 captures the repeating pattern 00110011… = (0011)*]
Predictor Training Time
• Ex: prediction equals opposite of 2nd-most-recent outcome
• Hist len = 2 ⇒ 4 states to train:
NN→T, NT→T, TN→N, TT→N
• Hist len = 3 ⇒ 8 states to train:
NNN→T, NNT→T, NTN→N, NTT→N, TNN→T, TNT→T, TTN→N, TTT→N
Predictor Organizations
[Figure: three options — hash the PC so each branch gets its own pattern history; share one set of patterns among all branches; or a mix of both]
Two-Level Predictor Organization
• Branch History Table (BHT)
– 2^a entries
– h-bit history per entry
• Pattern History Table (PHT)
– 2^b sets
– 2^h counters per set
• Total size in bits
– h·2^a + 2·2^(b+h)
[Figure: a hash bits of the PC index the BHT; b hash bits plus the h-bit history index the PHT; each PHT entry is a 2-bit counter]
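The size formula above, made explicit (function name is mine):

```python
def two_level_bits(a, b, h):
    """Storage of a two-level predictor in bits:
    BHT: 2^a entries of h history bits each;
    PHT: 2^b sets x 2^h counters x 2 bits per counter."""
    return h * 2**a + 2 * 2**(b + h)

# a=10, b=4, h=8: 8*1024 + 2*4096 = 16384 bits (2 KB)
```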
Combined Indexing
• “gshare” (S. McFarling)
[Figure: k bits of the PC hash XORed with the k-bit global history index the counter table; k = log2(#counters)]
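The gshare index computation above is one XOR (function name is mine; shifting the PC right by 2 to drop byte-offset bits is a common choice, not something the slide specifies):

```python
def gshare_index(pc, global_history, k):
    """Index into a table of 2^k counters: XOR of PC hash and history."""
    mask = (1 << k) - 1
    return ((pc >> 2) ^ global_history) & mask
```

XORing in the history lets one branch PC use different counters in different history contexts, while sharing the whole table across branches.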
OoO Execution
• Out-of-Order execution (OoO)
– Totally in the hardware
– Also called dynamic scheduling
• Fetch many instructions into instruction window
– Use branch prediction to speculate past branches
• Rename regs. to avoid false deps. (WAW and WAR)
• Execute insns. as soon as possible
– As soon as deps. (regs and memory) are known
• Today’s machines: 100+ insns. scheduling window
Superscalar != Out-of-Order
A: R1 = Load 16[R2]
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 – 4
E: R7 = Load 20[R5]
F: R4 = R4 – 1
G: BEQ R4, #0
[Figure: schedules of A–G with A’s load missing in the cache —
1-wide In-Order: 10 cycles; 2-wide In-Order: 8 cycles;
1-wide Out-of-Order: 7 cycles; 2-wide Out-of-Order: 5 cycles]
Review of Register Dependencies
Read-After-Write:  A: R1 = R2 + R3; B: R4 = R1 * R4
Write-After-Read:  A: R1 = R3 / R4; B: R3 = R2 * R4
Write-After-Write: A: R1 = R2 + R3; B: R1 = R3 * R4
[Figure: for each pair, the final contents of R1–R4 when A executes before B vs. B before A — reordering a dependent pair leaves wrong values in the registers]
Register Renaming
• Register renaming (in hardware)
– “Change” register names to eliminate WAR/WAW hazards
– Arch. registers (r1, f0, …) are names, not storage locations
– Can have more locations than names
– Can have multiple active versions of same name
• How does it work?
– Map table: maps names to most recent locations
– On a write: allocate new location, note in map table
– On a read: find location of most recent write via map table
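The map-table mechanism above in miniature (names and the simple “one new location per write” scheme are mine; no free-list recycling is modeled):

```python
def rename(insns, n_arch_regs):
    """insns: list of (dest_reg, [src_regs]); returns renamed (loc, [locs])."""
    map_table = {r: r for r in range(n_arch_regs)}  # name -> most recent location
    next_loc = n_arch_regs                          # fresh physical locations
    out = []
    for dst, srcs in insns:
        renamed_srcs = [map_table[s] for s in srcs]  # read most recent writes
        map_table[dst] = next_loc                    # write allocates a new location
        out.append((next_loc, renamed_srcs))
        next_loc += 1
    return out
```

Running the WAW pair from the previous slide (A: R1 = R2 + R3; B: R1 = R3 * R4) gives the two writes distinct locations, so they can complete in any order.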
Tomasulo’s Algorithm
• Reservation Stations (RS): instruction buffer
• Common Data Bus (CDB): broadcasts results to RS
• Register renaming: removes WAR/WAW hazards
• Bypassing (not shown here to make example simpler)
Tomasulo Data Structures
[Figure: fetched insns dispatch through the Map Table (tags T) and Regfile into Reservation Stations (fields T, Top, T1, T2, V1, V2); functional units broadcast results on the Common Data Bus (CDB.V, CDB.T) back to the RS and Regfile]
Where is the “register rename”?
• Value copies in RS (V1, V2)
• Insn. stores correct input values in its own RS entry
• “Free list” is implicit (allocate/deallocate as part of RS)
[Figure: same datapath as the previous slide — Map Table, Reservation Stations, Regfile, CDB]
Precise State
• Speculative execution requires
– (Ability to) abort & restart at every branch
– Abort & restart at every load
• Synchronous (exception and trap) events require
– Abort & restart at every load, store, divide, …
• Asynchronous (hardware) interrupts require
– Abort & restart at every ??
• Real world: bite the bullet
– Implement abort & restart at every insn.
– Called precise state
Complete and Retire
• Complete (C): insns. write results into ROB
– Out-of-order: don’t block younger insns.
• Retire (R): a.k.a. commit, graduate
– ROB writes results to register file
– In-order: stall back-propagates to younger insns.
[Figure: pipeline with I$, branch predictor, regfile, and L1-D; the Re-Order Buffer sits between completion (C) and retirement (R)]
P6 Data Structures
[Figure: dispatch reads the Map Table (tags T+) and Regfile, fills RS entries (T, Top, T1, T2, V1, V2), and allocates a ROB entry (T, R, value) at the tail; FUs broadcast on the CDB (CDB.V, CDB.T); retirement from the ROB head writes values back to the Regfile]
MIPS R10K: Alternative Implementation
• One big physical register file holds all data – no copies
+ Register file close to FUs ⇒ small and fast data path
– ROB and RS “on the side” used only for control and tags
[Figure: the Map Table maps names to physical tags (T+); RS entries hold tags only (T1+, T2+); the ROB records each insn’s new tag T and displaced tag Told; a Free List supplies tags at dispatch, and the Arch. Map is updated (and Told freed) at retire]
Executing Memory Instructions
• If R1 != R7
– Then Load R8 gets correct value from cache
• If R1 == R7
– Then Load R8 should get value from the Store
– But it didn’t!
[Figure:
Load  R3 = 0[R6]    — issues, cache miss!
Add   R7 = R3 + R9  — waits for the miss to be serviced
Store R4 → 0[R7]    — waits for R7
Sub   R1 = R1 – R2  — issues
Load  R8 = 0[R1]    — issues, cache hit! …while the older Store’s address is still unknown]
Memory Disambiguation Problem
• Ordering problem is a data-dependence violation
• Imprecise memory worse than imprecise registers
• Why can’t this happen with non-memory insts?
– Operand specifiers in non-memory insns. are absolute
• “R1” refers to one specific location
– Operand specifiers in memory insns. are ambiguous
• “R1” refers to a memory location specified by the value of R1
• When pointers (e.g., R1) change, so does this location
Two Problems
• Memory disambiguation on loads
– Do earlier unexecuted stores to the same address exist?
• Binary question: answer is yes or no
• Store-to-load forwarding problem
– I’m a load: which earlier store do I get my value from?
– I’m a store: which later load(s) do I forward my value to?
• Non-binary question: answer is one or more insn. identifiers
Load/Store Queue (1/2)
• Load/store queue (LSQ)
– Completed stores write to LSQ
– When store retires, head of LSQ written to L1-D
• (or write buffer)
– When loads execute, access LSQ and L1-D in parallel
• Forward from LSQ if older store with matching address
Load/Store Queue (2/2)
[Figure: I$ and branch predictor feed the pipeline; the ROB and LSQ sit beside the regfile; stores write addr and store data into the LSQ; loads send their addr to the LSQ and L1-D in parallel and receive load data. Almost a “real” processor diagram]
Loads Execute When …
• Most aggressive approach
• Relies on fact that store→load forwarding is rare
• Greatest potential IPC – loads never stall
• Potential for incorrect execution
– Need to be able to “undo” bad loads
Detecting Ordering Violations
• Case 1: Older store execs before younger load
– No problem; if same address, st→ld forwarding happens
• Case 2: Older store execs after younger load
– Store scans all younger loads
– Address match ⇒ ordering violation
Loads Checking for Earlier Stores
• On Load dispatch, find data from earlier Store
[Figure: the LSQ’s address bank holds older stores (ST 0x4000, ST 0x4000, ST 0x4120); the load (LD 0x4000) compares (=) against all of them, qualifies each compare with the valid-store bit and “no earlier matches” priority logic, and uses the selected store’s entry in the data bank]
Need to adjust this so that the load need not be at the bottom, and the LSQ can wrap around
If |LSQ| is large, logic can be adapted to have log delay
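The priority search above — find the youngest older store to the same address — is the key operation; a behavioral sketch (function name and the list encoding of the LSQ are mine; hardware does this with parallel compares, not a loop):

```python
def forwarding_store(lsq, load_idx, load_addr):
    """lsq: list of (kind, addr) tuples, oldest first.
    Return the index of the youngest store older than the load that
    matches the load's address, or None if the load should use the cache."""
    for i in range(load_idx - 1, -1, -1):   # scan from the load toward older entries
        kind, addr = lsq[i]
        if kind == "ST" and addr == load_addr:
            return i                         # closest earlier matching store
    return None
```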
Data Forwarding
• On execute Store (STA+STD), check for later Loads (similar logic to the previous slide)
[Figure: the store (ST 0x4000) scans younger LSQ entries; a later load with a matching address (LD 0x4000) captures the value from the data bank, unless an intervening store has overwritten that address]
This is ugly, complicated, slow, and power hungry
Data-Capture Scheduler
• Dispatch: read available operands from ARF/ROB, store in scheduler
• Missing operands filled in from bypass
• Issue: when ready, operands sent directly from scheduler to functional units
[Figure: Fetch & Dispatch reads the ARF and PRF/ROB into the data-capture scheduler; functional-unit results return via the bypass network and also update the physical registers]
Scheduling Loop or Wakeup-Select Loop
• Wake-Up Part:
– Executing insn notifies dependents
– Waiting insns. check if all deps are satisfied
• If yes, “wake up” instruction
• Select Part:
– Choose which instructions get to execute
• More than one insn. can be ready
• Number of functional units and memory ports are limited
Interaction with Execution
[Figure: select logic chooses a ready entry (A); that entry’s payload RAM row (SRD, SL, opcode, ValL, ValR) is read out and sent to the functional unit]
Simple Scheduler Pipeline
Cycle i: A does Select → Payload → Execute; A's tag broadcast wakes B (enable capture on tag match), and A's result broadcast is captured.
Cycle i+1: B does Select → Payload → Execute; B's tag broadcast wakes C and enables its capture.
Very long clock cycle
CSE502: Computer Architecture
Deeper Scheduler Pipeline
Cycle i: A does Select → Payload; cycle i+1: A executes and its result broadcast is captured, while A's tag broadcast has already woken B (enable capture on tag match).
B then does Select → Payload in cycle i+1 and Execute in cycle i+2; its tag broadcast wakes C, and so on through cycle i+3.
Faster, but Capture & Payload on same cycle
CSE502: Computer Architecture
Very Deep Scheduler Pipeline
(Pipeline diagram over cycles i … i+6: A does Wakeup → Select → Payload → Execute; A & B are both ready, but only A is selected, so B bids again; B then does Select → Payload → Execute, followed by C and D. A→C and C→D must be bypassed; B→D is OK without bypass.)
Dependent instructions can't execute back-to-back
CSE502: Computer Architecture
Non-Data-Capture Scheduler
(Diagram, two variants: Fetch & Dispatch → Scheduler → Functional Units, with the physical register update written either to a split ARF + PRF or to a unified PRF.)
CSE502: Computer Architecture
Pipeline Timing
Data-Capture: A does Select → Payload → Execute; wakeup during Payload lets the dependent B start Select in the next cycle, so execution is back-to-back.
Non-Data-Capture: A does Select → Payload → Read Operands from PRF → Exec; wakeup comes later, so the dependent B's schedule "skips" a cycle relative to execution (S X E X vs. S X E).
Substantial increase in schedule-to-execute latency
CSE502: Computer Architecture
Handling Multi-Cycle Instructions
Add R1 = R2 + R3: Sched → PayLd → Exec; wakeup (WU) during PayLd lets the dependent Xor R4 = R1 ^ R5 run back-to-back (Sched → PayLd → Exec).
Mul R1 = R2 × R3: Sched → PayLd → Exec → Exec → Exec; the wakeup (WU) for the dependent Add R4 = R1 + R5 must be delayed until the final Exec cycle.
Instructions can't execute too early
CSE502: Computer Architecture
Non-Deterministic Latencies
• Real situations have unknown latency
– Load instructions
• Latency ∈ {L1_lat, L2_lat, L3_lat, DRAM_lat}
• DRAM_lat is not a constant either (queuing delays)
– Architecture-specific cases
• PowerPC 603 has "early out" for multiplication
• Intel Core 2 has an early-out divider as well
CSE502: Computer Architecture
Load-Hit Speculation
• Caches work pretty well
– Hit rates are high (otherwise we wouldn't use caches)
– Assume all loads hit in the cache
R1 = 16[$sp]: Sched → PayLd → Exec → Exec → Exec (cache hit, data forwarded); the broadcast is delayed by the DL1 latency so the dependent R2 = R1 + #4 schedules just in time.
What to do on a cache miss?
CSE502: Computer Architecture
Simple Select Logic
Scheduler entries bid for execution; the first-bidder-wins grant chain is:
Grant0 = 1
Grant1 = !Bid0
Grant2 = !Bid0 & !Bid1
Grant3 = !Bid0 & !Bid1 & !Bid2
…
Grantn-1 = !Bid0 & … & !Bidn-2
granti = Bidi & Granti
A linear chain over S entries yields O(S) gate delay; restructuring the same computation as a tree needs only O(log S) gates
CSE502: Computer Architecture
Implementing Oldest First Select
(Diagram: ready instructions A–H each present an age; entries that are not ready present age ∞; a tree of age comparators — the Age-Aware Select Logic — picks the oldest and asserts its Grant.)
Must broadcast grant age to instructions
CSE502: Computer Architecture
Problems in N-of-M Select
(Diagram: instructions A–H feed N cascaded age-aware 1-of-M select blocks; each granted entry is masked to age ∞ before the next layer.)
Each 1-of-M select costs O(log M) gate delay, so N layers → O(N log M) delay
CSE502: Computer Architecture
Select Binding
(Diagram: each scheduler entry — e.g. XOR, SUB, two ADDs, a CMP, with their ages — is bound at dispatch to the select logic of one ALU.)
• Not-Quite-Oldest-First: ready insns are aged 2, 3, 4, but the issued insns are 2 and 4
• Wasted Resources: 3 instructions are ready, but one ALU's select logic is idle, so only 1 gets to issue
CSE502: Computer Architecture
Execution Ports
• Divide functional units into P groups
– Called "ports"
• Area only O(P²M log M), where P << F
• Logic for tracking bids and grants less complex (deals with P sets)
(Diagram: scheduler entries such as ADD, LOAD, MUL are bound to Ports 0–4, which feed the ALU1/ALU2/ALU3, M/D, Shift, FAdd, FM/D, SIMD, Load, and Store units.)
CSE502: Computer Architecture
Decentralized RS
• Natural split: INT vs. FP
(Diagram: an Int cluster — ports 0/1 with ALU1, ALU2, Load, Store, the INT RF, and the L1 Data Cache, using int-only wakeup — beside an FP cluster — ports 2/3 with FAdd, FM/D, FP-Ld, FP-St and the FP RF, using FP-only wakeup.)
Often implies a non-ROB-based physical register file: one "unified" integer PRF and one "unified" FP PRF, each managed separately with its own free list
CSE502: Computer Architecture
Higher Complexity not Worth Effort
(Graph: Performance vs. "Effort". Scalar In-Order → Moderate-Pipe Superscalar/OOO: made sense to go Superscalar/OOO, good ROI. Moderate-Pipe → Very-Deep-Pipe Aggressive Superscalar/OOO: very little gain for substantial effort.)
CSE502: Computer Architecture
SMP Machines
• SMP = Symmetric Multi-Processing
– Symmetric = All CPUs have "equal" access to memory
• OS sees multiple CPUs
– Runs one process (or thread) on each CPU
(Diagram: CPU0–CPU3 sharing a memory bus.)
CSE502: Computer Architecture
MP Workload Benefits
(Diagram, runtime on the vertical axis: one 3-wide OOO CPU runs Task A then Task B serially; a 4-wide OOO CPU gives a modest benefit; two 3-wide OOO CPUs, or even two 2-wide OOO CPUs, run Task A and Task B in parallel for a larger benefit.)
CSE502: Computer Architecture
… If Only One Task Available
(Diagram: with only Task A, a 4-wide OOO CPU still gives a small benefit over a 3-wide; with two 3-wide CPUs one sits idle — no benefit over 1 CPU; with two 2-wide CPUs there is a performance degradation!)
CSE502: Computer Architecture
Chip-Multiprocessing (CMP)
• Simple SMP on the same chip
– CPUs now called "cores" by hardware designers
– OS designers still call these "CPUs"
(Figures: Intel "Smithfield" block diagram; AMD dual-core Athlon FX.)
CSE502: Computer Architecture
On-chip Interconnects (1/4)
• Today, (Core+L1+L2) = "core"
– (L3+I/O+Memory) = "uncore"
• How to interconnect multiple "core"s to "uncore"?
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
(Diagram: four cores with private caches on a shared bus to the LLC and memory controller.)
CSE502: Computer Architecture
On-chip Interconnects (2/4)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
(Diagram: a crossbar connecting four cores with private caches to four LLC banks and the memory controller. Example: Oracle UltraSPARC T5 — 3.6 GHz, 16 cores, 8 threads per core.)
CSE502: Computer Architecture
On-chip Interconnects (3/4)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
• 3 ports per switch
• Simple and cheap
• Can be bi-directional to reduce latency
(Diagram: a ring connecting cores, LLC banks, and the memory controller. Example: Intel Sandy Bridge — 3.5 GHz, 6 cores, 2 threads per core.)
CSE502: Computer Architecture
On-chip Interconnects (4/4)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
• Up to 5 ports per switch
• Tiled organization combines core and cache
(Diagram: a mesh of tiles, each pairing a core with an LLC bank, plus the memory controller. Example: Tilera Tile64 — 866 MHz, 64 cores.)
CSE502: Computer Architecture
Multi-Threading
• Uni-Processor: 4-6 wide, lucky if you get 1-2 IPC
– Poor utilization of transistors
• SMP: 2-4 CPUs, but need independent threads
– Poor utilization as well (if limited tasks)
• {Coarse-Grained, Fine-Grained, Simultaneous}-MT
– Use single large uni-processor as a multi-processor
• Core provides multiple hardware contexts (threads)
– Per-thread PC
– Per-thread ARF (or map table)
– Each core appears as multiple CPUs
• OS designers still call these "CPUs"
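The per-thread hardware context (PC plus register map) can be sketched as follows, with a fine-grained-MT core rotating fetch among contexts each cycle. All structure names, the 4-register map, and the fixed 4-byte instruction width are illustrative assumptions.

```python
# Sketch of per-thread hardware contexts: each context keeps its own PC
# and architectural-to-physical map table; a fine-grained MT core
# round-robins fetch among contexts every cycle.

class ThreadContext:
    def __init__(self, tid, start_pc):
        self.tid = tid
        self.pc = start_pc
        # Per-thread map table: architectural reg -> physical reg (illustrative)
        self.map_table = {f"r{i}": f"p{tid}_{i}" for i in range(4)}

class FineGrainedCore:
    def __init__(self, contexts):
        self.contexts = contexts
        self.next = 0

    def fetch_cycle(self):
        """Round-robin: fetch from the next context, advance its PC."""
        ctx = self.contexts[self.next]
        self.next = (self.next + 1) % len(self.contexts)
        fetched = ctx.pc
        ctx.pc += 4        # fixed-width ISA assumed for the sketch
        return ctx.tid, fetched

core = FineGrainedCore([ThreadContext(0, 0x1000), ThreadContext(1, 0x2000)])
print([core.fetch_cycle() for _ in range(4)])
# -> [(0, 4096), (1, 8192), (0, 4100), (1, 8196)]
```

Because each context has its own PC and map table, the OS sees each one as a separate "CPU" even though they share one pipeline.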
CSE502: Computer Architecture
Scalar Pipeline (time →)
Dependencies limit functional unit utilization
CSE502: Computer Architecture
Superscalar Pipeline (time →)
Higher performance than scalar, but lower utilization
CSE502: Computer Architecture
Chip Multiprocessing (CMP) (time →)
Limited utilization when running one thread
CSE502: Computer Architecture
Coarse-Grained Multithreading (time →)
Only good for long-latency ops (i.e., cache misses)
(Hardware context switch between threads)
CSE502: Computer Architecture
Fine-Grained Multithreading (time →)
Saturated workload → lots of threads
Unsaturated workload → lots of stalls
Intra-thread dependencies still limit performance
CSE502: Computer Architecture
Simultaneous Multithreading (time →)
Max utilization of functional units
CSE502: Computer Architecture
Paired vs. Separate Processor/Memory?
• Separate CPU/memory
– Uniform memory access (UMA)
• Equal latency to memory
– Low peak performance
• Paired CPU/memory
– Non-uniform memory access (NUMA)
• Faster local memory
• Data placement matters
– High peak performance
(Diagrams: UMA — CPUs with caches share a bus to separate memory modules; NUMA — each CPU($)+Mem pair connects to the others through routers R.)
CSE502: Computer Architecture
Issues for Shared Memory Systems
• Two big ones
– Cache coherence
– Memory consistency model
• Closely related
• Often confused
CSE502: Computer Architecture
Cache Coherence: The Problem
• Variable A initially has value 0
• P1 stores value 1 into A
• P2 loads A from memory and sees old value 0
(Diagram: P1 and P2 with L1 caches on a shared bus to main memory; t1: P1 stores A=1, updating its L1 copy from 0 to 1; t2: P2 loads A and gets the stale 0 from memory.)
Need to do something to keep P2's cache coherent
CSE502: Computer Architecture
Simple MSI Protocol
Cache actions: Load, Store, Evict
Bus actions: BusRd, BusRdX, BusInv, BusWB, BusReply
• Invalid: Load / BusRd → Shared; Store / BusRdX → Modified
• Shared: Load / -- ; Evict / -- → Invalid; Store / BusInv → Modified; snooped BusRd / [BusReply]; snooped BusRdX or BusInv / [BusReply] → Invalid
• Modified: Load, Store / -- ; Evict / BusWB → Invalid; snooped BusRd / BusReply → Shared; snooped BusRdX / BusReply → Invalid
Usable coherence protocol
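The state machine can be sketched as two lookup tables for one cache's copy of one line: one table for the cache's own actions, one for snooped bus actions. This is a behavioral sketch of the transitions listed above, not a full protocol implementation (no data movement or races modeled).

```python
# Behavioral sketch of the MSI state machine for a single line in one cache.

M, S, I = "Modified", "Shared", "Invalid"

def cpu_event(state, op):
    """Next (state, bus_action) for this cache's own Load/Store/Evict."""
    table = {
        (I, "Load"):  (S, "BusRd"),
        (I, "Store"): (M, "BusRdX"),
        (S, "Load"):  (S, None),
        (S, "Store"): (M, "BusInv"),
        (S, "Evict"): (I, None),
        (M, "Load"):  (M, None),
        (M, "Store"): (M, None),
        (M, "Evict"): (I, "BusWB"),
    }
    return table[(state, op)]

def bus_event(state, op):
    """Next (state, reply) when another cache's request is snooped."""
    table = {
        (S, "BusRd"):  (S, "BusReply"),
        (S, "BusRdX"): (I, "BusReply"),
        (S, "BusInv"): (I, "BusReply"),
        (M, "BusRd"):  (S, "BusReply"),   # supply data, downgrade to Shared
        (M, "BusRdX"): (I, "BusReply"),   # supply data, invalidate
    }
    return table.get((state, op), (state, None))  # Invalid: ignore snoops

state, act = cpu_event(I, "Store")       # write miss: request for ownership
print(state, act)                        # -> Modified BusRdX
state, rep = bus_event(state, "BusRd")   # another core then reads the line
print(state, rep)                        # -> Shared BusReply
```

Walking a few Load/Store/snoop sequences through these tables is a quick way to check the invariant that at most one cache holds a line in Modified.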
CSE502: Computer Architecture
Coherence vs. Consistency
• Coherence concerns only one memory location
• Consistency concerns ordering for all locations
• A memory system is coherent if
– It can serialize all operations to that location
• Operations performed by any core appear in program order
– A read returns the value written by the last store to that location
• A memory system is consistent if
– It follows the rules of its memory model
• Operations on memory locations appear in some defined order
CSE502: Computer Architecture
Sequential Consistency (SC)
(Diagram: processors P1–P3 issue memory ops in program order; a switch, randomly set after each memory op, connects exactly one processor to memory at a time.)
Defines a single sequential order among all ops
CSE502: Computer Architecture
Mutex Example w/ Store Buffer

P1:                           P2:
lockA: A = 1;                 lockB: B = 1;
  if (B != 0)                   if (A != 0)
    { A = 0; goto lockA; }        { B = 0; goto lockB; }
  /* critical section */        /* critical section */
  A = 0;                        B = 0;

(Diagram: on a shared bus, Write A and Write B sit in the store buffers while P1 reads B at t1 and P2 reads A at t2, so both reads return 0 before either write drains to memory at t3/t4.)
Does not work
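The failure can be sketched by modeling each CPU's store buffer explicitly: once W→R order is relaxed, both loads can read memory before either buffered store drains, so both CPUs see 0 and both enter the critical section. The event encoding is illustrative, not from the slides.

```python
# Sketch of the store-buffer mutex failure: stores sit in a per-CPU
# buffer (invisible to the other CPU); loads check the local buffer,
# then memory. "drain" flushes a CPU's buffer to memory.

def run(order):
    mem = {"A": 0, "B": 0}
    sbuf = {1: [], 2: []}                      # per-CPU store buffers
    seen = {}
    for cpu, op in order:
        if op[0] == "store":
            sbuf[cpu].append((op[1], op[2]))   # buffered, not yet visible
        elif op[0] == "load":
            local = dict(sbuf[cpu])            # own buffer forwards first
            seen[(cpu, op[1])] = local.get(op[1], mem[op[1]])
        elif op[0] == "drain":
            for addr, val in sbuf[cpu]:
                mem[addr] = val
            sbuf[cpu].clear()
    return seen

# W->R reordering: both writes buffered past both reads (TSO-legal order).
seen = run([(1, ("store", "A", 1)), (2, ("store", "B", 1)),
            (1, ("load", "B")), (2, ("load", "A")),
            (1, ("drain",)), (2, ("drain",))])
print(seen)  # both loads return 0: both CPUs pass the test, mutex broken
```

Draining each buffer before the opposite load (i.e., inserting a fence between the store and the load) restores the SC outcome where at least one CPU sees a 1.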
CSE502: Computer Architecture
Relaxed Consistency Models
• Sequential Consistency (SC):
– R → W, R → R, W → R, W → W
• Total Store Ordering (TSO) relaxes W → R
– R → W, R → R, W → W
• Partial Store Ordering (PSO) relaxes W → W (coalescing write buffers)
– R → W, R → R
• Weak Ordering or Release Consistency (RC)
– All ordering explicitly declared
• Use fences to define boundaries
• Use acquire and release to force flushing of values
(X → Y: X must complete before Y)
CSE502: Computer Architecture
Good Luck!