Upload
jan-heiskell
View
215
Download
1
Tags:
Embed Size (px)
Citation preview
2
This Lecture
Superscalar Hardware P6 & P4 Microarchitectures
3
Instruction Buffers
Integer register file
Floating point register file
Decode rename dispatch
Floating point inst. buffer
Integer address inst buffer
Functional units
Functional units and data cache
Memory interface
Reorder and commit
Inst.buffer
Pre-decode Inst.
Cache
4
Issue Buffer Organization
a) Single, shared queue b)Multiple queue; one per inst. type
No out-of-orderNo Renaming
No out-of-order inside queuesQueues issue out of order
5
Issue Buffer Organization
c) Multiple reservation stations; (one per instruction type or big pool)
NO FIFO ordering Ready operands, hardware available execution starts Proposed by Tomasulo
From Instruction Dispatch
6
Typical reservation station
Operation source 1 data 1 valid 1 source 2 data 2 valid 2 destination
7
Memory Hazard Detection Logic
Address add & translation
Address compare
Load address buffer
Store address buffer
loads
stores
Hazard Control
To memoryInstruction issue
8
Summary
Dynamic ILP Instruction buffer
Split ID into two stages one for in-order and other for out-of-order issue
Socreboard out-of-order, doesn’t deal with WAR/WAW hazards
Tomasulo’s algorithmUses register renaming to eliminate WAR/WAW
hazards Dynamic scheduling + precise state + speculation Superscalar
9
The P6 Microarchitecture
P6: Introduced in 1995 Basis for Pentium Pro, Pentium 2 and Pentium 3
Differences: Instruction set extensions (MMX added to Pentium 2, SSE added to Pentium 3)
3 Instructions fetched/decoded every cycle. Instructions are translated to uops. Uops: Risk instructions Register renaming and ROB is used. Pipeline is 14 stages: 8 stages to fetch/decode/dispatch in-order. 3 stages to execute out-of-order 3 stages to commit
10
The P6 Microarchitecture
Functional Units: integer unit, FP unit, branch unit, memory address unit.
Register Renaming uses 40 physical registers, 20 reservation stations and a 40 entry ROB.
Voltage 2.9, Power 14 watt
Dual Cavity Package, 0.6 micron process
11
The P6 Microarchitecture
Compared to Pentium (P5)
Pipeline stage 14 vs. 5 3-way vs. 2-way
Fundamental goal: Solve the memory latency problem
MOB (Memory Ordering Buffer) makes sure that: Stores : Never reordered, Never Speculated. Loads : Can Pass Loads/Stores (MOB-Memory Ordering Buffer)
Forwarding and Bypassing happen.
12
Dynamic Scheduling in P6
Q: How pipeline 1 to 17 byte 80x86 instructions? P6 doesn’t pipeline 80x86 instructions P6 decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS) Sends micro-operations to reorder buffer & reservation stations Many instructions translate to 1 to 4 micro-operations Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
13
Dynamic Scheduling in P6
Parameter 80x86 microopsMax. instructions issued/clock 3 6Max. instr. complete exec./clock 5Max. instr. commited/clock 3Window (Instrs in reorder buffer) 40Number of reservations stations 20Number of rename registers 40No. integer functional units (FUs) 2No. floating point FUs 1No. SIMD Fl. Pt. Fus 1No. memory Fus 1 load + 1 store
14
P6 Pipeline
8 stages are used for in-order instruction fetch, decode, and issue Takes 1 clock cycle to determine length of 80x86 instructions
+ 2 more to create the micro-operations (uops) 3 stages are used for out-of-order execution in one of 5
separate functional units 3 stages are used for instruction commit
InstrFetch16B/clk
InstrDecode3 Instr
/clk
Renaming3 uops
/clk
Execu-tionunits(5)
Gradu-ation
3 uops/clk
16B 6 uopsReserv.Station
ReorderBuffer
15
P6 Block Diagram
16
Pentium III Die Photo
EBL/BBL - Bus logic, Front, Back MOB - Memory Order Buffer Packed FPU - MMX Fl. Pt. (SSE) IEU - Integer Execution Unit FAU - Fl. Pt. Arithmetic Unit MIU - Memory Interface Unit DCU - Data Cache Unit PMH - Page Miss Handler DTLB - Data TLB BAC - Branch Address Calculator RAT - Register Alias Table SIMD - Packed Fl. Pt. RS - Reservation Station BTB - Branch Target Buffer IFU - Instruction Fetch Unit (+I$) ID - Instruction Decode ROB - Reorder Buffer MS - Micro-instruction Sequencer
1st Pentium III : 9.5 M transistors, 12.3 * 10.4 mm in 0.25-mi. with 5 layers of aluminum
17
P6 Performance: uops/x86 instr
1 1.1 1.2 1.3 1.4 1.5 1.6 1.7
wave5
fpppp
apsi
turb3d
applu
mgrid
hydro2d
su2cor
swim
tomcatv
vortex
perl
ijpeg
li
compress
gcc
m88ksim
go
1.2 to 1.6 uops per IA-32 instruction: 1.36 avg. (1.37 integer)
18
P6: Branch Misprediction Rate
0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
wave5
fpppp
apsi
turb3d
applu
mgrid
hydro2d
su2cor
swim
tomcatv
vortex
perl
ijpeg
li
compress
gcc
m88ksim
go
10% to 40% Miss/Mispredict ratio: 20% avg. (29% integer)
BTB miss frequency
Mispredict frequency
19
P6: Miss-predicted instructions
0% 10% 20% 30% 40% 50% 60%
wave5
fpppp
apsi
turb3d
applu
mgrid
hydro2d
su2cor
swim
tomcatv
vortex
perl
ijpeg
li
compress
gcc
m88ksim
go
1% to 60% instructions do not commit: 20% avg (30% integer)
20
P6 Performance: Cache Misses/1k instr
0 20 40 60 80 100 120 140 160
wave5
fpppp
apsi
turb3d
applu
mgrid
hydro2d
su2cor
swim
tomcatv
vortex
perl
ijpeg
li
compress
gcc
m88ksim
go
10 to 160 Misses per Thousand Instructions: 49 avg (30 integer)
L1 Instruction
L1 Data
L2
21
P6 Performance: uops commit/clock
Average0: 55%1: 13%2: 8%3: 23%
Integer0: 40%1: 21%2: 12%3: 27%
0% 20% 40% 60% 80% 100%
wave5
fpppp
apsi
turb3d
applu
mgrid
hydro2d
su2cor
swim
tomcatv
vortex
perl
ijpeg
li
compress
gcc
m88ksim
go
0 uops commit
1 uop commits
2 uops commit
3 uops commit
22
P6 vs. AMD Althon
Similar to P6 microarchitecture (Pentium III), but more resources
Transistors: PIII 24M v. Althon 37M Die Size: 106 mm2 v. 117 mm2
Power: 30W v. 76W Cache: 16K/16K/256K v. 64K/64K/256K Window size: 40 vs. 72 uops Rename registers: 40 v. 36 int +36 Fl. Pt. BTB: 512 x 2 v. 4096 x 2 Pipeline: 10-12 stages v. 9-11 stages Clock rate: 1.0 GHz v. 1.2 GHz Memory bandwidth: 1.06 GB/s v. 2.12 GB/s
23
Pentium 4
Known as NetBurst architecture
Still translate from 80x86 to micro-ops P4 has better branch predictor, more FUs Instruction Cache holds micro-operations vs. 80x86 instructions
no decode stages of 80x86 on cache hit called “trace cache” (TC)
Faster memory bus: 400 MHz v. 133 MHz Caches
Pentium III: L1I 16KB, L1D 16KB, L2 256 KB Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock
24
Pentium 4 features
Clock rates: Pentium III 1 GHz v. Pentium IV 1.5 GHz 14 stage pipeline vs. 24 stage pipeline
42 Million transistors
ALUs operate at 2X clock rate for many ops Rename registers: 40 vs. 128; Window: 40 v. 126 BTB: 512 vs. 4096 entries (Intel: 1/3 improvement)
Can retire 3 uops per cycle.
Branch Predictor removes 1/3 of mispredicted branches compared to P6
25
Pentium, Pentium Pro, P4 Pipeline
Pentium (P5) = 5 stagesPentium Pro, II, III (P6) = 10 stages (1 cycle ex)Pentium 4 (NetBurst) = 20 stages (no decode)
From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00
26
Block Diagram of Pentium 4 Microarchitecture
BTB = Branch Target Buffer (branch predictor) I-TLB = Instruction TLB, Trace Cache = Instruction cache (Delivers uops) RF = Register File; AGU = Address Generation Unit "Double pumped ALU" means ALU clock rate 2X => 2X ALU F.U.sFrom “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00
27
Block Diagram of Pentium 4 Microarchitecture
Micro-op Queues: one for memory, one for non-memory operations. Register renaming: ROB is NOT used for register renaming. Dispatch bandwidth (6) exceeds front-end and retirement bandwidth (3) ALU operations are done twice as fast as the clock. Key: ALU bypass loop
28
Pentium 4 Microarchitecture
Longest latencies: Multiply 14, Divide 60
Low-latency small 8K L1 cache, medium latency large 256 L2 cache
Store to Load Forwarding: Pending Loads use Pending Stores before the stores have happened.
29
Pentium 4 Die Photo
42M Xtors PIII: 26M
217 mm2
PIII: 106 mm2
L1 Execution Cache Buffer 12,000
Micro-Ops 8KB data cache 256KB L2$
30
Benchmarks: Pentium 4 v. PIII v. Athlon
SPECbase2000 Int, [email protected] GHz: 524, PIII@1GHz: 454, AMD [email protected]:? FP, [email protected] GHz: 549, PIII@1GHz: 329, AMD [email protected]:304
WorldBench 2000 benchmark (business) PC World magazine, Nov. 20, 2000 (bigger is better) P4 : 164, PIII : 167, AMD Athlon: 180
Quake 3 Arena: P4 172, Athlon 151
SYSmark 2000 composite: P4 209, Athlon 221
Office productivity: P4 197, Athlon 209
S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing-- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."
31
Why?
Instruction count is the same for x86 Clock rates: P4 > Athlon > PIII How can P4 be slower? Time =
Instruction count x CPI x 1/Clock rate Average Clocks Per Instruction (CPI) of P4 must be worse than
Athlon, PIII
32
Readings & Homework
Readings Download papers from the website: P6 and P4.