ECE 4100/6100 Advanced Computer Architecture
Lecture 12: P6 and NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
2
P6 System Architecture
[Figure: P6 system block diagram. Labeled blocks: host processor (P6 core with L1 cache, SRAM); L2 cache (SRAM) on the back-side bus; front-side bus; MCH; system memory (DRAM); graphics processor (GPU) with local frame buffer, attached via AGP or PCI Express; ICH chipset with PCI, USB, and I/O.]
3
P6 Microarchitecture
[Figure: P6 block diagram showing control flow and (restricted) data flow inside the chip boundary.
 Clusters: Instruction Fetch Cluster, Issue Cluster, Out-of-order Cluster, Memory Cluster, Bus Cluster.
 Blocks: BTB/BAC, Instruction Fetch Unit, Bus interface unit, Instruction Decoder, Microcode Sequencer, Register Alias Table, Allocator, Reservation Station, ROB & Retire RF, AGU, MMX, IEU/JEU, FEU, MIU, Memory Order Buffer, Data Cache Unit (L1), external bus.]
4
Pentium III Die Map
• EBL/BBL - External/Backside Bus Logic
• MOB - Memory Order Buffer
• Packed FPU - Floating Point Unit for SSE
• IEU - Integer Execution Unit
• FAU - Floating Point Arithmetic Unit
• MIU - Memory Interface Unit
• DCU - Data Cache Unit (L1)
• PMH - Page Miss Handler
• DTLB - Data TLB
• BAC - Branch Address Calculator
• RAT - Register Alias Table
• SIMD - Packed Floating Point unit
• RS - Reservation Station
• BTB - Branch Target Buffer
• TAP - Test Access Port
• IFU - Instruction Fetch Unit and L1 I-Cache
• ID - Instruction Decode
• ROB - Reorder Buffer
• MS - Micro-instruction Sequencer
5
P6 Basics
• One implementation of the IA-32 architecture
• Deeply pipelined processor
• In-order front-end and back-end
• Dynamic execution engine (restricted dataflow)
• Speculative execution
• P6 microarchitecture family processors include
  – Pentium Pro
  – Pentium II (PPro + MMX + 2x caches)
  – Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
  – Pentium 4 (not P6; discussed separately)
  – Pentium M (+ SSE2, SSE3, μop fusion)
  – Core (PM + SSSE3, SSE4, Intel 64 (EM64T), macro-op fusion, retirement rate of 4 μops per cycle vs. 3 in earlier P6 derivatives)
6
P6 Pipelining
[Figure: P6 pipeline stages, grouped by segment.
 In-order front end (stages 11-17 and 20-22): Next IP, I-Cache, ILD, Rotate, Dec1, Dec2, Br Dec, IDQ, RAT, RS Write
 Single-cycle pipeline (stages 31-33, 82-83): RS schd, RS Disp, Exec / WB
 Multi-cycle pipeline (stages 31-33, 81-83): Exec2 … Exec n
 Non-blocking memory pipeline (stages 31-33, 42-43, 81-83): AGU, DCache1, DCache2
 Blocking memory pipeline (stages 31-33, 40-43, 81-83): AGU, MOB blk, MOB wr, MOB disp, DCache1, DCache2, MOB wakeup
 Retirement (stages 91-93): Ret ptr wr, Ret ROB rd, RRF wr
 Writeback stages: 81: Mem/FP WB; 82: Int WB schedule; 83: Data WB
 Boundaries and delays: FE in-order boundary, retirement in-order boundary; RS scheduling delay, ROB scheduling delay, MOB scheduling delay
 Simplified view: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2]
7
Instruction Fetching Unit
• IFU1: Initiate fetch, requesting 16 bytes at a time
• IFU2: Instruction length decoder marks instruction boundaries; BTB makes prediction (2 cycles)
• IFU3: Align instructions to 3 decoders in 4-1-1 format
[Figure: IFU datapath. Labeled blocks and signals: Next PC mux (inputs from the Branch Target Buffer and other fetch requests), linear address, Instruction TLB (physical address), Instruction Cache, Victim Cache, Streaming Buffer (data, addr), select mux, ILD (length marks), prediction marks, instruction rotator, instruction buffer, #bytes consumed by ID.]
8
Static Branch Prediction (stage 17 Br. Dec of pg. 6)
[Figure: static prediction flowchart with Yes/No edges.
 Decision nodes: BTB miss?, PC-relative?, Conditional?, Backwards?, Return?, Unconditional PC-relative?
 Outcomes: BTB dynamic predictor's decision; Taken; Taken (indirect jump); Not Taken.]
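The flowchart edges did not survive intact, so the following is only a sketch of one plausible reading of the decision tree, based on the node labels above. The function name `static_predict`, its parameters, and the exact ordering of the tests are assumptions made for illustration, not the BAC's documented logic.

```python
def static_predict(btb_hit, pc_relative=True, conditional=True,
                   backward=False, is_return=False):
    """Return (prediction, reason) for a branch seen at the Br Dec stage.

    Assumed reading of the flowchart: the static rules apply only when
    the BTB misses; otherwise the BTB's dynamic predictor decides.
    """
    if btb_hit:
        return ("dynamic", "BTB dynamic predictor's decision")
    if not pc_relative:
        # Returns and other indirect jumps are both predicted taken.
        return ("taken", "return" if is_return else "indirect jump")
    if not conditional:
        return ("taken", "unconditional PC-relative")
    if backward:
        # Backward conditional branches usually close loops.
        return ("taken", "backward conditional")
    return ("not taken", "forward conditional")
```

Under this reading, the only statically not-taken case is a forward, conditional, PC-relative branch that misses in the BTB.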
9
Dynamic Branch Prediction
• Similar to a 2-level PAs design
• Associated with each BTB entry
• With a 16-entry Return Stack Buffer
• 4 branch predictions per cycle (due to 16-byte fetch per cycle)
• Speculative update (2 copies of BHR)
Static prediction provided by Branch Address Calculator when BTB misses (see prior slide)
[Figure: 512-entry BTB with ways W0-W3. Each entry holds a Branch History Register (BHR, e.g. 1101) that indexes a Pattern History Table (PHT, rows 0000 to 1111) of 2-bit saturating counters to produce the prediction. The branch result (Rc) updates the counter, and the new (speculative) history is written back to the BHR.]
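A minimal sketch of a two-level predictor in the spirit of the slide: a per-branch history register selects a 2-bit saturating counter in a pattern history table. The class name `TwoLevelPredictor`, the 4-bit history length (taken from the 0000 to 1111 rows in the figure), the 128-set organization, and the initial counter value are assumptions. The sketch updates history at branch resolution and does not model the speculative second BHR copy.

```python
class TwoLevelPredictor:
    HIST_BITS = 4  # assumed from the 4-bit patterns shown in the figure

    def __init__(self, num_sets=128):
        self.num_sets = num_sets
        self.bhr = {}  # branch address -> local history bits
        # One PHT per set; counters start at 1 (weakly not taken).
        self.pht = [[1] * (1 << self.HIST_BITS) for _ in range(num_sets)]

    def predict(self, addr):
        """Predict taken when the selected 2-bit counter is 2 or 3."""
        hist = self.bhr.get(addr, 0)
        return self.pht[addr % self.num_sets][hist] >= 2

    def update(self, addr, taken):
        """Train the selected counter, then shift the outcome into the BHR."""
        hist = self.bhr.get(addr, 0)
        row = self.pht[addr % self.num_sets]
        if taken:
            row[hist] = min(3, row[hist] + 1)
        else:
            row[hist] = max(0, row[hist] - 1)
        mask = (1 << self.HIST_BITS) - 1
        self.bhr[addr] = ((hist << 1) | int(taken)) & mask
```

With 4 bits of history, a loop branch that repeats taken, taken, taken, not taken sees a distinct history before each outcome, so each counter learns its own direction after a short warm-up.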
10
X86 Instruction Decode
• 4-1-1 decoder
• Decode rate depends on instruction alignment
• DEC1: translate x86 instructions into micro-operations (μops)
• DEC2: move decoded μops to the ID queue
• MS performs translation in one of two ways
  – Generate the entire μop sequence from the “microcode ROM”
  – Receive 4 μops from the complex decoder, and the rest from the microcode ROM
• Instructions that follow an instruction needing the MS are flushed
[Figure: the instruction buffer (16 bytes) feeds one complex decoder (1-4 μops) and two simple decoders (1 μop each); together with the micro-instruction sequencer (MS) they fill the instruction decoder queue (6 μops), which drains to RAT/ALLOC.]

Next 3 inst    #Inst to decode
S,S,S          3
S,S,C          First 2
S,C,S          First 1
S,C,C          First 1
C,S,S          3
C,S,C          First 2
C,C,S          First 1
C,C,C          First 1
S: Simple, C: Complex
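The table above follows from one rule: the first instruction can always use the complex decoder, while the second and third slots accept only simple instructions. A small sketch of that rule (the function name `decodable` and the string encoding of the window are illustrative assumptions):

```python
def decodable(window):
    """Count how many of the next 3 instructions decode this cycle.

    window: string of 'S' (simple) and 'C' (complex), oldest first.
    Slot 0 is the complex decoder; slots 1 and 2 are simple decoders.
    """
    count = 0
    for slot, kind in enumerate(window[:3]):
        if slot > 0 and kind == "C":
            break  # a complex instruction must wait for slot 0 next cycle
        count += 1
    return count
```

For example, `decodable("CSS")` gives 3 while `decodable("SCS")` gives 1, which is why decode rate depends on how instructions line up against the 4-1-1 template.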
11
Register Alias Table (RAT)
• Register renaming for 8 integer registers, 8 floating point (stack) registers and flags: 3 μops per cycle
• 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc)
• RAT looks up physical ROB locations for renamed sources based on the RRF bit
• Override logic handles dependent μops decoded in the same cycle
• A misprediction reverts all pointers to point to the Retirement Register File (RRF)
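[Figure: RAT datapath. Logical sources from the in-order queue index the Integer RAT array and the FP RAT array (with FP TOS adjust). Integer and FP override logic selects between the array physical source (PSrc) and the physical ROB pointers from the Allocator to produce the RAT PSrc's.
 Renaming example: RAT entries for EAX, EBX, ECX, EDX, each with an RRF bit and a PSrc field that points into either the ROB or the RRF.]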
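The renaming behaviour listed above can be sketched as follows. This is a simplified model under stated assumptions: the class name `RAT`, the tuple encoding of μops, and the `("ROB", n)` / `("RRF", reg)` tags are invented for illustration, and retirement (which would switch an entry back to the RRF) is not modeled. Processing the μops of one cycle in order stands in for the override logic.

```python
class RAT:
    REGS = ("EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP")

    def __init__(self, rob_size=40):
        self.rob_size = rob_size
        self.next_rob = 0  # next ROB entry handed out by the allocator
        self.table = {r: ("RRF", r) for r in self.REGS}

    def rename(self, uops):
        """Rename up to 3 uops of the form (dst, src1, src2); None = unused."""
        renamed = []
        for dst, *srcs in uops[:3]:
            # Sources read the current mapping, including destinations
            # written by earlier uops in the same cycle (override logic).
            psrcs = [self.table[s] for s in srcs if s is not None]
            pdst = None
            if dst is not None:
                pdst = ("ROB", self.next_rob)
                self.next_rob = (self.next_rob + 1) % self.rob_size
                self.table[dst] = pdst
            renamed.append((pdst, psrcs))
        return renamed

    def flush(self):
        """On a misprediction, point every register back at the RRF."""
        self.table = {r: ("RRF", r) for r in self.REGS}
```

Renaming `[("EAX", "EBX", "ECX"), ("EDX", "EAX", None)]` in one cycle makes the second μop read EAX from the ROB entry just allocated to the first, rather than from the RRF.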
12
Partial Stalls due to RAT
• Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of the larger (e.g. 32-bit) register
  – The read needs different partial pieces from multiple physical registers
• Partial flags stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction writes
Partial register stalls (write AX, then read EAX):
    MOV AX, m16
    ADD EAX, m32     ; stall

Idiom fix (1):
    XOR EAX, EAX
    MOV AL, m8
    ADD EAX, m32     ; no stall

Idiom fix (2):
    SUB EAX, EAX
    MOV AL, m8
    ADD EAX, m32     ; no stall

Partial flag stalls (1):
    CMP EAX, EBX
    INC ECX
    JBE XX           ; stall
JBE reads both ZF and CF, while INC writes ZF, OF, SF, AF, and PF but not CF; of the flags JBE needs, INC supplies only ZF.

Partial flag stalls (2):
    TEST EBX, EBX
    LAHF             ; stall
LAHF loads the low byte of EFLAGS, while TEST writes only some of those flags.
13
Partial Register Width Renaming
• 32/16-bit accesses:
  – Read from the low bank (AL/BL/CL/DL; AX/BX/CX/DX; EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP)
  – Write to both banks (the high bank covers AH/BH/CH/DH)
• 8-bit RAT accesses: update only the bank that is being written
[Figure: RAT datapath (in-order queue, logical sources, FP RAT array with FP TOS adjust, integer and FP overrides, physical ROB pointers from the Allocator), with the Integer RAT array split into two banks:
 INT Low Bank (32b/16b/L): 8 entries
 INT High Bank (H): 4 entries
 Entry fields: Size (2 bits), RRF (1 bit), PSrc (6 bits)]
Example μop sequence:
    μop0: MOV AL = (a)
    μop1: MOV AH = (b)
    μop2: ADD AL = (c)
    μop3: ADD AH = (d)
14
Allocator (ALLOC)
• The interface between the in-order and out-of-order pipelines
• Allocates into ROB, MOB and RS
  – “3-or-none” μops per cycle into ROB and RS
    • Must have 3 free ROB entries or no allocation
  – “all-or-none” policy for MOB
    • Stall allocation when not all the valid MOB μops can be allocated
• Generates the physical destination token Pdst from the ROB and passes it to the Register Alias Table (RAT) and RS
• Stalls upon shortage of resources
15
Reservation Stations (RS)
• Gateway to execution: dispatches up to 5 μops per cycle, one per port
• Port binding at dispatch time (certain μops can only be bound to one port)
• 20-entry μop buffer bridging the in-order and out-of-order engines (32 entries in Core)
• RS fields include μop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
• Oldest-first (FIFO) scheduling when multiple μops are ready in the same cycle
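[Figure: RS dispatch ports and execution units.
 Port 0: IEU0, Fadd, Fmul, Imul, Div
 Port 1: IEU1, JEU
 Port 2: AGU0 (load address, LDA)
 Port 3: AGU1 (store address, STA)
 Port 4: store data (STD)
 Packed FP units: Pfadd, Pfmul, Pfshuf
 Other labels: WB bus 0 and WB bus 1 back to the RS; load and store addresses to the MOB and DCU; loaded data returned from the DCU; retired data from the ROB to the RRF.]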
16
ReOrder Buffer (ROB)
• A 40-entry circular buffer (96-entry in Core)
  – 157 bits wide
  – Provides 40 alias physical registers
• Out-of-order completion
• Deposit exception in each entry
• Retirement (or de-allocation)
  – After resolving prior speculation
  – Handle exceptions through the MS
  – Clear OOO state when a mispredicted branch or exception is detected
  – 3 μops per cycle in program order
  – For multi-μop x86 instructions: none or all (atomic)
[Figure: ALLOC, RAT, and RS connect to the ROB; the ROB retires into the RRF and invokes the MS for exceptions and code assists.]
17
Memory Execution Cluster
• Manage data memory accesses
• Address translation
• Detect violations of access ordering
• Fill buffers (FB) in the DCU, similar to MSHRs, for non-blocking cache support
[Figure: memory cluster. The RS / ROB issue LD, STA, and STD μops; LD and STA go through the DTLB to the DCU; the cluster also contains the load buffer, store buffer, fill buffers (FB), and the EBL.]

Example:
    movl ecx, edi
    addl ecx, 8
    movl -4(edi), ebx
    movl eax, 4(ecx)
The RS cannot detect a possible memory dependence between the store and the load, and could dispatch them at the same time.
18
Memory Order Buffer (MOB)
• Allocated by ALLOC
• A second-order RS for memory operations
• 1 μop for a load; 2 μops for a store: Store Address (STA) and Store Data (STD)
• MOB
  – 16-entry load buffer (LB) (32-entry in Core, 64 in Sandy Bridge)
  – 12-entry store address buffer (SAB) (20-entry in Core, 36 in Sandy Bridge)
  – SAB works in unison with
    • Store data buffer (SDB) in MIU
    • Physical Address Buffer (PAB) in DCU
  – Store Buffer (SB): SAB + SDB + PAB
• Senior stores
  – Once the STD/STA have retired from the ROB, the SB marks the store “senior”
  – Senior stores are committed back to memory in program order when the bus is idle or the SB is full
• Prefetch instructions in P-III
  – Senior load behavior
  – Due to no explicit architectural destination
• New memory dependency predictor in Core to predict store-to-load dependencies
19
Store Coloring
• ALLOC assigns Store Buffer IDs (SBIDs) in program order
• ALLOC tags loads with the most recent SBID
• Check loads against stores with equal or older SBIDs for potential address conflicts
• SDB forwards data if a conflict is detected
x86 instruction       μops          Store color
mov (0x1220), ebx     std ebx       2
                      sta 0x1220    2
mov (0x1110), eax     std eax       3
                      sta 0x1100    3
mov ecx, (0x1220)     ld 0x1220     3
mov edx, (0x1280)     ld 0x1280     3
mov (0x1400), edx     std edx       4
                      sta 0x1400    4
mov edx, (0x1380)     ld 0x1380     4
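The coloring in the table can be reproduced with a short sketch. It assumes that the last store allocated before this sequence had SBID 1, uses the addresses from the sta and ld μops, ignores SBID wrap-around, and compares each load only against stores whose SBID does not exceed the load's color (that is, stores that precede it in program order). The function names `assign_colors` and `forwarding_store` are hypothetical.

```python
def assign_colors(program, last_sbid=1):
    """Tag each memory op with a store color.

    program: list of ("st", addr) or ("ld", addr) in program order.
    A store gets a fresh SBID; a load inherits the most recent SBID.
    """
    colored = []
    for kind, addr in program:
        if kind == "st":
            last_sbid += 1
        colored.append((kind, addr, last_sbid))
    return colored


def forwarding_store(colored, load_index):
    """Return the SBID of the youngest earlier store to the load's address."""
    _, addr, color = colored[load_index]
    best = None
    for kind, st_addr, sbid in colored:
        if kind == "st" and sbid <= color and st_addr == addr:
            best = sbid if best is None else max(best, sbid)
    return best  # None means no conflict, so the load reads the cache


program = [("st", 0x1220), ("st", 0x1100), ("ld", 0x1220),
           ("ld", 0x1280), ("st", 0x1400), ("ld", 0x1380)]
colored = assign_colors(program)
```

Here the load from 0x1220 carries color 3 and matches the store with SBID 2, so the SDB would forward that store's data; the load from 0x1280 matches nothing.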
20
Memory Type Range Registers (MTRR)
• Control registers written by the system (OS)
• Supported memory types
  – UnCacheable (UC)
  – Uncacheable Speculative Write-Combining (USWC or WC)
    • Uses a fill buffer entry as the WC buffer
  – WriteBack (WB)
  – Write-Through (WT)
  – Write-Protected (WP)
    • E.g. supports copy-on-write in UNIX: saves memory by letting child processes share pages with their parents, creating new pages only when a child process attempts to write
• Page Miss Handler (PMH)
  – Looks up MTRR while supplying physical addresses
  – Returns memory types and physical address to DTLB
21
Intel NetBurst Microarchitecture
• Pentium 4’s microarchitecture
• Original target market: graphics workstations, but …
• Design goals:
  – Performance, performance, performance, …
  – Unprecedented multimedia/floating-point performance
    • Streaming SIMD Extensions 2 (SSE2)
    • SSE3 introduced in Prescott Pentium 4 (90nm)
  – Reduced CPI
    • Low-latency instructions
    • High-bandwidth instruction fetching
    • Rapid execution of arithmetic & logic operations
  – Reduced clock period
    • New pipeline designed for scalability
22
Innovations Beyond P6
• Hyperpipelined technology
• Streaming SIMD Extension 2
• Hyper-threading Technology (HT)
• Execution trace cache
• Rapid execution engine
• Staggered adder unit
• Enhanced branch predictor
• Indirect branch predictor (also in Banias Pentium M)
• Load speculation and replay
23
Pentium 4 Fact Sheet
• IA-32, fully backward compatible
• Available at speeds ranging from 1.3 to ~3.8 GHz
• Hyperpipelined (20+ stages)
• 125 million transistors in Prescott (1.328 billion in Tulsa, 65nm, with 16MB on-die L3)
• 0.18 μm for 1.3 to 2GHz; 0.13 μm for 1.8 to 3.4GHz; 90nm for 2.8GHz to 3.6GHz
• Die size of 122 mm² (Prescott, 90nm), 435 mm² (Tulsa, 65nm)
• Consumes 115 watts of power at 3.6GHz
• 1066MHz system bus
• Prescott L1: 16KB, 8-way vs. previous P4’s 8KB, 4-way
• 1MB, 512KB or 256KB 8-way full-speed on-die L2 (B/W example: 89.6 GB/s @2.8GHz to L1)
• 2MB L3 cache (in P4 HT Extreme Edition, 0.13 μm only), 16MB in Tulsa
• 144 new 128-bit SIMD instructions (SSE2), and 13 SSE3 instructions in Prescott
• HyperThreading Technology (not in all versions)
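The 89.6 GB/s figure is peak bandwidth: bus width in bytes times transfer rate. A quick check, assuming the 256-bit L2-to-L1 path shown in the Prescott block diagram and 1 GB/s = 1000 MB/s (the function name `peak_bandwidth_mb_s` is illustrative):

```python
def peak_bandwidth_mb_s(bus_width_bits, transfers_mhz):
    """Peak bandwidth in MB/s = bytes per transfer * million transfers per second."""
    return (bus_width_bits // 8) * transfers_mhz


l2_to_l1 = peak_bandwidth_mb_s(256, 2800)  # 32 bytes per core clock at 2.8GHz
system_bus = peak_bandwidth_mb_s(64, 800)  # 64-bit bus, quad-pumped to 800 MT/s
```

The first value is 89,600 MB/s (89.6 GB/s); the same formula gives 6,400 MB/s (6.4 GB/s) for a 64-bit, 800MHz quad-pumped system bus.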
24
Building Blocks of Netburst
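[Figure: NetBurst building blocks.
 Memory subsystem: system bus, Bus Unit, Level 2 Cache, L1 Data Cache
 Front-end: Fetch/Decode, ETC, μROM, BTB / branch prediction
 Out-of-Order Engine: OOO logic, Retire
 Execution Units: INT and FP execution units
 Retirement sends branch history updates back to the front-end.]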
25
Pentium 4 Microarchitecture (Prescott)
[Figure: Prescott block diagram.
 Front end: BTB (4k entries), I-TLB/Prefetcher, IA32 Decoder, Execution Trace Cache (12K μops), Trace Cache BTB (2k entries), μCode ROM, μop Queue
 Rename and queues: Allocator / Register Renamer, Memory μop Queue, INT / FP μop Queue
 Schedulers: Memory scheduler, Fast, Slow/General FP scheduler, Simple FP
 Register files: INT Register File / Bypass Network, FP RF / Bypass Network
 Execution units: AGU (load address), AGU (store address), 2x ALU (simple instructions), 2x ALU (simple instructions), Slow ALU (complex instructions), FP / MMX / SSE/2/3, FP Move
 L1 Data Cache: 16KB, 8-way, 64-byte line, WT, 1 read + 1 write port
 Unified L2 Cache: 1MB, 8-way, 128B line, WB; 256-bit path, 108 GB/s
 BIU: 64-bit path to the quad-pumped 800MHz, 64-bit system bus (6.4 GB/sec)]
26
Pipeline Depth Evolution
• P5 microarchitecture: PREF, DEC, DEC, EXEC, WB
• P6 microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
• NetBurst microarchitecture (Willamette), 20 stages: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive
• NetBurst microarchitecture (Prescott): > 30 stages
27
Execution Trace Cache
• Primary first-level I-cache, replacing the conventional L1
  – Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  – Branch misprediction penalty is considerable
• Advantages
  – Caches post-decode μops (think about fill unit)
  – High-bandwidth instruction fetching
  – Eliminates x86 decoding overheads
  – Reduces branch recovery time if TC hits
• Holds up to 12,000 μops
  – 6 μops per trace line
  – Many (?) trace lines in a single trace
28
Execution Trace Cache
• Delivers 3 μops per cycle to the OOO engine if branch prediction is good
• x86 instructions are read from L2 when the TC misses (7+ cycle latency)
• TC hit rate is comparable to that of an 8KB to 16KB conventional I-cache
• Simplified x86 decoder
  – Only one complex instruction per cycle
  – Instructions of more than 4 μops are executed by the microcode ROM (P6’s MS)
• Performs branch prediction in TC
  – 512-entry BTB + 16-entry RAS
  – Together with the BP in the x86 IFU, reduces mispredictions by 33% compared to P6
  – Intel did not disclose the details of the BP algorithms used in the TC and x86 IFU (dynamic + static)
29
Out-Of-Order Engine
• Similar design philosophy to P6; uses
  – Allocator
  – Register Alias Table
  – 128 physical registers
  – 126-entry ReOrder Buffer
  – 48-entry load buffer
  – 24-entry store buffer
30
Register Renaming Schemes
[Figure: two renaming schemes side by side.
 P6 register renaming: the RAT (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) points into the 40-entry ROB, whose entries hold both data and status and are allocated sequentially; the RRF holds retired state.
 NetBurst register renaming: a front-end RAT and a retirement RAT (each with EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) point into a 128-entry register file (RF) that holds data; the 126-entry ROB holds status and is allocated sequentially.]
31
Micro-op Scheduling
• μop FIFO queues
  – Memory queue for loads and stores
  – Non-memory queue
• μop schedulers
  – Several schedulers fire instructions from the 2 μop queues to execution (P6’s RS)
  – 4 distinct dispatch ports
  – Maximum dispatch: 6 μops per cycle (2 fast-ALU μops each from Ports 0 and 1 per cycle; 1 each from the load and store ports)

Dispatch port    Unit                    Operations
Exec Port 0      Fast ALU (2x pumped)    Add/sub, Logic, Store Data, Branches
                 FP Move                 FP/SSE Move, FP/SSE Store, FXCH
Exec Port 1      Fast ALU (2x pumped)    Add/sub
                 INT Exec                Shift, Rotate
                 FP Exec                 FP/SSE Add, FP/SSE Mul, FP/SSE Div, MMX
Load Port        Memory Load             Loads, LEA, Prefetch
Store Port       Memory Store            Stores
32
Data Memory Accesses
• Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128B line
• Load-to-use speculation
  – Dependent instructions are dispatched before the load finishes
    • Due to the high frequency and deep pipeline
    • The path from load scheduler to execution is longer than execution itself
  – Scheduler assumes loads always hit L1
  – On an L1 miss, dependent instructions that have already left the scheduler temporarily receive incorrect data: mis-speculation
  – Replay logic
    • Re-executes the load when mis-speculated
    • Mis-speculated operations are placed into a replay queue to be re-dispatched
  – All trailing independent instructions are allowed to proceed
  – Tornado breaker
• Up to 4 outstanding load misses (= 4 fill buffers in original P6)
• Store-to-load forwarding buffer
  – 24 entries
  – Forwarding requires the store and the load to have the same starting physical address
  – Load data size <= store data size
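The two forwarding conditions can be written as a simple predicate. This is a sketch only: the names `can_forward` and `find_forwarding_store`, and the representation of the buffer as a list of `(address, size)` pairs ordered oldest to youngest, are assumptions for illustration.

```python
def can_forward(store_addr, store_size, load_addr, load_size):
    """Forwarding needs the same starting address and a load no wider than the store."""
    return store_addr == load_addr and load_size <= store_size


def find_forwarding_store(store_buffer, load_addr, load_size):
    """Return the index of the youngest older store that can forward, or None.

    store_buffer: list of (addr, size) for older stores, oldest first
    (at most 24 entries in the design described above).
    """
    for i in range(len(store_buffer) - 1, -1, -1):
        addr, size = store_buffer[i]
        if can_forward(addr, size, load_addr, load_size):
            return i
    return None
```

A 4-byte store to 0x1000 can forward to a 2-byte load from 0x1000, but not to a 2-byte load from 0x1002 (different starting address) or to an 8-byte load from 0x1000 (load wider than the store).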
33
Fast Staggered ALU
• For frequent ALU instructions (no multiply, no shift, no rotate, no branch processing)
• Double-pumped clocks
• Each operation finishes in 3 fast cycles
  – Lower-order 16 bits and bypass
  – Higher-order 16 bits and bypass
  – ALU flags generation
[Figure: three staggered stages: Bit[15:0], then Bit[31:16], then Flags.]
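A functional sketch of the staggered add, assuming 32-bit unsigned operands. It models only the order in which the result halves and flags become available, not timing or bypassing; the function name `staggered_add` and the subset of flags returned are illustrative choices.

```python
def staggered_add(a, b):
    """Add two 32-bit values in three steps, mirroring the three fast cycles."""
    # Fast cycle 1: low 16 bits, producing a carry into the upper half.
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    carry = lo >> 16
    lo &= 0xFFFF

    # Fast cycle 2: high 16 bits plus the carry from the low half.
    hi = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry
    carry_out = hi >> 16
    hi &= 0xFFFF

    # Fast cycle 3: flags, which need the complete 32-bit result.
    result = (hi << 16) | lo
    flags = {"CF": carry_out, "ZF": int(result == 0), "SF": result >> 31}
    return result, flags
```

Because the low half is ready one fast cycle early, a dependent add can start on its own low half while this operation is still computing its high half.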
34
Branch Predictor
• P4 uses the same hybrid predictor as Pentium M
[Figure: bimodal, local, and global predictors produce Pred_B, Pred_L, and Pred_G; two MUXes controlled by L_hit and G_hit select the final prediction.]
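The two-MUX selection in the figure can be sketched as below. The priority order (global over local over bimodal) is an assumption read off the MUX arrangement, and the function name `hybrid_predict` is hypothetical.

```python
def hybrid_predict(pred_b, pred_l, l_hit, pred_g, g_hit):
    """Select among bimodal, local, and global predictions.

    First MUX: use the local prediction on a local hit, else bimodal.
    Second MUX: a global hit overrides the result of the first MUX.
    """
    after_local = pred_l if l_hit else pred_b
    return pred_g if g_hit else after_local
```

With no hits the bimodal predictor acts as the default, so every branch always receives some prediction.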
35
Indirect Branch Predictor
• In Pentium M and Prescott Pentium 4
• Prediction based on global history
36
New Instructions over Pentium
• CMOVcc / FCMOVcc r, r/m
  – Conditional move (predicated move) instructions
  – Based on condition code (cc)
• FCOMI/P: compare FP stack and set integer flags
• RDPMC/RDTSC instructions
  – PMC: P6 has 2, NetBurst (P4) has 18
• Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory
37
New Instructions
• SSE2 in Pentium 4 (not P6 microarchitecture)
  – Double-precision SIMD FP
• SSSE3 in Core 2
  – Supplemental instructions for shuffle, align, add, subtract
• Intel 64 (EM64T)
  – 64-bit support, new registers (8 more on top of 8)
  – In Celeron D, Core 2 (and P4 Prescott, Pentium D)
  – Almost compatible with AMD64
  – AMD’s NX bit or Intel’s XD bit for preventing buffer overflow attacks
38
Streaming SIMD Extension 2
• P-III SSE (Katmai New Instructions: KNI)
  – Eight 128-bit wide xmm registers (new architectural state)
  – Single-precision 128-bit SIMD FP
    • Four 32-bit FP operations in one instruction
    • Broken down into 2 μops for execution (only 80-bit data in ROB)
  – 64-bit SIMD MMX (uses 8 mm registers, mapped to the FP stack)
  – Prefetch (nta, t0, t1, t2) and sfence
• P4 SSE2 (Willamette New Instructions: WNI)
  – Supports double-precision 128-bit SIMD FP
    • Two 64-bit FP operations in one instruction
    • Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD, 69 cycles, non-pipelined)
  – Enhanced 128-bit SIMD MMX using xmm registers
39
Examples of Using SSE
[Figure: SSE operations on xmm1 = (X3, X2, X1, X0) and xmm2 = (Y3, Y2, Y1, Y0), highest element first.
 Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = (X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0)
 Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = (X3, X2, X1, X0 op Y0)
 Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): xmm1 = (Y3 .. Y0, Y3 .. Y0, X3 .. X0, X3 .. X0)
 Example SHUFPS xmm1, xmm2, 0xf1: xmm1 = (Y3, Y3, X0, X1)]
40
Examples of Using SSE and SSE2
[Figure: SSE and SSE2 operations, highest element first.
 SSE, with xmm1 = (X3, X2, X1, X0) and xmm2 = (Y3, Y2, Y1, Y0):
   Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = (X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0)
   Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = (X3, X2, X1, X0 op Y0)
   Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): xmm1 = (Y3 .. Y0, Y3 .. Y0, X3 .. X0, X3 .. X0)
   Example SHUFPS xmm1, xmm2, 0xf1: xmm1 = (Y3, Y3, X0, X1)
 SSE2, with xmm1 = (X1, X0) and xmm2 = (Y1, Y0):
   Packed DP FP operation (e.g. ADDPD xmm1, xmm2): xmm1 = (X1 op Y1, X0 op Y0)
   Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): xmm1 = (X1, X0 op Y0)
   Shuffle DP operation with 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2): xmm1 = (Y1 or Y0, X1 or X0)]
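The lane semantics in these examples can be modeled with plain Python lists. In this sketch a register is a list with element 0 as the lowest lane (so xmm1 is `[X0, X1, X2, X3]`), which is the reverse of the figure's left-to-right order; the function names simply mirror the instruction mnemonics and are not a real API.

```python
def addps(x, y):
    """Packed single precision: add all four lanes."""
    return [a + b for a, b in zip(x, y)]


def addss(x, y):
    """Scalar single precision: add lane 0, keep the upper lanes of x."""
    return [x[0] + y[0]] + x[1:]


def shufps(x, y, imm8):
    """Low two result lanes select from x, high two from y, 2 bits each."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [x[sel[0]], x[sel[1]], y[sel[2]], y[sel[3]]]


def shufpd(x, y, imm2):
    """Double precision: bit 0 picks the lane from x, bit 1 the lane from y."""
    return [x[imm2 & 1], y[(imm2 >> 1) & 1]]
```

With `imm8 = 0xf1` (binary 11 11 00 01), `shufps` returns `[X1, X0, Y3, Y3]` lowest lane first, which reads as (Y3, Y3, X0, X1) in the figure's highest-first order.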
41
HyperThreading
• Intel Xeon Processor and Intel Xeon MP Processor
• Enables Simultaneous Multi-Threading (SMT)
  – Exploits ILP through TLP (Thread-Level Parallelism)
  – Issues and executes multiple threads at the same snapshot
• A single P4 w/ HT appears to be 2 logical processors
• Shares the same execution resources
  – dTLB shared, tagged with logical processor ID
  – Some other shared resources are partitioned (next slide)
• Architectural states and some microarchitectural states are duplicated
  – IPs, iTLB, streaming buffer
  – Architectural register file
  – Return stack buffer
  – Branch history buffer
  – Register Alias Table
42
Multithreading (MT) Paradigms
[Figure: issue-slot diagrams (FU1 to FU4 across, execution time down; each slot is filled by Thread 1 to Thread 5 or left unused) comparing:
 Conventional single-threaded superscalar
 Fine-grained multithreading (cycle-by-cycle interleaving)
 Coarse-grained multithreading (block interleaving)
 Chip multiprocessor (CMP), today called multi-core processors
 Simultaneous multithreading (or Intel’s HT)]
43
HyperThreading Resource Partitioning
• TC (or UROM) is accessed by each logical processor in alternating cycles, unless one is stalled due to a TC miss
• μop queue (split into ½) after fetching from TC
• ROB (126/2)
• LB (48/2)
• SB (24/2) (32/2 for Prescott)
• General μop queue and memory μop queue (1/2)
• TLB (½?) as there is no PID
• Retirement: alternating between 2 logical processors