Lec12 Computer Architecture by Hsien-Hsin Sean Lee, Georgia Tech -- P6, NetBurst, Pentium 4

ECE 4100/6100 Advanced Computer Architecture

Lecture 12: P6 and NetBurst Microarchitecture

Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

2

P6 System Architecture

[Figure: P6 system block diagram. Labels:
• Host processor: P6 core, L1 cache (SRAM), back-side bus, L2 cache (SRAM)
• Front-side bus between the host processor and the MCH
• MCH: system memory (DRAM); AGP / PCI Express link to the graphics processor (GPU) with its local frame buffer
• ICH: PCI, USB, I/O
• MCH and ICH together form the chipset]

3

P6 Microarchitecture

[Figure: P6 microarchitecture block diagram. Labels:
• Blocks: BTB/BAC, instruction fetch unit, bus interface unit, instruction decoders, microcode sequencer, Register Alias Table, allocator, reservation station, ROB & retire RF, AGU, MMX, IEU/JEU, FEU, MIU, memory order buffer, data cache unit (L1)
• Clusters: instruction fetch cluster, issue cluster, out-of-order cluster, memory cluster, bus cluster
• The external bus crosses the chip boundary
• Arrow legend: control flow, (restricted) data flow]

4

Pentium III Die Map

• EBL/BBL - External/Backside Bus Logic
• MOB - Memory Order Buffer
• Packed FPU - Floating Point Unit for SSE
• IEU - Integer Execution Unit
• FAU - Floating Point Arithmetic Unit
• MIU - Memory Interface Unit
• DCU - Data Cache Unit (L1)
• PMH - Page Miss Handler
• DTLB - Data TLB
• BAC - Branch Address Calculator
• RAT - Register Alias Table
• SIMD - Packed Floating Point Unit
• RS - Reservation Station
• BTB - Branch Target Buffer
• TAP - Test Access Port
• IFU - Instruction Fetch Unit and L1 I-Cache
• ID - Instruction Decode
• ROB - Reorder Buffer
• MS - Micro-instruction Sequencer

5

P6 Basics

• One implementation of the IA-32 architecture
• Deeply pipelined processor
• In-order front end and back end
• Dynamic execution engine (restricted dataflow)
• Speculative execution
• P6 microarchitecture family processors include
– Pentium Pro
– Pentium II (PPro + MMX + 2x caches)
– Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
– Pentium 4 (not P6, discussed separately)
– Pentium M (+ SSE2, SSE3, μop fusion)
– Core (PM + SSSE3, SSE4, Intel 64 (EM64T), macro-op fusion, 4 μops retired per cycle vs. 3 in earlier family members)

6

P6 Pipelining

[Figure: P6 pipeline stage diagram. Recoverable labels:
• In-order front end (stages 11-17 and 20-22): Next IP, I-Cache, ILD, Rotate, Dec1, Dec2, Br Dec, IDQ, RAT, RS Write
• Single-cycle pipeline (stages 31-33): RS schd, RS Disp, Exec / WB
• Multi-cycle pipeline (stages 31-33, then Exec2 ... Exec n)
• Non-blocking memory pipeline (stages 31-33, 42-43): AGU, DCache1, DCache2
• Blocking memory pipeline (stages 31-33, 40-43): AGU, MOB blk, MOB wr, MOB disp, DCache1, DCache2, MOB wakeup
• Writeback stages: 81: Mem/FP WB, 82: Int WB schedule, 83: Data WB
• Retirement (stages 91-93): Ret ptr wr, Ret ROB rd, RRF wr
• Marked boundaries and delays: FE in-order boundary, retirement in-order boundary, RS scheduling delay, ROB scheduling delay, MOB scheduling delay
• Summary view: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2]

7

Instruction Fetching Unit

• IFU1: Initiate fetch, requesting 16 bytes at a time
• IFU2: Instruction length decoder marks instruction boundaries; BTB makes prediction (2 cycles)
• IFU3: Align instructions to the 3 decoders in 4-1-1 format

[Figure: instruction fetch unit datapath. Labels: Next PC mux, Branch Target Buffer, other fetch requests, linear address, Instruction TLB (physical address out), Instruction Cache, Victim Cache, Streaming Buffer (data, addr), select mux, ILD (length marks), prediction marks, instruction rotator, #bytes consumed by ID, instruction buffer.]

8

Static Branch Prediction (stage 17, Br Dec, of slide 6)

[Figure: decision flowchart, written out as rules]
• BTB hit: follow the BTB dynamic predictor's decision
• BTB miss, not PC-relative:
– Return: Taken
– Otherwise (indirect jump): Taken
• BTB miss, PC-relative, not conditional (unconditional PC-relative): Taken
• BTB miss, PC-relative, conditional:
– Backwards: Taken
– Forwards: Not Taken
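The static rules on this slide can be written as one small function. This is a minimal sketch, not Intel's logic: the `Branch` record and the `static_predict` name are hypothetical, and the branch attributes are assumed to be known at the Br Dec stage.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    # Hypothetical description of a decoded branch (illustration only).
    pc_relative: bool        # target is an offset from the instruction pointer
    conditional: bool        # Jcc-style branch
    backward: bool           # target address is lower than the branch address
    is_return: bool = False  # RET instruction


def static_predict(btb_hit: bool, btb_prediction: bool, br: Branch) -> bool:
    """Return True for 'taken', following the decision chart on this slide."""
    if btb_hit:
        return btb_prediction   # the dynamic predictor decides
    if not br.pc_relative:
        return True             # return or indirect jump: predicted taken
    if not br.conditional:
        return True             # unconditional PC-relative: taken
    return br.backward          # conditional: backward taken, forward not taken
```

Usage: a loop-closing branch that misses the BTB, `static_predict(False, False, Branch(True, True, True))`, is predicted taken, while the same branch with a forward target is predicted not taken.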

9

Dynamic Branch Prediction

• Similar to a 2-level PAs design
• Associated with each BTB entry
• With a 16-entry Return Stack Buffer
• 4 branch predictions per cycle (due to 16-byte fetch per cycle)
• Speculative update (2 copies of BHR)
• Static prediction is provided by the Branch Address Calculator when the BTB misses (see prior slide)

[Figure: 512-entry BTB with ways W0-W3. Each entry holds a Branch History Register (BHR) that indexes a Pattern History Table (PHT, entries 0000 to 1111) of 2-bit saturating counters, which produces the prediction. The branch result Rc updates the counter, and the speculative update path shifts the new (speculative) history into the BHR.]
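The two-level scheme above can be illustrated with a short software model. It is a simplified sketch under stated assumptions: the class name `TwoLevelPredictor` is hypothetical, each branch gets its own pattern table (closer to PAp than to the set-shared PAs tables), counters start at weakly not-taken, and the speculative-update path is not modeled.

```python
class TwoLevelPredictor:
    """Per-branch history register indexing a table of 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.bhr = {}  # branch address -> history bits (newest outcome in bit 0)
        self.pht = {}  # (branch address, history) -> 2-bit counter, 0..3

    def predict(self, pc):
        hist = self.bhr.get(pc, 0)
        counter = self.pht.get((pc, hist), 1)  # assumed initial state: weakly not-taken
        return counter >= 2                    # 2 and 3 mean "taken"

    def update(self, pc, taken):
        hist = self.bhr.get(pc, 0)
        counter = self.pht.get((pc, hist), 1)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.pht[(pc, hist)] = counter
        # Shift the resolved outcome into the history register.
        self.bhr[pc] = ((hist << 1) | int(taken)) & self.mask
```

With a 4-bit history, a branch that repeats a short pattern such as taken, taken, not-taken sees a distinct history before each outcome, so once the counters are trained every outcome in the pattern is predicted correctly.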

10

x86 Instruction Decode

• 4-1-1 decoder: one complex decoder (1-4 μops) and two simple decoders (1 μop each), fed from the instruction buffer (16 bytes)
• Decode rate depends on instruction alignment
• DEC1: translate x86 instructions into micro-operations (μops)
• DEC2: move decoded μops to the ID queue
• The micro-instruction sequencer (MS) performs translation in one of two ways
– Generate the entire μop sequence from the microcode ROM
– Receive 4 μops from the complex decoder, and the rest from the microcode ROM
• Instructions that follow an instruction needing the MS are flushed
• The instruction decoder queue holds 6 μops and feeds RAT/ALLOC

Next 3 inst    # inst to decode
S,S,S          3
S,S,C          First 2
S,C,S          First 1
S,C,C          First 1
C,S,S          3
C,S,C          First 2
C,C,S          First 1
C,C,C          First 1

S: Simple, C: Complex
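The decode-rate table follows from one rule: only the first decoder accepts a complex instruction, so decoding stops at the first complex instruction that lands on decoder 1 or 2. A minimal sketch of that rule (the function name `num_decoded` is hypothetical):

```python
def num_decoded(window):
    """window: 'S'/'C' flags for the next instructions, oldest first.

    Returns how many of the first three decode this cycle under 4-1-1.
    """
    count = 0
    for slot, kind in enumerate(window[:3]):
        if slot > 0 and kind == "C":
            break  # decoders 1 and 2 handle only simple instructions
        count += 1
    return count
```

For example, `num_decoded("CSC")` returns 2 and `num_decoded("SCS")` returns 1, matching the "First 2" and "First 1" rows.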

11

Register Alias Table (RAT)

• Register renaming for 8 integer registers, 8 floating-point (stack) registers, and flags: 3 μops per cycle
• 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc)
• RAT looks up physical ROB locations for renamed sources based on the RRF bit
• Override logic handles dependent μops decoded in the same cycle
• A misprediction reverts all pointers to point to the Retirement Register File (RRF)

[Figure: RAT datapath. Logical sources from the in-order queue index the integer RAT array and the FP RAT array (with FP TOS adjust). The array physical sources (PSrc) pass through the integer and FP overrides to become the RAT PSrc's. The allocator supplies the physical ROB pointers. Renaming example: each RAT entry (EAX, EBX, ECX, EDX) holds an RRF bit and a PSrc that selects either the ROB or the RRF.]
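A small model can make the RRF bit and the override logic concrete. This is a sketch under simplifying assumptions, not the P6 circuit: the `RAT` class and its method names are hypothetical, the hardware looks up all three μops in parallel and patches same-cycle dependences with override muxes (here a sequential loop gives the same mapping), and ROB entries are never freed.

```python
ROB_SIZE = 40  # the ROB doubles as the physical register file


class RAT:
    def __init__(self):
        self.map = {}      # logical register -> ROB index; absent means "read the RRF"
        self.next_rob = 0  # the allocator hands out ROB entries sequentially

    def rename_group(self, uops):
        """uops: list of (dst, [srcs]) decoded in the same cycle, in program order."""
        renamed = []
        for dst, srcs in uops:
            # A later uop in the group sees the mapping written by an earlier
            # one, which is the case the override logic handles in hardware.
            psrcs = [("ROB", self.map[s]) if s in self.map else ("RRF", s) for s in srcs]
            pdst = self.next_rob
            self.next_rob = (self.next_rob + 1) % ROB_SIZE
            self.map[dst] = pdst
            renamed.append((pdst, psrcs))
        return renamed

    def flush(self):
        """Misprediction: every logical register points back at the RRF."""
        self.map.clear()
```

Renaming `[("EAX", ["EBX"]), ("ECX", ["EAX"])]` in one group makes the second μop read EAX from the ROB entry just allocated to the first, rather than from the RRF.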

12

Partial Stalls due to RAT

• Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of the larger (e.g. 32-bit) register
– The read would need to combine partial pieces from multiple physical registers
• Partial flag stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction writes

Partial register stall (AX written, then EAX read):
    MOV AX, m16
    ADD EAX, m32    ; stall

Idiom fix (1):
    XOR EAX, EAX
    MOV AL, m8
    ADD EAX, m32    ; no stall

Idiom fix (2):
    SUB EAX, EAX
    MOV AL, m8
    ADD EAX, m32    ; no stall

Partial flag stall (1):
    CMP EAX, EBX
    INC ECX
    JBE XX          ; stall
JBE reads both ZF and CF, while INC writes ZF, OF, SF, AF, PF (of the two flags JBE needs, only ZF).

Partial flag stall (2):
    TEST EBX, EBX
    LAHF            ; stall
LAHF loads the low byte of EFLAGS, while TEST writes only some of those flags.

13

Partial Register Width Renaming

• 32/16-bit accesses:
– Read from the low bank
– Write to both banks
• 8-bit RAT accesses: depend on which bank is being written, and update only that bank
• Low bank: AL/BL/CL/DL; AX/BX/CX/DX; EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP
• High bank: AH/BH/CH/DH

[Figure: the RAT datapath of the previous slide, with the integer RAT array split into an INT low bank (32b/16b/L, 8 entries) and an INT high bank (H, 4 entries). Each entry holds Size (2 bits), RRF (1 bit), and PSrc (6 bits).]

Example μop sequence from the figure:
    μop0: MOV AL = (a)
    μop1: MOV AH = (b)
    μop2: ADD AL = (c)
    μop3: ADD AH = (d)

14

Allocator (ALLOC)

• The interface between the in-order and out-of-order pipelines
• Allocates into ROB, MOB and RS
– "3-or-none" μops per cycle into ROB and RS
  • Must have 3 free ROB entries or no allocation
– "all-or-none" policy for MOB
  • Stall allocation when not all the valid MOB μops can be allocated
• Generates the physical destination token Pdst from the ROB and passes it to the Register Alias Table (RAT) and RS
• Stalls upon shortage of resources
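The two allocation policies can be expressed as a single check per cycle. This is a rough sketch: the function name, the μop kind strings, and the choice to count one store-buffer entry per STA are all assumptions made for illustration.

```python
def can_allocate(free_rob, free_rs, free_lb, free_sb, uops):
    """uops: kinds of the (up to 3) uops offered this cycle,
    each one of 'load', 'sta', 'std', or 'other'."""
    # "3-or-none": without 3 free ROB and RS entries, nothing is allocated.
    if free_rob < 3 or free_rs < 3:
        return False
    # "all-or-none" for the MOB: every memory uop must get its buffer entry.
    loads = sum(1 for u in uops if u == "load")
    stores = sum(1 for u in uops if u == "sta")  # assumption: one SB entry per STA
    return loads <= free_lb and stores <= free_sb
```

Note that a single offered μop still stalls when only two ROB entries are free, which is the cost of the simpler "3-or-none" rule.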

15

Reservation Stations (RS)

• Gateway to execution: binds up to 5 μops per cycle, one to each port
• Port binding at dispatch time (certain μops can only be bound to one port)
• 20-entry μop buffer bridging the in-order and out-of-order engines (32 entries in Core)
• RS fields include μop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
• Oldest-first FIFO scheduling when multiple μops are ready in the same cycle

[Figure: RS dispatch ports.
• Port 0: IEU0, Fadd, Fmul, Imul, Div
• Port 1: IEU1, JEU
• Port 2: AGU0 (load address, LDA)
• Port 3: AGU1 (store address, STA)
• Port 4: store data (STD)
• Also shown: Pfadd, Pfmul, and Pfshuf units; MOB and DCU behind the memory ports; loaded data returning to the RS; WB bus 0 and WB bus 1 into the ROB; retired data into the RRF]

16

ReOrder Buffer (ROB)

• A 40-entry circular buffer (96-entry in Core)
– 157 bits wide
– Provides 40 alias physical registers
• Out-of-order completion
• Deposits exceptions in each entry
• Retirement (or de-allocation)
– After resolving prior speculation
– Handles exceptions through the MS
– Clears OOO state when a mispredicted branch or exception is detected
– 3 μops per cycle in program order
– For multi-μop x86 instructions: none or all (atomic)

[Figure: ALLOC, RAT, and RS connect to the ROB; retired results move from the ROB to the RRF; the MS provides (exception) code assist.]

17

Memory Execution Cluster

• Manages data memory accesses
• Address translation
• Detects violations of access ordering
• Fill buffers (FB) in the DCU, similar to MSHRs, for non-blocking cache support

[Figure: memory cluster. The RS / ROB issue LD, STA, and STD μops; LD and STA pass through the DTLB to the DCU; the cluster also contains the load buffer, store buffer, fill buffers (FB), and EBL.]

Example:
    movl ecx, edi
    addl ecx, 8
    movl -4(edi), ebx
    movl eax, 4(ecx)
The RS cannot detect a possible dependence between the two memory operations and could dispatch them at the same time.

18

Memory Order Buffer (MOB)

• Allocated by ALLOC
• A second-order RS for memory operations
• 1 μop for a load; 2 μops for a store: Store Address (STA) and Store Data (STD)
• MOB
– 16-entry load buffer (LB) (32-entry in Core, 64 in Sandy Bridge)
– 12-entry store address buffer (SAB) (20-entry in Core, 36 in Sandy Bridge)
– SAB works in unison with
  • Store data buffer (SDB) in MIU
  • Physical Address Buffer (PAB) in DCU
– Store Buffer (SB): SAB + SDB + PAB
• Senior stores
– When STD/STA retire from the ROB, the SB marks the store "senior"
– Senior stores are committed back to memory in program order when the bus is idle or the SB is full
• Prefetch instructions in P-III
– Senior load behavior
– Due to no explicit architectural destination
• New memory dependency predictor in Core to predict store-to-load dependencies

19

Store Coloring

• ALLOC assigns Store Buffer IDs (SBID) in program order
• ALLOC tags loads with the most recent SBID
• Check loads against stores with equal or older SBIDs for potential address conflicts
• SDB forwards data if a conflict is detected

x86 instruction        μops          store color
mov (0x1220), ebx      std ebx       2
                       sta 0x1220    2
mov (0x1110), eax      std eax       3
                       sta 0x1110    3
mov ecx, (0x1220)      ld 0x1220     3
mov edx, (0x1280)      ld 0x1280     3
mov (0x1400), edx      std edx       4
                       sta 0x1400    4
mov edx, (0x1380)      ld 0x1380     4
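The coloring in the table can be reproduced with a few lines of code. This is an illustrative sketch: `color_uops`, `forwarding_store`, and the `last_sbid` starting value are hypothetical names, and it assumes a load only needs to compare against stores whose SBID is less than or equal to its own color, that is, stores that precede it in program order.

```python
def color_uops(ops, last_sbid=1):
    """ops: list of (kind, address) in program order, kind is 'store' or 'load'.

    Returns (kind, address, color) tuples. A store takes the next SBID;
    a load is tagged with the most recent SBID.
    """
    colored = []
    for kind, addr in ops:
        if kind == "store":
            last_sbid += 1
        colored.append((kind, addr, last_sbid))
    return colored


def forwarding_store(colored, load_index):
    """Return the SBID of the youngest earlier store to the load's address, or None."""
    kind, addr, color = colored[load_index]
    if kind != "load":
        raise ValueError("load_index must point at a load")
    matches = [c for k, a, c in colored if k == "store" and c <= color and a == addr]
    return max(matches) if matches else None
```

Fed the six memory operations from the table, the colors come out as 2, 3, 3, 3, 4, 4; the load from 0x1220 finds the store with SBID 2 as its forwarding candidate, while the load from 0x1280 finds none.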

20

Memory Type Range Registers (MTRR)

• Control registers written by the system (OS)
• Supported memory types
– UnCacheable (UC)
– Uncacheable Speculative Write-Combining (USWC or WC)
  • Uses a fill buffer entry as the WC buffer
– WriteBack (WB)
– Write-Through (WT)
– Write-Protected (WP)
  • E.g. supports copy-on-write in UNIX: saves memory by letting child processes share pages with their parents, and creates new memory pages only when a child process attempts to write
• Page Miss Handler (PMH)
– Looks up the MTRR while supplying physical addresses
– Returns memory types and the physical address to the DTLB

21

Intel NetBurst Microarchitecture

• Pentium 4's microarchitecture
• Original target market: graphics workstations, but …
• Design goals:
– Performance, performance, performance, …
– Unprecedented multimedia/floating-point performance
  • Streaming SIMD Extensions 2 (SSE2)
  • SSE3 introduced in Prescott Pentium 4 (90nm)
– Reduced CPI
  • Low-latency instructions
  • High-bandwidth instruction fetching
  • Rapid execution of arithmetic and logic operations
– Reduced clock period
  • New pipeline designed for scalability

22

Innovations Beyond P6

• Hyperpipelined technology
• Streaming SIMD Extensions 2
• Hyper-Threading Technology (HT)
• Execution trace cache
• Rapid execution engine
• Staggered adder unit
• Enhanced branch predictor
• Indirect branch predictor (also in Banias Pentium M)
• Load speculation and replay

23

Pentium 4 Fact Sheet

• IA-32, fully backward compatible
• Available at speeds ranging from 1.3 to ~3.8 GHz
• Hyperpipelined (20+ stages)
• 125 million transistors in Prescott (1.328 billion in Tulsa, 65nm, with 16MB on-die L3)
• 0.18 μm for 1.3 to 2 GHz; 0.13 μm for 1.8 to 3.4 GHz; 90nm for 2.8 to 3.6 GHz
• Die size of 122 mm² (Prescott, 90nm), 435 mm² (Tulsa, 65nm)
• Consumes 115 watts of power at 3.6 GHz
• 1066MHz system bus
• Prescott L1: 16KB, 8-way vs. previous P4's 8KB, 4-way
• 1MB, 512KB or 256KB 8-way full-speed on-die L2 (B/W example: 89.6 GB/s @ 2.8 GHz to L1)
• 2MB L3 cache (in P4 HT Extreme Edition, 0.13 μm only), 16MB in Tulsa
• 144 new 128-bit SIMD instructions (SSE2), and 13 SSE3 instructions in Prescott
• HyperThreading Technology (not in all versions)

24

Building Blocks of Netburst

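[Figure: NetBurst building blocks.
• Blocks: system bus, bus unit, level 2 cache, L1 data cache, fetch/decode, execution trace cache (ETC), μROM, BTB / branch prediction, OOO logic, retire, execution units
• Groupings: memory subsystem, front end, out-of-order engine, INT and FP execution unit
• A branch history update path feeds back to the BTB / branch predictor]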

25

Pentium 4 Microarchitecture (Prescott)

[Figure: Prescott block diagram. Recoverable labels:
• Front end: BTB (4k entries), I-TLB/prefetcher, IA32 decoder, execution trace cache (12K μops), trace cache BTB (2k entries), code ROM, μop queue
• Allocator / register renamer, feeding the INT/FP μop queue and the memory μop queue
• Schedulers: memory scheduler, fast, slow/general FP scheduler, simple FP
• INT register file / bypass network: AGU (ld addr), AGU (st addr), 2x ALU (simple inst.), 2x ALU (simple inst.), slow ALU (complex inst.)
• FP RF / bypass network: FP/MMX/SSE/2/3, FP move
• L1 data cache: 16KB 8-way, 64-byte line, WT, 1 rd + 1 wr port
• Unified L2 cache: 1MB 8-way, 128B line, WB; 256-bit path, 108 GB/s
• BIU to the 64-bit system bus: quad-pumped 800MHz, 6.4 GB/sec]
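The bandwidth figures on the Prescott diagram are peak numbers of the form width times transfer rate. A quick check, assuming a 64-bit (8-byte) bus at 800 million transfers per second and a 256-bit (32-byte) cache path that moves one transfer per core clock; `peak_gb_per_s` is a hypothetical helper name:

```python
def peak_gb_per_s(bytes_per_transfer, transfers_per_second):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return bytes_per_transfer * transfers_per_second / 1e9


fsb = peak_gb_per_s(8, 800_000_000)         # 64-bit system bus, 800 MT/s
l2_to_l1 = peak_gb_per_s(32, 2_800_000_000)  # 256-bit path at a 2.8 GHz core clock
print(f"FSB: {fsb:.1f} GB/s, L2 to L1: {l2_to_l1:.1f} GB/s")
```

Under these assumptions the system bus works out to 6.4 GB/s, and the 256-bit path at 2.8 GHz gives the fact sheet's 89.6 GB/s; the diagram's 108 GB/s corresponds to the same path at roughly 3.4 GHz.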

26

Pipeline Depth Evolution

• P5 microarchitecture (5 stages): PREF, DEC, DEC, EXEC, WB
• P6 microarchitecture (11 stages): IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
• NetBurst microarchitecture (Willamette), 20 stages: TC Next IP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive (several of these names span more than one stage)
• NetBurst microarchitecture (Prescott): > 30 stages

27

Execution Trace Cache

• Primary first-level I-cache, replacing the conventional L1
– Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
– Branch misprediction penalty is considerable
• Advantages
– Caches post-decode μops (think about a fill unit)
– High-bandwidth instruction fetching
– Eliminates x86 decoding overheads
– Reduces branch recovery time if the TC hits
• Holds up to 12,000 μops
– 6 μops per trace line
– Many (?) trace lines in a single trace

28

Execution Trace Cache

• Delivers 3 μops per cycle to the OOO engine if branch prediction is good
• x86 instructions are read from L2 when the TC misses (7+ cycle latency)
• TC hit rate is comparable to that of an 8KB to 16KB conventional I-cache
• Simplified x86 decoder
– Only one complex instruction per cycle
– Instructions with more than 4 μops are executed by the microcode ROM (P6's MS)
• Performs branch prediction in the TC
– 512-entry BTB + 16-entry RAS
– Together with the BP in the x86 IFU, reduces mispredictions by 33% compared to P6
– Intel did not disclose the details of the BP algorithms used in the TC and x86 IFU (dynamic + static)

29

Out-Of-Order Engine

• Similar design philosophy to P6; uses
– Allocator
– Register Alias Table
– 128 physical registers
– 126-entry ReOrder Buffer
– 48-entry load buffer
– 24-entry store buffer

30

Register Renaming Schemes

[Figure: P6 vs. NetBurst register renaming.
• P6 register renaming: the RAT (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) points into the 40-entry ROB, whose sequentially allocated entries hold both data and status; the RRF sits alongside the ROB
• NetBurst register renaming: a front-end RAT and a retirement RAT point into a separate 128-entry register file (RF) that holds the data; the 126-entry ROB is allocated sequentially and holds status]

31

Micro-op Scheduling

• μop FIFO queues
– Memory queue for loads and stores
– Non-memory queue
• μop schedulers
– Several schedulers fire instructions from the 2 μop queues to execution (P6's RS)
– 4 distinct dispatch ports
– Maximum dispatch: 6 μops per cycle (2 from each fast ALU on Ports 0 and 1; 1 each from the load and store ports)

Port           Unit                    Operations
Exec Port 0    Fast ALU (2x pumped)    Add/sub, Logic, Store Data, Branches
               FP Move                 FP/SSE Move, FP/SSE Store, FXCH
Exec Port 1    Fast ALU (2x pumped)    Add/sub
               INT Exec                Shift, Rotate
               FP Exec                 FP/SSE Add, FP/SSE Mul, FP/SSE Div, MMX
Load Port      Memory Load             Loads, LEA, Prefetch
Store Port     Memory Store            Stores

32

Data Memory Accesses

• Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128B line
• Load-to-use speculation
– Dependent instructions are dispatched before the load finishes
  • Due to the high frequency and deep pipeline
  • The path from load scheduler to execution is longer than execution itself
– Scheduler assumes loads always hit L1
– On an L1 miss, dependent instructions that have left the scheduler temporarily receive incorrect data: mis-speculation
– Replay logic
  • Re-execute the load when mis-speculated
  • Mis-speculated operations are placed into a replay queue to be re-dispatched
– All trailing independent instructions are allowed to proceed
– Tornado breaker
• Up to 4 outstanding load misses (= 4 fill buffers in original P6)
• Store-to-load forwarding buffer
– 24 entries
– Store and load must have the same starting physical address
– Load data size <= store data size

33

Fast Staggered ALU

• For frequent ALU instructions (no multiply, no shift, no rotate, no branch processing)
• Double-pumped clocks
• Each operation finishes in 3 fast cycles
– Lower-order 16 bits and bypass
– Higher-order 16 bits and bypass
– ALU flags generation

[Figure: the three staggered steps: Bit[15:0], Bit[31:16], Flags.]
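The three staggered steps can be mimicked in software to show why the low half is available to a dependent operation one fast cycle before the high half. This is a behavioral sketch only: `staggered_add` is a hypothetical name, and only three flags are computed for illustration.

```python
def staggered_add(a, b):
    """32-bit add performed as two 16-bit halves plus a flags step."""
    # Fast cycle 1: low-order 16 bits; the carry-out feeds the next step.
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    carry = lo >> 16
    # Fast cycle 2: high-order 16 bits plus the carry from the low half.
    hi = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry
    result = ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
    # Fast cycle 3: flags are generated from the complete result.
    flags = {"CF": hi >> 16, "ZF": int(result == 0), "SF": result >> 31}
    return result, flags
```

For instance, `staggered_add(0x0000FFFF, 1)` produces a carry out of the low half that the second step folds into the high half, giving 0x00010000.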

34

Branch Predictor

• P4 uses the same hybrid predictor as Pentium M

[Figure: hybrid predictor. The bimodal, local, and global predictors produce Pred_B, Pred_L, and Pred_G; two MUXes, controlled by L_hit and G_hit, select the final prediction.]
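The two-MUX selection in the figure can be read as a priority chain. The ordering used below (a global hit overrides a local hit, which overrides the bimodal fallback) is an assumption about how the MUXes are wired, not something the slide states; `hybrid_predict` is a hypothetical name.

```python
def hybrid_predict(pred_b, pred_l, l_hit, pred_g, g_hit):
    """Select the final prediction from the three component predictors."""
    # First MUX: the local prediction replaces the bimodal one on a local hit.
    after_local = pred_l if l_hit else pred_b
    # Second MUX: the global prediction wins on a global hit (assumed priority).
    return pred_g if g_hit else after_local
```

When neither tagged predictor hits, the bimodal prediction passes through both MUXes unchanged.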

35

Indirect Branch Predictor

• In Pentium M and Prescott Pentium 4
• Prediction based on global history

36

New Instructions over Pentium

• CMOVcc / FCMOVcc r, r/m
– Conditional move (predicated move) instructions
– Based on condition code (cc)
• FCOMI/P: compare FP stack and set integer flags
• RDPMC/RDTSC instructions
– PMC: P6 has 2, NetBurst (P4) has 18
• Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory

37

New Instructions

• SSE2 in Pentium 4 (not P6 microarchitecture)
– Double-precision SIMD FP
• SSSE3 in Core 2
– Supplemental instructions for shuffle, align, add, subtract
• Intel 64 (EM64T)
– 64-bit support, new registers (8 more on top of 8)
– In Celeron D, Core 2 (and P4 Prescott, Pentium D)
– Almost compatible with AMD64
– AMD's NX bit or Intel's XD bit for preventing buffer overflow attacks

38

Streaming SIMD Extension 2

• P-III SSE (Katmai New Instructions: KNI)
– Eight 128-bit wide xmm registers (new architectural state)
– Single-precision 128-bit SIMD FP
  • Four 32-bit FP operations in one instruction
  • Broken down into 2 μops for execution (only 80-bit data in ROB)
– 64-bit SIMD MMX (uses 8 mm registers, which map to the FP stack)
– Prefetch (nta, t0, t1, t2) and sfence
• P4 SSE2 (Willamette New Instructions: WNI)
– Supports double-precision 128-bit SIMD FP
  • Two 64-bit FP operations in one instruction
  • Throughput: 2 cycles for most SSE2 operations (exceptions include DIVPD and SQRTPD: 69 cycles, non-pipelined)
– Enhanced 128-bit SIMD MMX using xmm registers

39

Examples of Using SSE

[Figure: SSE operations with xmm1 = {X3, X2, X1, X0} and xmm2 = {Y3, Y2, Y1, Y0}; the result is written to xmm1.
• Packed SP FP operation (e.g. ADDPS xmm1, xmm2): {X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0}
• Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): {X3, X2, X1, X0 op Y0}
• Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): {Y3..Y0, Y3..Y0, X3..X0, X3..X0}, each field selected by 2 bits of imm8
• Example: SHUFPS xmm1, xmm2, 0xf1 gives {Y3, Y3, X0, X1}]

40

Examples of Using SSE and SSE2

[Figure: SSE and SSE2 operations side by side; the result is written to xmm1.
SSE, with xmm1 = {X3, X2, X1, X0} and xmm2 = {Y3, Y2, Y1, Y0}:
• Packed SP FP operation (e.g. ADDPS xmm1, xmm2): {X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0}
• Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): {X3, X2, X1, X0 op Y0}
• Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): {Y3..Y0, Y3..Y0, X3..X0, X3..X0}; with imm8 = 0xf1 the result is {Y3, Y3, X0, X1}
SSE2, with xmm1 = {X1, X0} and xmm2 = {Y1, Y0}:
• Packed DP FP operation (e.g. ADDPD xmm1, xmm2): {X1 op Y1, X0 op Y0}
• Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): {X1, X0 op Y0}
• Shuffle DP operation with 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2): {Y1 or Y0, X1 or X0}]
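The SHUFPS selection rule is easy to check with a small emulation. In this sketch lanes are plain Python list elements with index 0 as the lowest lane, and the function name `shufps` is only an illustrative stand-in for the instruction.

```python
def shufps(dst, src, imm8):
    """Emulate SHUFPS dst, src, imm8 on 4-element lists (index 0 = lowest lane).

    The two low result lanes are picked from dst and the two high lanes
    from src, each by a 2-bit field of imm8.
    """
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]


x = ["X0", "X1", "X2", "X3"]
y = ["Y0", "Y1", "Y2", "Y3"]
print(shufps(x, y, 0xF1))  # lowest lane first
```

With imm8 = 0xf1 the fields are 01, 00, 11, 11 from low to high, so the result from the lowest lane up is X1, X0, Y3, Y3, which is the {Y3, Y3, X0, X1} shown on the slide when written highest lane first.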

41

HyperThreading

• Intel Xeon Processor and Intel Xeon MP Processor
• Enables Simultaneous Multi-Threading (SMT)
– Exploit ILP through TLP (thread-level parallelism)
– Issue and execute multiple threads in the same cycle
• A single P4 with HT appears as 2 logical processors
• Logical processors share the same execution resources
– dTLB is shared, with entries tagged by logical processor ID
– Some other shared resources are partitioned (next slide)
• Architectural states and some microarchitectural states are duplicated
– IPs, iTLB, streaming buffer
– Architectural register file
– Return stack buffer
– Branch history buffer
– Register Alias Table

42

Multithreading (MT) Paradigms

[Figure: issue-slot diagrams (functional units FU1-FU4 across, execution time down; each slot is marked Thread 1 to Thread 5 or unused) comparing:
• Conventional single-threaded superscalar
• Fine-grained multithreading (cycle-by-cycle interleaving)
• Coarse-grained multithreading (block interleaving)
• Chip multiprocessor (CMP), today called multi-core processors
• Simultaneous multithreading (or Intel's HT)]

43

HyperThreading Resource Partitioning

• TC (or UROM) is accessed alternately, cycle by cycle, for each logical processor unless one is stalled due to a TC miss
• μop queue (split in ½) after fetch from TC
• ROB (126/2)
• LB (48/2)
• SB (24/2) (32/2 for Prescott)
• General μop queue and memory μop queue (1/2)
• TLB (½?) as there is no PID
• Retirement: alternates between the 2 logical processors
