COMP381 by M. Hamdi
Commercial Superscalar and VLIW Processors
Superscalar Processors
• 0-8 instructions issued per cycle
• Static scheduling: instructions issue in order
• All pipeline hazards are checked
• The pipeline control logic checks for hazards between the instructions already in execution and the instructions in the new issue packet. If a hazard is found, only the instructions that precede the hazardous one in the instruction sequence are issued.
• Complexity of the issue hardware: this stage is pipelined in all dynamic superscalar systems
[Figure labels: Instruction Memory, Issue Packet, Issue HW, Pipeline]
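The in-order issue rule above can be sketched in a few lines. This is a minimal illustration, not the logic of any real processor: the `Instr` record, the register names, and the restriction to RAW and WAW register hazards (structural hazards are ignored) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, any number of sources."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


def has_hazard(new: Instr, earlier: List[Instr]) -> bool:
    """True if `new` has a RAW or WAW register conflict with an earlier instruction."""
    for old in earlier:
        if old.dest is not None and old.dest in new.srcs:  # read after write
            return True
        if old.dest is not None and old.dest == new.dest:  # write after write
            return True
    return False


def issue_in_order(packet: List[Instr], in_flight: List[Instr]) -> List[Instr]:
    """Issue instructions from the packet in order, stopping at the first hazard."""
    issued: List[Instr] = []
    for instr in packet:
        # Check against instructions in execution and those issued this cycle.
        if has_hazard(instr, in_flight + issued):
            break  # nothing after the hazardous instruction may issue
        issued.append(instr)
    return issued


packet = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r1", "r5")),  # needs r1 from add: RAW hazard
    Instr("mul", "r6", ("r7", "r8")),  # independent, but blocked behind sub
]
print([i.name for i in issue_in_order(packet, in_flight=[])])  # ['add']
```

Note that `mul` is independent yet does not issue, because in-order issue stops at the first hazard.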
Basic Superscalar Approach
[Figure: block diagram with Cache/Memory, Fetch Unit, Decode/Issue Unit, three execution units (EU), and Register File; arrows labelled Instruction, Multiple Instruction, Multi Operation]
Pentium 4 Pipeline Stages vs. Pentium 3 Pipeline Stages
Typical P6 Pipeline (10 stages):
1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec
Typical Pentium 4 Pipeline (20 stages):
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 BrCk | 20 Drive
Pentium 3 Pipeline Architecture
• It is a 3-way issue superscalar
• It has 5 execution units (integer ALU, integer multiply, FP multiply, FP add, FP divide)
Pentium 3 Pipeline stages
1 Fetch
2 Fetch
3 Decode
4 Decode
5 Decode
6 Rename registers
7 ROB Rd (reorder buffer read)
8 Rdy/Sch (Scheduling Instructions to be executed)
9 Dispatch
10 Exec
Pentium 4 pipeline stages
Stage Work
1 Trace Cache next instruction pointer
2 Trace Cache next instruction pointer
3 Trace Cache fetch
4 Trace Cache fetch
5 Drive
6 Allocation
7 Rename
8 Rename
9 Queue
10 Schedule
11 Schedule
12 Schedule
13 Dispatch
14 Dispatch
15 Register Files
16 Register Files
17 Execute
18 Flags
19 Branch Check
20 Drive
Increasing the number of pipeline stages shortens each stage, which allows a higher clock frequency
• It took the industry 28 years to hit 1 GHz and only 18 months to reach 2 GHz.
• The price paid for deeper pipelines is that it is very difficult to avoid stalls. (That is why, when the Pentium 4 was introduced, its performance was worse than that of the Pentium 3.)
It is a 5-issue superscalar processor
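The trade-off between clock frequency and stalls can be made concrete with a back-of-the-envelope model. The sketch below assumes a base CPI of 1 and counts only branch misprediction stalls; the branch fraction, misprediction rate, penalties, and clock rates are hypothetical numbers chosen for illustration, not measurements of any Pentium.

```python
def time_per_instr_ns(clock_ghz, branch_frac, mispredict_rate, penalty_cycles):
    """Average time per instruction: (base CPI of 1 + misprediction stalls) / clock."""
    cpi = 1.0 + branch_frac * mispredict_rate * penalty_cycles
    return cpi / clock_ghz


# Hypothetical workload: 20% branches, 10% of them mispredicted.
# Assumption: the misprediction penalty grows with pipeline depth (10 vs. 20 cycles).
shallow = time_per_instr_ns(1.0, 0.20, 0.10, penalty_cycles=10)
deep_same_clock = time_per_instr_ns(1.0, 0.20, 0.10, penalty_cycles=20)
deep_faster_clock = time_per_instr_ns(1.5, 0.20, 0.10, penalty_cycles=20)

print(f"shallow pipeline, 1.0 GHz: {shallow:.3f} ns/instr")
print(f"deep pipeline,    1.0 GHz: {deep_same_clock:.3f} ns/instr")
print(f"deep pipeline,    1.5 GHz: {deep_faster_clock:.3f} ns/instr")
```

Under these assumed numbers the deeper pipeline is slower at the same clock and only wins once its clock advantage outweighs the extra stall cycles.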
[Figure: Pentium 4 block diagram shown with the 20-stage pipeline strip. Blocks: 3.2 GB/s System Interface; L2 Cache and Control; BTB; BTB & I-TLB; Decoder; Trace Cache; Rename/Alloc; uOP Queues; Schedulers; Integer RF; FP RF; uCode ROM; Store AGU; Load AGU; ALU (x4); FP move; FP store; Fmul; Fadd; MMX; SSE; L1 D-Cache and D-TLB.]
TC Nxt IP: Trace cache next instruction pointer. Pointer indicating the location of the next instruction.
TC Fetch: Trace cache fetch. Read the decoded instructions (uOPs).
Drive: Wire delay. Drive the uOPs to the allocator.
Alloc: Allocate resources required for execution. The resources include load buffers, store buffers, etc.
Rename: Register renaming
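As a rough illustration of what register renaming does, the sketch below maps architectural registers onto a larger pool of physical registers. The `Renamer` class, the register names, and the pool size are assumptions for the example; freeing physical registers at retirement is omitted, and this is not a description of the Pentium 4's actual rename hardware.

```python
class Renamer:
    """Toy rename table: architectural register name -> physical register number."""

    def __init__(self, arch_regs, n_phys):
        # Initially architectural register i lives in physical register i.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), n_phys))

    def rename(self, dest, srcs):
        # Sources are looked up first, so an instruction that reads and
        # writes the same register still sees the old value.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; the front end would stall")
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest  # later readers of `dest` use the new register
        return phys_dest, phys_srcs


r = Renamer(["r1", "r2", "r3"], n_phys=6)
print(r.rename("r1", ["r2", "r3"]))  # (3, [1, 2])
print(r.rename("r2", ["r1"]))        # (4, [3])  true dependence on r1 is kept
print(r.rename("r1", ["r3"]))        # (5, [2])  second write to r1 gets its own register
```

The two writes to `r1` land in different physical registers, so the write-after-write conflict between them disappears while the true read-after-write dependence is preserved.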
Que: Write into the uOP queue. uOPs are placed into the queues, where they are held until there is room in the schedulers.
Sch: Schedule. Write into the schedulers and compute dependencies. Watch for dependencies to resolve.
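The idea of holding uOPs until their dependencies resolve can be sketched as a simple wake-up and select loop. The sketch assumes already renamed registers, single-cycle latency for every uOP, and an oldest-first pick; the uOP tuples and the `width` parameter are hypothetical and do not model the Pentium 4 schedulers.

```python
def schedule(uops, ready_regs, width):
    """uops: list of (name, dest, srcs) in program order.

    Returns one list of uOP names per cycle.
    """
    ready = set(ready_regs)
    waiting = list(uops)
    cycles = []
    while waiting:
        # Select: the oldest uOPs whose sources are all ready, up to `width`.
        picked = [i for i, u in enumerate(waiting)
                  if all(s in ready for s in u[2])][:width]
        if not picked:
            raise ValueError("a source register is never produced")
        cycles.append([waiting[i][0] for i in picked])
        # Wake-up: results become visible to dependent uOPs in the next cycle.
        for i in picked:
            ready.add(waiting[i][1])
        waiting = [u for i, u in enumerate(waiting) if i not in picked]
    return cycles


uops = [
    ("load", "p1", ("p0",)),
    ("add",  "p2", ("p1",)),       # waits for load
    ("mul",  "p3", ("p0",)),       # independent of load and add
    ("sub",  "p4", ("p2", "p3")),  # waits for add and mul
]
print(schedule(uops, ready_regs={"p0"}, width=2))
# [['load', 'mul'], ['add'], ['sub']]
```

Here `mul` runs ahead of the older `add`, which is the out-of-order behaviour the scheduler stage makes possible.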
Disp: Dispatch. Send the uOPs to the appropriate execution unit.
RF: Register File. Read the register file. These are the source(s) for the pending operation (ALU or other).
Ex: Execute. Execute the uOPs on the appropriate execution port.
Flgs: Flags. Compute flags (zero, negative, etc.). These are typically input to a branch instruction.
Br Ck: Branch Check. The branch operation compares the actual branch direction with the prediction.
Drive: Wire delay. Drive the result of the branch check to the front end of the machine.
Itanium® Processor Family Architecture
• EPIC: explicitly parallel instruction computing
• Instruction encoding
– Bundles and templates
• Large register resources
– 128 integer
– 128 floating point
• Support for
– Software pipelining
– Predication
– Speculation (Control, Data, Load)
EPIC – Explicitly Parallel Instruction Computing
• Focused on parallel execution
• Instructions are issued in bundles
• Instructions are distributed among the processor's execution units according to type
• Currently up to two complete bundles can be dispatched per clock cycle
– Pipeline stages: 10 (Itanium® 1), 8 (Itanium® 2)
Instruction Format: Bundles & Templates
• Bundle
– Set of three instructions (41 bits each)
• Template
– Identifies types of instructions in bundle
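The bundle arithmetic can be checked with a short sketch: three 41-bit slots in a 128-bit bundle leave 5 bits for the template. The field order used here (template in the lowest bits, then slot 0, slot 1, slot 2) is an assumption for the example, and the helper names `pack_bundle` and `unpack_bundle` are hypothetical.

```python
SLOT_BITS = 41
BUNDLE_BITS = 128
TEMPLATE_BITS = BUNDLE_BITS - 3 * SLOT_BITS  # 5 bits remain for the template


def pack_bundle(template, slots):
    """Pack a template and three 41-bit instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in the template field")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask for i in range(3)]
    return template, slots


word = pack_bundle(0x10, [1, 2, 3])
print(TEMPLATE_BITS, word.bit_length() <= BUNDLE_BITS)  # 5 True
print(unpack_bundle(word))                              # (16, [1, 2, 3])
```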
Instruction Format: Bundles & Templates
•Instruction types
– M: Memory
– I: Shifts and multimedia
– A: Integer Arithmetic and Logical Unit
– B: Branch
– F: Floating point
– L+X: Long (move, branch, …)
Explicitly Parallel Instruction Computing (EPIC)
[Figure: data flow through the processor]
• 128-bit instruction bundles arrive from the I-cache, laid out as S2 | S1 | S0 | T (three instruction slots and a template).
• Fetch one or more bundles for execution (implementation dependent; Itanium® takes two).
• Try to execute all instructions in parallel, depending on available units.
• Functional units: MEM MEM INT INT FP FP B B B
• Retired instruction bundles leave the processor.
[Figure: from handwritten code to execution]
Handwritten code:
instr instr instr ;; instr instr ;; instr instr instr instr instr ;; instr instr ;; instr …
Code generator
Instruction bundles:
instr instr instr tmpl
instr instr instr tmpl
instr instr nop tmpl
instr nop nop tmpl
instr instr nop tmpl
instr instr nop tmpl
instr instr instr tmpl
…
Fetch (two bundles at a time):
instr instr instr tmpl
instr instr instr tmpl
Execution
• The code generator creates bundles, possibly including nops.
• Can the bundle pair execute in parallel?
• Itanium® fetches 2 bundles at a time for execution. They may or may not execute in parallel.
• There are two difficulties:
1) Finding instruction triplets that match the defined templates.
2) Matching pairs of bundles that can execute in parallel.
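The first difficulty, fitting instruction triplets to templates, can be illustrated with a greedy bundler that pads with nops. The template list below is a small illustrative set rather than the full Itanium list, instruction dependences and stops (;;) are ignored, and the function names are hypothetical, so this is a sketch of the idea and not a real code generator.

```python
# Illustrative templates: each letter is the unit type a slot accepts.
TEMPLATES = ["MII", "MMI", "MFI", "MMF", "MIB", "MMB", "MFB", "MBB", "BBB"]


def fill(template, pending):
    """Fill one bundle in program order; return (slots, instructions consumed)."""
    slots, used = [], 0
    for unit in template:
        if used < len(pending) and pending[used] == unit:
            slots.append(unit)  # next instruction fits this slot
            used += 1
        else:
            slots.append("nop")  # slot type does not match: pad with a nop
    return slots, used


def bundle(types):
    """Greedily pick, for each bundle, the template that consumes the most instructions."""
    pending = list(types)
    bundles = []
    while pending:
        best = max(TEMPLATES, key=lambda t: fill(t, pending)[1])
        slots, used = fill(best, pending)
        if used == 0:
            raise ValueError(f"no template accepts type {pending[0]!r}")
        bundles.append((best, slots))
        pending = pending[used:]
    return bundles


for template, slots in bundle(["M", "F", "M", "I"]):
    print(template, slots)
# MFI ['M', 'F', 'nop']
# MII ['M', 'I', 'nop']
```

Four instructions end up occupying six slots in this example, which shows how template constraints translate into nops and lower code density.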
Today's Architecture Challenges
• Performance barriers:
- Memory latency
- Branches
- Loop pipelining and call / return overhead
- Hardware-based instruction scheduling
- Unable to efficiently schedule parallel execution
- Too few registers
- Unable to fully utilize multiple execution units
Improving Performance
• To achieve improved performance, Itanium® architecture code accomplishes the following:
- Increases instruction level parallelism (ILP)
- Improves branch handling
- Hides memory latencies
Instruction level parallelism (ILP)
• Increase ILP by:
– More resources
  • Large register files
  • Avoiding register contention
– 3-instruction wide word
  • Bundle
  • Facilitates parallel processing of instructions
– Enabling the compiler/assembly writer to explicitly indicate parallelism
Itanium 8-stage Pipelines
• In-order issue, out-of-order completion
– All functional units are fully pipelined
• Small branch misprediction penalties
[Figure: pipeline diagram]
IPG ROT → Instruction Buffer → EXP REN REG, followed by one of:
Int: EXE DET WRB
Memory: L1D1 L1D2 L1D3
MultiMedia: MM1 MM2
Floating Point: FP1 FP2 FP3 FP4