

CENG 450 Computer Systems and Architecture

Lecture 12

Amirali Baniasadi

amirali@ece.uvic.ca


Multiple Issue

• Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.

• Superscalar processors

• Very Long Instruction Word (VLIW) processors


1990’s: Superscalar Processors

Bottleneck: CPI >= 1 limits scalar (single-instruction-issue) performance.

Hazards? Superpipelining? Diminishing returns (hazards + overhead).

How can we make CPI = 0.5? Start multiple instructions in every pipeline stage (superscalar), as in the schedule and worked example below.

Cycle   1    2    3    4    5    6    7
Inst0   IF   ID   EX   MEM  WB
Inst1   IF   ID   EX   MEM  WB
Inst2        IF   ID   EX   MEM  WB
Inst3        IF   ID   EX   MEM  WB
Inst4             IF   ID   EX   MEM  WB
Inst5             IF   ID   EX   MEM  WB
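A small worked example (not from the slides, just illustrative C) of why dual issue drives CPI toward 0.5: with an ideal 2-wide, 5-stage pipeline and no hazards, N instructions take about 5 + N/2 - 1 cycles, so CPI falls toward 0.5 as N grows.

#include <stdio.h>

/* Ideal 2-wide, 5-stage pipeline with no hazards (illustrative only):
 * a pair of instructions enters every cycle, so N instructions finish in
 * roughly stages + N/width - 1 cycles. */
int main(void) {
    const double stages = 5.0, width = 2.0;
    for (int n = 6; n <= 6000; n *= 10) {
        double cycles = stages + n / width - 1.0;
        printf("N = %4d  cycles = %6.0f  CPI = %.3f\n", n, cycles, cycles / n);
    }
    return 0;
}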


Superscalar Vs. VLIW

Religious debate, similar to RISC vs. CISC

Wisconsin + Michigan (superscalar) vs. Illinois (VLIW).

Q: Who can schedule code better, hardware or software?


Hardware Scheduling

Pros: high branch prediction accuracy; dynamic information on latencies (cache misses); dynamic information on memory dependences; easy to speculate (and recover from mis-speculation); works for generic, non-loop, irregular code (e.g., databases, desktop applications, compilers).

Cons: limited reorder buffer size limits "lookahead"; high cost/complexity; slow clock.


Software Scheduling

Pros: large scheduling scope (the full program) gives a large "lookahead"; can handle very long latencies; simple hardware with a fast clock.

Cons: only works well for "regular" codes (scientific, FORTRAN); low branch prediction accuracy (can improve by profiling); no information on dynamic latencies such as cache misses (can improve by profiling); painful to speculate and recover from mis-speculation (can improve with hardware support).


Superscalar Processors

Pioneer: IBM (America => RIOS, RS/6000, Power-1).

Superscalar instruction combinations:
1 ALU or memory or branch + 1 FP (RS/6000)
Any 1 + 1 ALU (Pentium)
Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II)

Impact of superscalar: more opportunity for hazards (why?) and more performance loss due to hazards (why?).


Superscalar Processors

• Issues a varying number of instructions per clock

• Scheduling: static (by the compiler) or dynamic (by the hardware)

• Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).

• IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000


Elements of Advanced Superscalars

High-performance instruction fetching: good dynamic branch and jump prediction; multiple instructions per cycle, multiple branches per cycle?

Scheduling and hazard elimination: dynamic scheduling (not necessarily: the Alpha 21064 and Pentium were statically scheduled); register renaming to eliminate WAR and WAW hazards.

Parallel functional units, paths/buses/multiple register ports. High-performance memory systems. Speculative execution.


SS + DS + Speculation

Superscalar + dynamic scheduling + speculation: three great tastes that taste great together.

CPI >= 1? Overcome with superscalar.
Superscalar increases hazards? Overcome with dynamic scheduling.
RAW dependences still a problem? Overcome with a large window.
Branches a problem for filling a large window? Overcome with speculation.


The Big Picture

[Figure: overall flow of a superscalar processor: the static program feeds fetch & branch prediction, instructions are issued, executed, and finally reordered & committed.]


Superscalar Microarchitecture

[Block diagram: pre-decode, instruction cache, and instruction buffer feed decode/rename/dispatch; separate integer/address and floating-point instruction buffers feed the integer and floating-point register files and functional units (including the data cache and memory interface); results flow to reorder and commit.]


Register renaming methods

First method: a physical register file separate from the logical (architectural) register file. A mapping table associates each logical register with the physical register holding its current value, and a free list tracks unassigned physical registers. The physical register file is bigger than the logical one.

Second method: the physical register file is the same size as the logical one; in addition, a buffer with one entry per in-flight instruction (a reorder buffer) holds renamed results.


Register renaming: first method

[Figure: mapping table (r0-r4) and free list before and after renaming "add r3,r3,4": the source reads the old mapping for r3, and the destination r3 is remapped to a physical register taken from the free list.]
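A minimal C sketch of the first method, assuming a small logical file (r0-r4) and a larger physical file; the initial mapping and free-list contents are made up for illustration, and the old physical register is only released when the overwriting instruction commits (simplified here).

#include <stdio.h>

#define NUM_LOGICAL  5      /* r0..r4                                        */
#define NUM_PHYSICAL 16     /* physical file is larger than the logical one  */

int map_table[NUM_LOGICAL]; /* logical reg -> physical reg holding its value */
int free_list[NUM_PHYSICAL];
int free_top = 0;

int  alloc_phys(void)    { return free_list[--free_top]; }
void release_phys(int p) { free_list[free_top++] = p; }

/* Rename one "add rd, rs, imm": sources read the current mapping, the
 * destination gets a fresh physical register from the free list, and the
 * old mapping would be released when the instruction commits.              */
void rename_add(int rd, int rs) {
    int src_phys = map_table[rs];
    int old_dest = map_table[rd];
    int new_dest = alloc_phys();
    map_table[rd] = new_dest;
    printf("add R%d, R%d, 4   (r%d now maps to R%d; R%d freed at commit)\n",
           new_dest, src_phys, rd, new_dest, old_dest);
}

int main(void) {
    for (int r = 0; r < NUM_LOGICAL; r++) map_table[r] = r + 5;  /* arbitrary */
    for (int p = 10; p < NUM_PHYSICAL; p++) release_phys(p);     /* R10..R15  */

    rename_add(3, 3);   /* add r3, r3, 4                                     */
    rename_add(3, 3);   /* a second write to r3 gets another physical reg,   */
    return 0;           /* so WAW/WAR hazards on r3 disappear                */
}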


More Realistic HW: Register Impact

[Chart: effect of limiting the number of renaming registers (infinite, 256, 128, 64, 32, none) on instruction issues per cycle for gcc, espresso, li, fpppp, doducd, and tomcatv. IPC spans roughly 5-15 for the integer programs and 11-45 for the FP programs.]


Reorder Buffer

Reserve an entry at the tail when the instruction is dispatched.

Place data in the entry when execution finishes.

Remove from the head when the instruction completes (commits).

Bypass results to other instructions when needed.
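A compact C sketch of the reorder buffer as a circular queue, assuming each entry carries a destination register and a result value; the size and fields are illustrative, not taken from any particular machine.

#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 8

typedef struct {
    bool busy;       /* entry reserved at dispatch                        */
    bool done;       /* result has been written back                      */
    int  dest_reg;   /* architectural register to update at commit        */
    int  value;      /* result data, valid once done                      */
} RobEntry;

RobEntry rob[ROB_SIZE];
int head = 0, tail = 0, count = 0;

/* Reserve an entry at the tail when the instruction is dispatched. */
int rob_dispatch(int dest_reg) {
    if (count == ROB_SIZE) return -1;          /* ROB full: stall dispatch */
    int tag = tail;
    rob[tag] = (RobEntry){ .busy = true, .done = false, .dest_reg = dest_reg };
    tail = (tail + 1) % ROB_SIZE;
    count++;
    return tag;                                /* tag also names the result */
}

/* Place data in the entry when execution finishes (possibly out of order). */
void rob_complete(int tag, int value) {
    rob[tag].value = value;
    rob[tag].done  = true;
}

/* Remove from the head when complete, updating architectural state in order. */
void rob_commit(int regfile[]) {
    while (count > 0 && rob[head].done) {
        regfile[rob[head].dest_reg] = rob[head].value;
        rob[head].busy = false;
        head = (head + 1) % ROB_SIZE;
        count--;
    }
}

int main(void) {
    int regs[32] = {0};
    int t0 = rob_dispatch(3);        /* older instruction writes r3          */
    int t1 = rob_dispatch(5);        /* younger instruction writes r5        */
    rob_complete(t1, 99);            /* younger one finishes first ...       */
    rob_commit(regs);                /* ... but nothing commits yet          */
    rob_complete(t0, 7);
    rob_commit(regs);                /* now both commit, in program order    */
    printf("r3=%d r5=%d\n", regs[3], regs[5]);
    return 0;
}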


Register renaming: reorder buffer

[Figure: mapping table and reorder buffer before and after renaming. With r3 currently mapped to ROB entry 6, "add r3,r3,4" is renamed to "add rob8, rob6, 4": the source reads the in-flight value by its ROB tag and the destination is the newly allocated entry rob8.]
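The second method in a few lines of C, mirroring the slide's example: with r3 currently mapped to ROB entry 6, the add dispatches as "add rob8, rob6, 4"; the index 8 for the next free ROB entry is assumed here for illustration.

#include <stdio.h>

/* Destinations are renamed to the ROB entry allocated at dispatch, so an
 * in-flight source is named by its ROB tag instead of a register number.   */
int reg_to_rob[32];          /* -1: value lives in the architectural register */

int main(void) {
    for (int r = 0; r < 32; r++) reg_to_rob[r] = -1;
    reg_to_rob[3] = 6;                 /* an older write to r3 sits in rob6   */

    int src  = reg_to_rob[3];          /* source operand: rob6                */
    int dest = 8;                      /* next free ROB entry (assumed)       */
    reg_to_rob[3] = dest;              /* r3 is now named by rob8             */

    printf("add rob%d, rob%d, 4\n", dest, src);
    return 0;
}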


Instruction Buffers

[Same block diagram as before (pre-decode, instruction cache, instruction buffer, decode/rename/dispatch, integer and floating-point instruction buffers, register files, functional units, data cache, memory interface, reorder and commit), here highlighting the instruction buffers between dispatch and the functional units.]


Issue Buffer Organization

a) Single, shared queue: no out-of-order issue, no renaming.

b) Multiple queues, one per instruction type: no out-of-order issue within a queue, but the queues issue out of order with respect to each other.


Issue Buffer Organization

c) Multiple reservation stations (one per instruction type, or one big pool), fed from instruction dispatch: no FIFO ordering; execution starts as soon as operands are ready and hardware is available. Proposed by Tomasulo.


Typical reservation station

Fields: operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
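A C sketch of one reservation-station entry with the fields listed above, plus the result-capture (wakeup) and issue-readiness checks a Tomasulo-style machine performs; the tag widths and the values in main() are made up.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  op;          /* operation to perform                                 */
    int  src1_tag;    /* producer tag for operand 1 (ROB/physical register)   */
    int  data1;       /* operand 1 value, meaningful when valid1 is set       */
    bool valid1;
    int  src2_tag;
    int  data2;
    bool valid2;
    int  dest_tag;    /* where the result goes                                */
    bool busy;
} ResStation;

/* An entry may begin execution once both operands are valid and a
 * functional unit is free (simplified wakeup/select).                        */
bool ready_to_issue(const ResStation *rs, bool fu_free) {
    return rs->busy && rs->valid1 && rs->valid2 && fu_free;
}

/* A completing instruction broadcasts (tag, value); waiting entries grab it. */
void capture_result(ResStation *rs, int tag, int value) {
    if (rs->busy && !rs->valid1 && rs->src1_tag == tag) { rs->data1 = value; rs->valid1 = true; }
    if (rs->busy && !rs->valid2 && rs->src2_tag == tag) { rs->data2 = value; rs->valid2 = true; }
}

int main(void) {
    ResStation rs = { .op = 0, .src1_tag = 4, .valid1 = false,
                      .data2 = 10, .valid2 = true, .dest_tag = 7, .busy = true };
    capture_result(&rs, 4, 32);                 /* producer with tag 4 finishes */
    printf("ready: %d\n", ready_to_issue(&rs, true));
    return 0;
}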


Memory Hazard Detection Logic

[Diagram: issued loads and stores go through address add & translation; a load address buffer and a store address buffer feed an address compare, whose result drives the hazard control before requests go to memory.]
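A toy C version of the store/load address check sketched above: a load is held back if any older, not-yet-performed store has the same address (a real unit would also forward the store data; addresses and buffer size are illustrative).

#include <stdbool.h>
#include <stdio.h>

#define STORE_BUF 8

/* Addresses of older stores that have not yet written memory. */
unsigned store_addr[STORE_BUF];
bool     store_valid[STORE_BUF];

/* True if the load may issue to memory now; false means a potential
 * RAW-through-memory hazard, so the load waits (or forwards from the store). */
bool load_may_issue(unsigned load_addr) {
    for (int i = 0; i < STORE_BUF; i++)
        if (store_valid[i] && store_addr[i] == load_addr)
            return false;
    return true;
}

int main(void) {
    store_valid[0] = true;  store_addr[0] = 0x1000;   /* older store to 0x1000 */
    printf("load 0x1000: %s\n", load_may_issue(0x1000) ? "issue" : "wait");
    printf("load 0x2000: %s\n", load_may_issue(0x2000) ? "issue" : "wait");
    return 0;
}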


Example

MIPS R10000, Alpha 21264, AMD K5: self study. READ THE PAPER.


VLIW

VLIW: very long instruction word. In-order pipe, but each "instruction" is N operations (the VLIW).

Typically "slotted" (i.e., the 1st slot must be an ALU op, the 2nd a load, etc.).

The VLIW travels down the pipe as a unit; the compiler packs independent instructions into the VLIW.

[Pipeline sketch: shared IF and ID stages feed parallel execute lanes (integer ALUs, memory, FP) that write back together.]
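A hypothetical slotted bundle in C, just to make the "travels as a unit" idea concrete; the slot mix (two ALU, one memory, one FP) is an assumption, not the exact format in the slide's figure.

#include <stdio.h>

typedef struct { int opcode, rd, rs1, rs2; } Op;   /* one RISC-style operation */

typedef struct {
    Op alu0, alu1;   /* slots fixed by position: two integer ALU ops           */
    Op mem;          /* one load/store                                          */
    Op fp;           /* one floating-point op                                   */
} VliwBundle;        /* the whole bundle issues, stalls, and retires together   */

int main(void) {
    /* The compiler packs independent operations; unused slots hold NOPs.       */
    VliwBundle b0 = { .alu0 = {1, 1, 2, 3},
                      .alu1 = {1, 4, 5, 6},
                      .mem  = {2, 7, 8, 0},
                      .fp   = {0, 0, 0, 0} };   /* NOP in the FP slot            */
    printf("bundle = 4 slots x %zu bytes = %zu bytes\n", sizeof(Op), sizeof b0);
    return 0;
}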


Very Long Instruction Word

VLIW issues a fixed number of instructions, formatted either as one very large instruction or as a fixed packet of smaller instructions.

A fixed number of instructions (4-16) is scheduled by the compiler, which places operations into wide templates.
Started with microcode ("horizontal microcode").
Joint HP/Intel agreement in 1999/2000: Intel Architecture-64 (IA-64), 64-bit addresses, Itanium, Explicitly Parallel Instruction Computing (EPIC).
Transmeta: translates x86 to VLIW.
Many embedded controllers (TI, Motorola) are VLIW.


Pure VLIW: What Does VLIW Mean?

All latencies fixed

All instructions in VLIW issue at once

No hardware interlocks at all

The compiler is responsible for scheduling the entire pipeline, including stall cycles. This is possible only if you know the pipeline structure and latencies exactly.
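A toy schedule in C showing what "the compiler schedules the stall cycles" means, under an assumed 2-cycle load-use latency; the instruction strings and three-slot layout are purely illustrative.

#include <stdio.h>

/* Pure VLIW sketch: with no interlocks, the dependent add cannot sit in the
 * bundle right after the load, so the compiler inserts an all-NOP bundle.    */
typedef struct { const char *alu; const char *mem; const char *fp; } Bundle;

int main(void) {
    Bundle schedule[] = {
        { "add r5,r6,r7", "ld r1,0(r2)", "nop" },   /* cycle 0                 */
        { "nop",          "nop",         "nop" },   /* cycle 1: covers latency */
        { "add r3,r1,r4", "nop",         "nop" },   /* cycle 2: uses r1        */
    };
    for (int c = 0; c < 3; c++)
        printf("cycle %d: %-14s | %-12s | %s\n",
               c, schedule[c].alu, schedule[c].mem, schedule[c].fp);
    return 0;
}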


Problems with Pure VLIW

Latencies are not fixed (e.g., caches). Option I: don't use caches (forget it). Option II: stall the whole pipeline on a miss? Option III: stall only the instructions waiting for memory? (needs out-of-order execution)

Different implementations: different pipe depths, different latencies. A new pipeline may produce wrong results (code stalls in the wrong place). Recompile for every new implementation?

Code compatibility is very important; it made Intel what it is.


Key: Static Scheduling

VLIW relies on the fact that software can schedule code well

Loop unrolling (we have seen this one already); downsides include code growth and poor scheduling across the unrolled copies. A small example follows.
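A small C example of loop unrolling as a static-scheduling aid: four independent iterations are exposed per pass so a wide machine has work to fill its slots, and the quadrupled body shows the code-growth downside. It assumes n is a multiple of 4.

#include <stdio.h>

void saxpy_unrolled(float *y, const float *x, float a, int n) {
    for (int i = 0; i < n; i += 4) {
        /* these four statements are independent and can be scheduled together */
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    saxpy_unrolled(y, x, 2.0f, 8);
    printf("y[7] = %g\n", y[7]);   /* 16 */
    return 0;
}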


Limits to Multi-Issue Machines

Inherent limitations of ILP: roughly 1 branch in 5 instructions => how to keep a 5-way VLIW busy? Latencies of units => many operations must be in flight. Need about (pipeline depth) x (number of functional units) independent operations to keep the machine busy; e.g., a 5-deep pipeline with 4 units needs about 20.

Difficulties in building the hardware: duplicate functional units for parallel execution; more ports to the register file; more ports to memory; superscalar decoding and its impact on clock rate and pipeline depth (complexity-effective designs).


Limits to Multi-Issue Machines

Limitations specific to either the superscalar or the VLIW implementation:
Decode/issue complexity in superscalar.
VLIW code size: unrolled loops plus wasted fields in the VLIW.
VLIW lock step: one hazard stalls all instructions in the word.
VLIW and binary compatibility.


Multiple Issue Challenges

While an integer/FP split is simple for the hardware, it reaches a CPI of 0.5 only for programs with exactly 50% FP operations and no hazards.

The more instructions issue at the same time, the harder decode and issue become. Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue (a toy version of this check is sketched after this list).

VLIW trades instruction space for simple decoding. The long instruction word has room for many operations, and by definition all the operations the compiler puts in it are independent, so they execute in parallel. E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch; at 16 to 24 bits per field that is 7*16 = 112 to 7*24 = 168 bits wide.

Need a compiling technique that schedules across several branches.
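A toy C version of the 2-scalar issue check mentioned above: it looks at the two opcodes and the register specifiers and decides whether both instructions can issue together. The integer/FP pairing rule and the single register namespace are simplifying assumptions, not a real machine's rules.

#include <stdbool.h>
#include <stdio.h>

typedef enum { OP_ALU, OP_FP, OP_LOAD, OP_STORE, OP_BRANCH } OpClass;
typedef struct { OpClass cls; int rd, rs1, rs2; } Inst;

/* Can instruction b issue in the same cycle as instruction a?
 * Enforces (a) an integer/FP slot split and (b) no RAW dependence a -> b.   */
bool can_dual_issue(Inst a, Inst b) {
    bool slot_ok = (a.cls != OP_FP && b.cls == OP_FP) ||
                   (a.cls == OP_FP && b.cls != OP_FP);     /* assumed pairing rule */
    bool raw     = (b.rs1 == a.rd) || (b.rs2 == a.rd);
    return slot_ok && !raw;
}

int main(void) {
    Inst add  = { OP_ALU, 3, 1, 2 };     /* add  r3, r1, r2                       */
    Inst fmul = { OP_FP,  8, 5, 4 };     /* independent FP op                     */
    Inst use  = { OP_FP,  9, 3, 4 };     /* reads r3 (single namespace assumed)   */
    printf("add + fmul: %s\n", can_dual_issue(add, fmul) ? "dual issue" : "one at a time");
    printf("add + use : %s\n", can_dual_issue(add, use)  ? "dual issue" : "one at a time");
    return 0;
}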


HW Support for More ILP

How is this used in practice? Rather than predicting the direction of a branch, execute the instructions on both sides!

We know the target of a branch early on, long before we know whether it will be taken. So begin fetching/executing at that new target PC, but also continue fetching/executing as if the branch were NOT taken.


Studies of ILP

Conflicting studies of the amount of improvement available; results depend on the benchmarks (vectorized FP Fortran vs. integer C programs), hardware sophistication, and compiler sophistication.

How much ILP is available using existing mechanisms with increasing HW budgets?

Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?


Summary

Static ILP: simple hardware with a fast clock, advanced loads, and predication, plus complex compilers.

VLIW


Summary

Dynamic ILP: instruction buffer.

Split ID into two stages: one for in-order issue, the other for out-of-order execution.

Scoreboard: out of order, but does not deal with WAR/WAW hazards.

Tomasulo's algorithm: uses register renaming to eliminate WAR/WAW hazards.

Dynamic scheduling + speculation. Superscalar.