COMP 212 Computer Organization & Architecture · Pipeline Re-Cap · comp212/lec2008/lec-12-risc... · 2008-11-26 · Z. Li, 2008

COMP 212 Computer Organization & Architecture

COMP 212 Fall 2008

Lecture 12

RISC & Superscalar


Pipeline Re-Cap

• Pipelining is a form of ILP (Instruction Level Parallelism)

– Divide the instruction cycle into stages and overlap their execution

– Could potentially achieve a k-times speedup for a k-stage pipeline

• Pipeline Hazards:

– Structural: two micro-ops require the same circuits in the same cycle

– Control: branch target PC not known until execution

– Data: a successive instruction reads the output of a previous instruction


Instruction Micro-Operations

• A 6-stage pipeline

– Execution takes longer than fetch

– Break up execution into sub-cycles, i.e., DI, CO, FO, EI, WO

– Allow overlapping, or pre-fetch the next instruction

– Branch: may have to re-fetch the correct instruction


Instruction Pipeline – no hazard

Speedup: 9x6=54 (no pipeline) vs 14 (pipelined) time slots.


Conditional branching

• The correct PC address is runtime dependent

Branch


Alternative Pipeline View

Flush out I6-I3

Found that the correct PC should be I15


Speedup – perfect case

• k-stage pipeline, n instructions, execution time speed up:
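The formula itself did not survive on this slide. The usual expression for the ideal case is speedup = n·k / (k + (n − 1)), which agrees with the 54 vs 14 time slots quoted on the no-hazard slide. A minimal sketch, assuming no hazards and one time slot per stage (the function names are illustrative, not from the lecture):

```python
def pipelined_cycles(k: int, n: int) -> int:
    """Time slots for n instructions on a k-stage pipeline with no hazards."""
    # The first instruction needs k slots; each later one completes one slot after
    # its predecessor.
    return k + (n - 1)


def speedup(k: int, n: int) -> float:
    """Ratio of unpipelined time (n * k slots) to pipelined time."""
    return (n * k) / pipelined_cycles(k, n)


if __name__ == "__main__":
    print(pipelined_cycles(6, 9))    # 14 time slots for 9 instructions, 6 stages
    print(round(speedup(6, 9), 2))   # 3.86
```

As n grows, the ratio approaches k, which is the "k times" bound from the re-cap slide.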


Dealing with Branches

• Pipeline efficiency depends on a steady stream of instructions that fills up the pipeline

• Conditional branching is a major drawback for efficiency

• Can be dealt with by:

– Multiple Streams

– Prefetch Branch Target

– Loop buffer

– Branch prediction

– Delayed branching


Branch Prediction – Static Solutions

• Predict never taken

– Assume that jump will not happen

– Always fetch next instruction

– 68020 & VAX 11/780

• Predict always taken

– Assume that jump will happen

– Always fetch target instruction

• Predict by opcode

– Based on statistics collected on how often each opcode branches

– Prediction success rate > 75%


Branch Prediction – Dynamic, Runtime Based

• Taken/Not taken switch

– Use 1 or 2 bits to record taken/not taken history

– Good for loops

• Branch history table

– Based on previous history

– Good for loops
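The 2-bit taken/not-taken switch can be sketched as a saturating counter. This is one common variant and may differ in detail from the state diagram on the next slide; the class name, the initial state, and the loop pattern are assumptions made for illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 0):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor) -> int:
    """Run the predictor over a sequence of branch outcomes (True = taken)."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop-closing branch: taken 9 times, then falls through; the loop runs 3 times.
    loop_branch = ([True] * 9 + [False]) * 3
    print(count_mispredictions(loop_branch, TwoBitPredictor()))  # 5
```

After warming up, the predictor misses only once per loop exit, which is why the slide calls the scheme good for loops.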


Branch Prediction State Diagram


RISC

Reduced Instruction Set Computer


Motivation of RISC

• Improve Pipeline efficiency

– Fixed instruction format and small number of instructions:

» Make the operations more predictable and manageable

– Large register files

» avoid data dependency and hazard

– Both compile time and run time pipeline optimization,

» register renaming, out of order execution.


A little bit of history….

• The computer family concept

– IBM System/360 1964, DEC PDP-8

– Separates architecture from implementation

• Microprogrammed control unit

– Idea by Wilkes 1951, produced by IBM S/360 1964

– Flexibility and extensibility in CPU control implementation.

• Cache memory

– IBM S/360 model 85 1969


A bit of history….

• Solid State RAM

– (See memory notes)

• Microprocessors

– Intel 4004 1971

• Pipelining

– Introduces parallelism into fetch execute cycle

• Multiple processors


The Next Step - RISC

• Reduced Instruction Set Computer

• Key features

– Large number of general purpose registers

– or use of compiler technology to optimize register use

– Limited and simple instruction set

– Emphasis on optimising the instruction pipeline


Instruction Characteristics

• Operations Performed

– Functions to be performed, how it interacts with memory

• Operands Used

– Types of operands

– Memory organization and addressing modes

• Executing Sequence

– Control and pipeline operations


Operations

• Assignments

– Movement of data

• Conditional statements (IF, THEN, FOR, WHILE)

– Sequence control

• Procedure call-return is very time consuming

• Some HLL instructions lead to many machine code operations


Operation Statistics

• In a High Level Language (HLL) like C/Pascal, assignment is the dominating operation

• Number of machine instruction/memory references:


Operands

• Mainly local scalar variables

• Optimisation should concentrate on accessing local variables

                    Pascal   C      Average
Integer Constant    16%      23%    20%
Scalar Variable     58%      53%    55%
Array/Structure     26%      24%    25%


Procedure Calls

• Time consuming

– Depends on number of parameters passed

– Depends on level of nesting

• Most programs do not do a lot of calls followed by lots of returns

• Most variables are local


Implications

• Best support is given by optimising the most used and most time consuming features

• Large number of registers

– Operand referencing

• Careful design of pipelines

– Branch prediction etc.

• Simplified (reduced) instruction set


Large Register File

• Software solution

– Require compiler to allocate registers

– Allocate based on most used variables in a given time

– Requires sophisticated program analysis

• Hardware solution

– Have more registers

– Thus more variables will be in registers


Why CISC?

• Software costs far exceed hardware costs

• Increasingly complex high level languages (HLL)

• Semantic gap: machine instruction vs HLL instruction

• Leads to:

– Large instruction sets

– More addressing modes

– Hardware implementations of HLL (high level language) statements

» e.g. CASE (switch) on VAX


Intention of CISC

• Ease compiler writing

• Improve execution efficiency

– Complex operations in microcode/micro-ops

• Support more complex HLLs

• However, CISC instructions are complex and hard to predict and optimize.


Variable access localization


Registers for Local Variables

• Register is the fastest storage

– Better than cache and memory

• Keeping data in registers as much as possible is good for performance

– Software approach: the compiler works out the assignment of variables to registers at compile time

– Hardware approach: register windows


Register Windows

• Most operand references are to a few local variables of the function, along with a couple of globals

• Function calls change the local variable set

• Function calls also involve parameters to be passed

• So, instead of using the stack to save local variables and pass parameters, partition the register file into sets (windows)

• Select a different window according to where program execution is in the call nesting


Register Windows cont.

• Three areas within a register set

– Parameter registers

– Local registers

– Temporary registers

• Examples:

– Berkeley RISC uses 8 windows of 16 registers each


Overlapping Register Windows

– Temporary registers from one set overlap the parameter registers of the next

– This allows parameter passing without moving data


Circular Buffer diagram

• Managing register windows

– When a call is made, the current window pointer is moved to show the currently active register window

– If all windows are in use, an interrupt is generated and the oldest window (the one furthest back in the call nesting) is saved to memory

– A saved window pointer indicates where the next saved window should be restored to
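The window management described above can be sketched as a small simulation. This is a toy model under stated assumptions: the class and attribute names are invented here, memory is modelled as a Python list, and an N-window file is assumed to hold N-1 activations because adjacent windows overlap.

```python
class RegisterWindowFile:
    """Toy model of a circular buffer of register windows (names are illustrative)."""

    def __init__(self, n_windows: int = 8):
        self.n = n_windows
        self.cwp = 0        # current window pointer
        self.resident = 1   # activations currently held in registers
        self.saved = []     # windows spilled to memory, most recent last
        self.spills = 0
        self.fills = 0

    def call(self) -> None:
        # Assumption: because adjacent windows overlap, only n - 1 activations fit.
        if self.resident == self.n - 1:
            oldest = (self.cwp - self.resident + 1) % self.n
            self.saved.append(oldest)   # stands in for the overflow interrupt handler
            self.spills += 1
            self.resident -= 1
        self.cwp = (self.cwp + 1) % self.n
        self.resident += 1

    def ret(self) -> None:
        self.cwp = (self.cwp - 1) % self.n
        self.resident -= 1
        if self.resident == 0 and self.saved:
            self.saved.pop()            # restore the most recently saved window
            self.fills += 1
            self.resident = 1


if __name__ == "__main__":
    rw = RegisterWindowFile(8)
    for _ in range(10):
        rw.call()
    print(rw.spills)   # 4: ten nested calls exceed what 8 windows can hold
```

Memory traffic happens only when the call depth swings past the number of windows, which fits the earlier observation that most programs do not make long runs of calls followed by long runs of returns.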


Global Variables

• Allocated by the compiler to memory

– Inefficient for frequently accessed variables

• Have a set of registers for global variables

– E.g., reserve R0 to R7 for storing globals


Referencing variable in windowed register


Referencing variable in cache


Registers v Cache

• Windowed register file:

– Stores all variables of the N-1 most recent procedure calls; faster; handles globals well

• Cache:

– Stores a selection of recently used variables; more efficient use of memory


Compiler Based Register Optimization

• Assume a small number of registers (16-32)

• Optimizing is up to the compiler

• HLL programs have no explicit references to registers

• Assign a symbolic or virtual register to each candidate variable

• Map (unlimited) symbolic registers to real registers

• Symbolic registers that do not overlap can share real registers

• If you run out of real registers, some variables use memory


How to assign variables to registers ?

• Graph Coloring Algorithm:

– Build the register interference graph, with one node per variable

– If two variables are alive at the same time (they interfere with each other), draw an edge between them

– Try to find the smallest number of colors for all nodes, such that nodes interfering with each other do not have the same color

– Each color is assigned to a different register
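A minimal sketch of the idea, using greedy coloring rather than a true minimum coloring (finding the minimum is hard in general, so this is a stand-in technique). Live ranges are given as hypothetical half-open intervals of instruction indices; the function names and the example variables are assumptions.

```python
def build_interference(live_ranges):
    """live_ranges: dict of variable name -> (start, end), end exclusive."""
    graph = {v: set() for v in live_ranges}
    names = list(live_ranges)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (s1, e1), (s2, e2) = live_ranges[a], live_ranges[b]
            if s1 < e2 and s2 < e1:      # ranges overlap: both alive at once
                graph[a].add(b)
                graph[b].add(a)
    return graph


def color(graph, n_registers):
    """Greedy coloring; a variable mapped to None must live in memory."""
    assignment = {}
    # Visit the most constrained variables first (highest degree), ties by name.
    for v in sorted(graph, key=lambda x: (-len(graph[x]), x)):
        used = {assignment[u] for u in graph[v] if u in assignment}
        free = [r for r in range(n_registers) if r not in used]
        assignment[v] = free[0] if free else None
    return assignment


if __name__ == "__main__":
    ranges = {"a": (0, 4), "b": (1, 3), "c": (3, 6), "d": (5, 8)}
    g = build_interference(ranges)
    print(color(g, 2))   # {'a': 0, 'c': 1, 'b': 1, 'd': 0}
```

Here four variables fit in two registers because `b` and `c`, and likewise `a` and `d`, are never alive at the same time.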


RISC Pipelining

• Most instructions are register to register

• Two phases of execution

– I: Instruction fetch

– E: Execute

» ALU operation with register input and output

• For load and store

– I: Instruction fetch

– E: Execute

» Calculate memory address

– D: Memory

» Register to memory or memory to register operation


Effects of Pipelining

(Figure labels: 13 cycles; 10 cycles, 1 mem port; 8 cycles, 2 mem ports)


Optimization of Pipelining

• Out of order execution

– Insert NOOPs so that no extra circuitry is needed to clear the pipeline

– Out of order execution:


CISC vs RISC: a summary


Comparison of CISC/RISC processors


CISC vs RISC

• Compiler simplification?

– Complex machine instructions harder to exploit

– Optimization more difficult

• Smaller programs?

– Program takes up less memory but…

– Memory is now cheap

– May not occupy fewer bits, just look shorter in symbolic form

» More instructions require longer op-codes

» Register references require fewer bits


CISC vs RISC

• Faster programs?

– Bias towards use of simpler instructions

– More complex control unit

– Microprogram control store larger

– thus simple instructions take longer to execute

• It is far from clear that CISC is the appropriate solution


RISC Characteristics

• Simple instructions

– One instruction per cycle

– Register to register operations

– Few, simple addressing modes

– Few, simple instruction formats

– Hardwired design (no microcode)

– Fixed instruction format

• More compile time optimization effort

– Register renaming

– Out of order execution


No conclusive comparison

• Quantitative

– Compare program sizes and execution speeds

• Qualitative

– Examine issues of high level language support and use of VLSI real estate

• Problems

– No pair of RISC and CISC machines that are directly comparable

– No definitive set of test programs

– Difficult to separate hardware effects from compiler effects

– Most comparisons done on “toy” rather than production machines

– Most commercial devices are a mixture


Superscalar Architecture


What is Superscalar?

• Scalar computer: handles one instruction and one data item at a time

• Vector computer: handles multiple data items at a time

• Superscalar Computer:

– Multiple independent pipelines (2 int, 2 fp, 1 mem) are implemented

– Each pipeline has stages which can also handle multiple instructions


Superscalar vs Superpipeline

• Superpipeline:

– Many pipeline stages need less than half a clock cycle

– Doubling the internal clock speed gets two tasks done per external clock cycle

• Superscalar allows parallel fetch and execute


Limitations

• Instruction level parallelism

• Compiler based optimisation

• Hardware techniques

• Limited by

– True data dependency

– Procedural dependency

– Resource conflicts

– Output dependency

– Antidependency


True Data Dependency

• ADD r1, r2 (r1 := r1+r2;)

• MOVE r3,r1 (r3 := r1;)

• Can fetch and decode the second instruction in parallel with the first

• Can NOT execute the second instruction until the first is finished


Procedural Dependency

• Cannot execute instructions after a branch in parallel with instructions before the branch

• Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed

• This prevents simultaneous fetches


Resource Conflict

• Two or more instructions requiring access to the same resource at the same time

– e.g. two arithmetic instructions

• Solution: Can duplicate resources

– e.g. have two arithmetic units


Effect of Dependency

• Illustration of

– Data dependency

– Procedural (branch) dependency

– Resource dependency


Design Issues

• Instruction level parallelism

– Instructions in a sequence are independent

– Execution can be overlapped

– Governed by data and procedural dependency

• Machine Parallelism

– Ability to take advantage of instruction level parallelism

– Governed by number of parallel pipelines


Instruction Issue Policy

• Order in which instructions are fetched

• Order in which instructions are executed

• Order in which instructions change registers and memory


In-order issue & exec

• Issue instructions in the order they occur

• Not very efficient

• May fetch >1 instruction

• Instructions must stall if necessary (2 fetch, 3 exe, 2 mem ports)


In-order issue, out-of-order execute

• Output dependency

– R3:= R3 + R5; (I1)

– R4:= R3 + 1; (I2)

– R3:= R5 + 1; (I3)

– I2 depends on the result of I1: true data dependency

– If I3 completes before I1, R3 is left holding I1's result instead of I3's: output (write-write) dependency


Out-of-order issue and execute

• Decouple the decode pipeline from the execution pipeline

• Can continue to fetch and decode until the instruction window is full

• When a functional unit becomes available, an instruction can be executed

• Since instructions have been decoded, the processor can look ahead


Antidependency

• Read-write (write-after-read) dependency

– R3:=R3 + R5; (I1)

– R4:=R3 + 1; (I2)

– R3:=R5 + 1; (I3)

– R7:=R3 + R4; (I4)

– I3 cannot complete before I2 starts, as I2 needs a value in R3 and I3 changes R3
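The three register dependency types can be told apart mechanically. The sketch below is illustrative only: each instruction is reduced to a hypothetical `(dest, sources)` pair, immediates are dropped, and the check looks at one pair of instructions at a time without accounting for writes in between.

```python
def classify(first, second):
    """Dependencies of `second` on `first`, where `first` is earlier in program order.

    Each instruction is (dest_register, tuple_of_source_registers).
    """
    d1, s1 = first
    d2, s2 = second
    kinds = set()
    if d1 in s2:
        kinds.add("true")     # read after write: second reads what first writes
    if d1 == d2:
        kinds.add("output")   # write after write: both write the same register
    if d2 in s1:
        kinds.add("anti")     # write after read: second overwrites what first reads
    return kinds


if __name__ == "__main__":
    I1 = ("R3", ("R3", "R5"))   # R3 := R3 + R5
    I2 = ("R4", ("R3",))        # R4 := R3 + 1
    I3 = ("R3", ("R5",))        # R3 := R5 + 1
    print(classify(I1, I2))     # {'true'}
    print(classify(I2, I3))     # {'anti'}
    print(sorted(classify(I1, I3)))   # ['anti', 'output']
```

Only the true dependency reflects a real flow of data; the output and anti cases come from reusing the name R3, which is what register renaming removes.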


Register Renaming

• Output and antidependencies occur because register contents may not reflect the correct ordering from the program

• May result in a pipeline stall

• Registers allocated dynamically

– i.e. registers are not specifically named


Register Renaming example

• R3b:=R3a + R5a (I1)

• R4b:=R3b + 1 (I2)

• R3c:=R5a + 1 (I3)

• R7b:=R3c + R4b (I4)

• A name without a subscript refers to the logical register in the instruction

• A name with a subscript is the hardware register allocated

• Note the three distinct registers R3a, R3b, R3c
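The renaming in this example can be reproduced with a few lines of bookkeeping. This is a sketch under assumptions: instructions are hypothetical `(dest, sources)` pairs with immediates omitted, and letter suffixes stand in for freshly allocated hardware registers, following the slide's notation.

```python
import string


def rename(program):
    """Give every write a fresh version of its destination register.

    program: list of (dest, sources) over logical register names.
    Returns the same instructions over versioned names such as 'R3b'.
    """
    version = {}   # logical register -> index of its current version

    def current(reg):
        return reg + string.ascii_lowercase[version.get(reg, 0)]

    renamed = []
    for dest, sources in program:
        srcs = tuple(current(s) for s in sources)   # read the latest versions first
        version[dest] = version.get(dest, 0) + 1    # then allocate a new destination
        renamed.append((current(dest), srcs))
    return renamed


if __name__ == "__main__":
    program = [
        ("R3", ("R3", "R5")),   # I1
        ("R4", ("R3",)),        # I2
        ("R3", ("R5",)),        # I3
        ("R7", ("R3", "R4")),   # I4
    ]
    for dest, srcs in rename(program):
        print(dest, ":=", " + ".join(srcs))
```

After renaming, I3 writes R3c while I1 writes R3b and I2 reads R3b, so the output and antidependencies on R3 disappear and only the true dependencies remain.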


Machine Parallelism

• Duplication of Resources

• Out of order issue

• Renaming

• Not worth duplicating functions without register renaming

• Need instruction window large enough (more than 8)


Performances


Branch Prediction

• 80486 fetches both the next sequential instruction after the branch and the branch target instruction

• Gives two cycle delay if branch taken


RISC - Delayed Branch

• Calculate the result of the branch before unusable instructions are pre-fetched

• Always execute the single instruction immediately following the branch

• Keeps pipeline full while fetching new instruction stream

• Not as good for superscalar

– Multiple instructions need to execute in delay slot

– Instruction dependence problems

• Revert to branch prediction


Superscalar Implementation

• Simultaneously fetch multiple instructions

• Logic to determine true dependencies involving register values

• Mechanisms to communicate these values

• Mechanisms to initiate multiple instructions in parallel

• Resources for parallel execution of multiple instructions

• Mechanisms for committing process state in correct order


PowerPC

• Direct descendant of IBM 801, RT PC and RS/6000

• All are RISC

• RS/6000 first superscalar

• PowerPC 601 superscalar design similar to RS/6000

• Later versions extend superscalar concept


PowerPC 601 General View


PowerPC 601 Pipeline


Summary

• RISC

– A design that simplifies elementary instruction processing

– Allows for later optimization and efficiency improvements by the compiler and run-time circuits

– Mainstream solution now: MIPS, SPARC, PowerPC, etc.

• Superscalar

– Multiple fetch, execution and memory port units

– Additional dimension to achieve ILP

– Raises more complex issues of consistency and correctness of execution.
