
© 2002 Edward F. Gehringer, ECE 463/521 Lecture Notes, Fall 2002. Based on notes from Drs. Tom Conte & Eric Rotenberg of NCSU. Figures from CAQA used with permission of Morgan Kaufmann Publishers. © 2003 Elsevier Science (USA).

Pipelining

We’ve already covered the basics of pipelining in Lecture 1.

We saw that cars could be built on an assembly line, and that instructions could be executed in much the same way.

[H&P §A.1] In the ideal situation, this could give a speedup equal to the number of pipeline stages:

    Time per instruction on the pipelined machine = (Time to execute an instruction on the unpipelined machine) / (Number of pipe stages)

However, this assumes “perfectly balanced” stages, i.e., each stage requires exactly the same amount of time.

This is rarely the case, and anyway, pipelining does involve some extra overhead.
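For example (stage delays invented purely for illustration): if a 5 ns instruction were split into stages taking 1.0, 1.0, 1.2, 1.0, and 0.8 ns, the clock period would be set by the slowest stage, 1.2 ns, so the best possible speedup would be 5/1.2 ≈ 4.2 rather than 5.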

Three aspects of RISC architectures make them easy to pipeline:

• All operations on data apply to data in registers.

• Only load and store operations move data between memory and registers.

• All instructions are the same size, and there are few instruction formats.

An unpipelined RISC

For our examples, we’ll work with a simplified RISC instruction set. In an unpipelined implementation, instructions take at most 5 clock cycles. One cycle is devoted to each of the following:

• Instruction fetch (IF).

Fetch the current instruction (the one pointed to by the PC).
IR ← Mem[PC]

Update the PC by adding 4 (the instruction size).
NPC ← PC + 4

• Instruction decode/register fetch (ID).

Decode the instruction.

Read the source registers from the register file.
A ← Regs[IR6..10]; B ← Regs[IR11..15]

Sign-extend the offset (displacement) field of the instruction.
Imm ← sign-extend(IR16..31)

Check for a possible branch (by reading values from the source registers).
Cond ← (A rel B)

Compute the branch target address by adding the sign-extended immediate to the NPC.
ALU_Output ← NPC + Imm

If the branch is taken, store the branch-target address into the PC.
If (cond) PC ← ALU_Output, else PC ← NPC

What feature of the ISA makes it possible to read the registers in this stage?

• Execute/compute effective address (EX).

The ALU operates on the operands, performing one of three types of functions, depending on the opcode:

– Memory reference: the ALU adds the base register and the sign-extended offset to form the effective address.
  ALU_Output ← A + Imm

– Register-register instruction: the ALU performs the operation on the values read from the register file.
  ALU_Output ← A op B

– Register-immediate instruction: the ALU performs the operation on the value read from the register file and the sign-extended immediate.
  ALU_Output ← A op Imm

In a load-store architecture, execution can be done at the same time as effective-address computation because no instruction needs the ALU for both a data computation and an address computation: only loads and stores compute effective addresses, and they perform no other ALU operation.

• Memory access (MEM).
Load_Mem_Data ← Mem[ALU_Output]  /* Load */
Mem[ALU_Output] ← B  /* Store */

• Write-back (WB). If the instruction is an ALU operation (register-register or register-immediate) or a load, the result is written into the register file at the address specified by the destination operand.

Reg-Reg ALU operation: Regs[IR16..20] ← ALU_Output
Reg-Immediate ALU operation: Regs[IR11..15] ← ALU_Output
Load instruction: Regs[IR11..15] ← Load_Mem_Data
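To tie the five steps together, here is a minimal Python sketch of an unpipelined interpreter for a toy load/store machine, following the register-transfer descriptions above (one block per step). The tuple encoding, the opcode names ("lw", "sw", "beq", "add", "addi", "sub"), and the register names are assumptions made only for this sketch, not the instruction set used in the notes.

def step_instruction(pc, imem, dmem, regs):
    """Interpret ONE instruction of a toy load/store ISA, one block per pipe stage."""
    # IF: fetch the instruction and compute the incremented PC.
    ir = imem[pc]                       # IR  <- Mem[PC]
    npc = pc + 4                        # NPC <- PC + 4

    # ID: decode, read both source registers, and (as in these notes)
    #     resolve branches in this step.
    op, rs, rt, rd, imm = ir            # assumed pre-decoded tuple encoding
    a, b = regs[rs], regs[rt]           # A <- Regs[rs]; B <- Regs[rt]
    if op == "beq":
        return (npc + imm) if a == b else npc   # if (cond) PC <- ALU_Output else NPC

    # EX: ALU operation or effective-address computation.
    if op in ("lw", "sw"):
        alu_out = a + imm               # effective address = base + offset
    elif op == "addi":
        alu_out = a + imm               # register-immediate ALU operation
    elif op == "add":
        alu_out = a + b                 # register-register ALU operations
    else:                               # "sub"
        alu_out = a - b

    # MEM: data-memory access for loads and stores only.
    if op == "lw":
        load_mem_data = dmem[alu_out]   # Load_Mem_Data <- Mem[ALU_Output]
    elif op == "sw":
        dmem[alu_out] = b               # Mem[ALU_Output] <- B

    # WB: write the result into the register file.
    if op in ("add", "sub"):
        regs[rd] = alu_out
    elif op == "addi":
        regs[rt] = alu_out
    elif op == "lw":
        regs[rt] = load_mem_data

    return npc

# Tiny usage example (hypothetical program): r3 <- r1 + r2
regs = {f"r{i}": 0 for i in range(8)}
regs["r1"], regs["r2"] = 5, 7
imem = {0: ("add", "r1", "r2", "r3", 0)}
next_pc = step_instruction(0, imem, {}, regs)
print(regs["r3"], next_pc)              # prints: 12 4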

In this implementation, some instructions require 2 cycles, some require 4, and some require 5.

• 2 cycles: branches (they complete in ID).

• 4 cycles: stores (they skip WB).

• 5 cycles: all other instructions (loads and ALU operations).

Assuming the instruction frequencies from the integer benchmarks mentioned in the last lecture, what’s the CPI of this architecture?
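For illustration only, using the instruction mix assumed in the speedup example later in these notes rather than the benchmark data from the last lecture: with 20% branches (2 cycles), 10% stores (4 cycles), and 70% other instructions (5 cycles), CPI = 0.2 × 2 + 0.1 × 4 + 0.7 × 5 = 4.3.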

Pipelining our RISC

It’s easy to pipeline this architecture: just make each clock cycle into a pipe stage.

Clock #        1    2    3    4    5    6    7    8    9
Instruction i  IF   ID   EX   MEM  WB
Instr. i+1          IF   ID   EX   MEM  WB
Instr. i+2               IF   ID   EX   MEM  WB
Instr. i+3                    IF   ID   EX   MEM  WB
Instr. i+4                         IF   ID   EX   MEM  WB

Here is a diagram of our instruction pipeline.

In this pipeline, the major functional units are used in different cycles, so overlapping the execution of instructions introduces few conflicts.

• Separating the instruction and data caches eliminates a conflict that would arise in the IF and MEM stages.

Of course, we have to access these caches faster than we would in an unpipelined processor.

• The register file is used in two stages: it is read in ID and written in WB.

[Figure: the pipelined datapath, with stages Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), and Write-back (WB); it shows the PC, instruction cache, register file, sign-extend unit, ALU, data cache, and the multiplexers and latched values (NPC, IR, A, B, Imm, cond, LMD) that connect them.]

Thus, we need to perform two register reads and one register write each clock cycle.

To handle reads and writes to the same register, we write in the first half of the clock cycle and read in the second half. (This way, an instruction writing its result in WB can pass the value through the register file to an instruction reading it in ID during the same cycle.)

• Something is incomplete about our diagram of the IF stage. What?

We’ve omitted one thing from the diagram above: we need a place to save values between pipeline stages. Otherwise, the different instructions in the pipeline would interfere with each other.

So we insert latches, or pipeline registers, between stages. Of course, we’d need latches even in an unpipelined multicycle implementation.
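As a rough illustration of what the pipeline registers do, the following Python sketch (names and structure assumed, not taken from the notes) shifts instruction identifiers through the four latches, one per stage boundary. Each in-flight instruction advances exactly one stage per clock and never clobbers another instruction's values.

def run(program, cycles):
    # One latch (pipeline register) feeding each stage after IF.
    latches = {"IF/ID": None, "ID/EX": None, "EX/MEM": None, "MEM/WB": None}
    pc = 0
    for cycle in range(1, cycles + 1):
        retiring = latches["MEM/WB"]          # instruction in WB this cycle
        # Shift from the back of the pipe forward, so every value advances
        # exactly one stage per clock and nothing is overwritten early.
        latches["MEM/WB"] = latches["EX/MEM"]
        latches["EX/MEM"] = latches["ID/EX"]
        latches["ID/EX"]  = latches["IF/ID"]
        latches["IF/ID"]  = program[pc] if pc < len(program) else None
        pc += 1
        state = "  ".join(f"{name}={val or '-'}" for name, val in latches.items())
        print(f"cycle {cycle}: {state}  completing WB: {retiring or '-'}")

run(["i", "i+1", "i+2", "i+3", "i+4"], 9)

Running this reproduces the timing table above: instruction i completes WB in cycle 5, and i+4 in cycle 9.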

What is our pipeline speedup, then …?

Of course, we have to allow for latch-delay time.

We also need to allow for clock skew, the maximum delay between when the clock arrives at any two registers.

Let’s define To’head = Tlatch + Tskew.

Speedup = Avg. unpipelined execution time / Avg. pipelined execution time

        = Tunpipe / (Tunpipe/n + To'head),  where n is the number of pipe stages

        = n   (ideal case, where To'head = 0)

Example: Consider the unpipelined processor in the previous example. Assume:

• Clock cycle is 1 ns.

• Branch instructions, 20% of the total, take 2 cycles.

• Store instructions, 10% of the total, take 4 cycles.

• All other instructions take 5 cycles.

• Clock skew and latch delay add 0.2 ns to the cycle time.

What is the speedup from pipelining?
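A sketch of the calculation, assuming the pipelined machine sustains an ideal CPI of 1 (hazards ignored): the average unpipelined time per instruction is (0.2 × 2 + 0.1 × 4 + 0.7 × 5) × 1 ns = 4.3 ns; the pipelined time per instruction is 1 ns + 0.2 ns = 1.2 ns; so the speedup is 4.3 / 1.2 ≈ 3.6.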

How can pipelining help?

How can pipelining improve performance?

• If we keep the cycle time (CT) constant, by improving CPI …

• If we keep CPI constant, by improving CT …

• Usually we improve both CT and CPI.

Pipeline hazards

A hazard reduces the performance of the pipeline. Hazards arise because of the program’s characteristics.

There are three kinds of hazards.

• Structural hazards: not enough hardware resources exist for all combinations of instructions.

• Data hazards: dependences between instructions prevent their overlapped execution.

• Control hazards: branches change the PC, which results in stalls while branch targets are fetched.

[Figure: timing diagrams (IF ID EX MEM WB) for the preceding "How can pipelining help?" discussion, comparing unpipelined execution (50 ns per instruction) with pipelined execution (CPI_pipe = 1, 25 ns cycle), illustrating the CPI and cycle-time improvements.]

Structural hazards

Consider a pipeline with a unified instruction-data cache.

Clock #      1     2     3     4     5     6     7     8     9     10
Load instr.  IF    ID    EX    MEM   WB
Instr. i+1         IF    ID    EX    MEM   WB
Instr. i+2               IF    ID    EX    MEM   WB
Instr. i+3                     stall IF    ID    EX    MEM   WB
Instr. i+4                                 IF    ID    EX    MEM   WB
Instr. i+5                                       IF    ID    EX    MEM
Instr. i+6                                             IF    ID    EX

Instruction i+3 has to stall, because the load instruction “steals” an instruction-fetch cycle.

In this pipeline, what kind of instructions (what “opcodes”) cause structural hazards?
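To see the same schedule programmatically, here is a minimal Python sketch (purely illustrative; the helper name schedule and the single-port cache model are assumptions, not part of the notes). It delays an instruction's IF whenever an earlier memory-access instruction owns the unified cache port in that cycle, which reproduces the one-cycle stall before instruction i+3 in the table above.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(instrs, total_cycles=10):
    """instrs: list of (name, uses_memory_in_MEM) in program order."""
    mem_busy = set()                 # cycles in which MEM owns the single cache port
    rows = []
    fetch = 1                        # clock cycle of the next instruction fetch
    for name, uses_mem in instrs:
        while fetch in mem_busy:     # structural hazard: cache port taken by MEM
            fetch += 1
        cycles = {stage: fetch + i for i, stage in enumerate(STAGES)}
        if uses_mem:
            mem_busy.add(cycles["MEM"])
        rows.append((name, cycles))
        fetch += 1                   # the instruction behind us fetches next cycle
    # Print a table like the one above, one column per clock cycle.
    print("Instr.   " + " ".join(f"{c:>4}" for c in range(1, total_cycles + 1)))
    for name, cycles in rows:
        cells = {c: "" for c in range(1, total_cycles + 1)}
        for stage, c in cycles.items():
            if c <= total_cycles:
                cells[c] = stage
        print(f"{name:<8} " + " ".join(f"{cells[c]:>4}" for c in range(1, total_cycles + 1)))

schedule([("load", True), ("i+1", False), ("i+2", False),
          ("i+3", False), ("i+4", False), ("i+5", False), ("i+6", False)])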