Introduction to Pipeline

CS510 Computer Architectures

Lecture 6: Introduction to Pipelining

Pipelining: It’s Natural!

Laundry Example

• Ann, Brian, Cathy, and Dave each have one load of clothes (A, B, C, D) to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• Folder takes 20 minutes

Sequential Laundry

[Figure: timeline from 6 PM to midnight; loads A, B, C, D run back to back, each taking 30, 40, and 20 minutes]

• Sequential laundry takes 6 hours for 4 loads

• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start Work ASAP

[Figure: timeline starting at 6 PM; the overlapped stages take 30, 40, 40, 40, 40, and 20 minutes]

• Pipelined laundry takes 3.5 hours for 4 loads
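The 6-hour and 3.5-hour figures can be checked with a small simulation. This is only a sketch: the function names are made up for illustration, and it assumes each machine handles one load at a time and that a load may wait between machines until the next one is free.

```python
def pipelined_finish_time(stage_minutes, num_loads):
    """Return the minute at which the last load leaves the last stage.

    Assumption: each stage handles one load at a time, and a load may
    wait between stages until the next stage is free.
    """
    # stage_free[j] = time at which stage j finishes its current load
    stage_free = [0] * len(stage_minutes)
    for _ in range(num_loads):
        ready = 0  # time at which this load is ready for the next stage
        for j, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[j])
            ready = start + minutes
            stage_free[j] = ready
    return stage_free[-1]


def sequential_finish_time(stage_minutes, num_loads):
    """Each load runs through all stages before the next load starts."""
    return num_loads * sum(stage_minutes)


stages = [30, 40, 20]  # washer, dryer, folder
print(sequential_finish_time(stages, 4) / 60)  # 6.0 hours
print(pipelined_finish_time(stages, 4) / 60)   # 3.5 hours
```

The dryer is the slowest stage, so after the first wash every further load adds 40 minutes, which gives 30 + 4 x 40 + 20 = 210 minutes.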

Pipelining Lessons

• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload

• Pipeline rate is limited by the slowest pipeline stage

• Multiple tasks operate simultaneously

• Potential speedup = number of pipe stages

• Unbalanced lengths of pipe stages reduce speedup

• Time to “fill” the pipeline and time to “drain” it reduce speedup

[Figure: pipelined laundry timeline from 6 PM to 9 PM (30, 40, 40, 40, 40, 20 minutes), with the filling and draining periods marked]
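The lessons about the slowest stage, unbalanced stages, and fill/drain time can be made concrete with a rough speedup estimate. This sketch assumes identical tasks, so the pipeline needs one full pass for the first task plus one slowest-stage interval for each further task; the function name is hypothetical.

```python
def pipeline_speedup(stage_times, num_tasks):
    """Speedup of a pipeline over running the tasks one after another.

    Assumption: identical tasks, so the pipelined time is one full pass
    plus one slowest-stage interval for every further task.
    """
    sequential = num_tasks * sum(stage_times)
    pipelined = sum(stage_times) + (num_tasks - 1) * max(stage_times)
    return sequential / pipelined


print(pipeline_speedup([30, 40, 20], 4))       # laundry example: 360 / 210
print(pipeline_speedup([30, 30, 30], 4))       # balanced stages: 360 / 180 = 2.0
print(pipeline_speedup([30, 30, 30], 10_000))  # approaches 3, the number of stages
```

With only 4 tasks, fill and drain time keeps even the balanced pipeline at a speedup of 2. The potential speedup of 3 is approached only when the number of tasks is large.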

DLX Instructions

Instruction type / Opcode: Instruction meaning

Data transfers: the only memory addressing mode is 16-bit displacement + contents of a GPR

LB, LBU, SB: Load byte, load byte unsigned, store byte

LH, LHU, SH: Load half word, load half word unsigned, store half word

LW, SW: Load word, store word (to/from integer registers)

LF, LD, SF, SD: Load SP float, load DP float, store SP float, store DP float

MOVI2S, MOVS2I: Move from/to GPR to/from a special register

MOVF, MOVD: Copy one FP register or a DP pair to another register or pair

MOVFP2I, MOVI2FP: Move 32 bits from/to FP registers to/from integer registers

Arithmetic/logical

ADD, ADDI, ADDU, ADDUI: Add, add immediate (16 bits); signed and unsigned

SUB, SUBI, SUBU, SUBUI: Subtract, subtract immediate; signed and unsigned

MULT, MULTU, DIV, DIVU: Multiply and divide, signed and unsigned; operands must be FP registers; all operations take and yield 32-bit values

AND, ANDI: And, and immediate

OR, ORI, XOR, XORI: Or, or immediate, exclusive or, exclusive or immediate

LHI: Load high immediate; loads the upper half of a register with the immediate

DLX Instructions (continued)

SLL, SRL, SRA, SLLI, SRLI, SRAI: Shifts, both immediate (S__I) and variable form (S__); logical and arithmetic

S__, S__I: Set conditional; “__” may be LT, GT, LE, GE, EQ, NE

Control: conditional branches and jumps; PC-relative or through register

BEQZ, BNEZ: Branch GPR equal/not equal to zero; 16-bit offset from PC+4

BFPT, BFPF: Test comparison bit in the FP status register and branch; 16-bit offset

J, JR: Jumps; 26-bit offset or target in register

JAL, JALR: Jump and link; save PC+4 in R31

TRAP: Transfer to operating system at a vectored address

RFE: Return to user code from an exception; restore user mode

Floating point: FP operations on DP and SP formats

FcnD, FcnF: DP and SP operations, where Fcn is ADD, SUB, MULT, or DIV

CVTF2D, CVTF2I, CVTD2F, CVTD2I, CVTI2F, CVTI2D: Convert instructions; F = single precision, D = double precision, I = integer; both operands are FPRs

__D, __F: DP and SP compares; “__” = LT, GT, LE, GE, EQ, NE; sets bits in the FP status register

DLX Instruction Format

I-type instruction

Opcode (6 bits) | rs1 (5) | rd (5) | Immediate (16)

Loads, stores, all immediates, conditional branches, jump register, jump and link register

R-type instruction

Opcode (6 bits) | rs1 (5) | rs2 (5) | rd (5) | func (11)

Register-register ALU operations; func selects the operation: Add, Sub, ...

J-type instruction

Opcode (6 bits) | Offset added to PC (26)

Jump, jump and link, trap, and return from exception
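The three formats share fixed field positions, so all fields can be extracted without knowing the instruction type. The sketch below is illustrative only. The function name and the sample opcode value are made up, and bit 0 is assumed to be the most significant bit, following the IR6..10 style of numbering used in these slides.

```python
def decode_fields(word):
    """Split a 32-bit DLX instruction word into the fields of all three formats.

    Assumption: bit 0 is the most significant bit, so the opcode is the
    top 6 bits of the word.
    """
    return {
        "opcode": (word >> 26) & 0x3F,     # IR0..5
        "rs1": (word >> 21) & 0x1F,        # IR6..10
        "rs2_or_rd": (word >> 16) & 0x1F,  # IR11..15: rd (I-type) or rs2 (R-type)
        "rd_rtype": (word >> 11) & 0x1F,   # IR16..20: rd (R-type)
        "func": word & 0x7FF,              # IR21..31 (R-type)
        "immediate": word & 0xFFFF,        # IR16..31 (I-type), not yet sign-extended
        "offset": word & 0x3FFFFFF,        # IR6..31 (J-type)
    }


# Hypothetical I-type word; opcode 0x23 is a placeholder value for illustration.
word = (0x23 << 26) | (2 << 21) | (1 << 16) | 0x0064
fields = decode_fields(word)
print(fields["opcode"], fields["rs1"], fields["rs2_or_rd"], fields["immediate"])  # 35 2 1 100
```

Which of the extracted fields is meaningful depends on the opcode. Hardware can read all of them in parallel and select afterwards.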

5 Steps of DLX Instr. Execution: Step 1

Step 1: Instruction fetch cycle (IF)

– Read the instruction from memory and store it into IR

• IR ← Mem[PC]

– Calculate the next instruction address

• NPC ← PC + 4

• Each instruction is stored in 4 consecutive bytes

[Figure: IF stage with instruction memory]

5 Steps of DLX Instr. Execution: Step 2

Step 2: Instruction decode/register fetch cycle (ID)

– Read source registers into A and B

• A ← Regs[IR6..10]

• B ← Regs[IR11..15]

– Sign-extend the 16-bit immediate field to a 32-bit immediate value

• Imm ← ((IR16)^16 ## IR16..31)

– Decoding is done in parallel with the register reads (fixed-field decoding)

[Figure: ID stage with register file and sign extender; b is the destination register (Rd) address]
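The sign extension Imm ← ((IR16)^16 ## IR16..31) repeats the top bit of the immediate 16 times and concatenates the result with the immediate itself. Below is a minimal sketch with a made-up function name. It returns the 32-bit pattern as an unsigned Python integer.

```python
def sign_extend_16(imm16):
    """Sign-extend a 16-bit field to a 32-bit value, returned as an unsigned int."""
    imm16 &= 0xFFFF
    sign = (imm16 >> 15) & 1       # IR16, the top bit of the immediate
    upper = 0xFFFF if sign else 0  # (IR16)^16: the sign bit repeated 16 times
    return (upper << 16) | imm16   # ## is concatenation


print(hex(sign_extend_16(0x0064)))  # 0x64
print(hex(sign_extend_16(0xFFFC)))  # 0xfffffffc, i.e. -4 in two's complement
```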

5 Steps of DLX Instr. Execution: Step 3

Step 3: Execution/effective address cycle (EX)

– Memory reference: effective address calculation

» ALUOutput ← A + Imm

– Register-register ALU instruction: perform the ALU operation on the register operands

» ALUOutput ← A func B

– Register-immediate ALU instruction: perform the ALU operation with the immediate operand

» ALUOutput ← A op Imm

– Branch: calculate the branch target address and determine the condition

» ALUOutput ← NPC + Imm; Cond ← (A op 0)

[Figure: Step 3 EX datapath, producing ALUOut]

5 Steps of DLX Instr. Execution: Step 4

Step 4: Memory access/branch completion cycle (MEM)

– Memory reference: access memory

• for LD: LMD ← Mem[ALUOutput]

• for ST: Mem[ALUOutput] ← B

– Branch: test the condition

• if (cond) PC ← ALUOutput else PC ← NPC

[Figure: MEM stage with data memory, addressed by ALUOut]

5 Steps of DLX Instr. Execution: Step 5

Step 5: Write-back cycle (WB)

– Reg-Reg ALU: store the result into the destination register

• Regs[IR16..20] ← ALUOutput

– Reg-Immediate ALU: store the result into the destination register

• Regs[IR11..15] ← ALUOutput

– Load instruction: store the data read from memory into the destination register

• Regs[IR11..15] ← LMD

[Figure: WB stage writing ALUOut or LMD to the register file]
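The five steps can be traced for a single load instruction. The sketch below is a simplification, not a full DLX model. The function name is made up, memories are dictionaries holding whole words keyed by address, and the opcode is never inspected because only a load is handled.

```python
def run_load_word(pc, regs, instr_mem, data_mem):
    """Walk one I-type load (rd <- Mem[rs1 + imm]) through the five steps.

    Assumption: memories are dicts keyed by address holding whole 32-bit
    words, which ignores alignment and byte order.
    """
    # Step 1 (IF): fetch the instruction and compute the next PC.
    ir = instr_mem[pc]
    npc = pc + 4

    # Step 2 (ID): read the source register and sign-extend the immediate.
    a = regs[(ir >> 21) & 0x1F]  # Regs[IR6..10]
    imm = ir & 0xFFFF            # IR16..31
    if imm & 0x8000:
        imm -= 0x10000           # sign extension as a Python int

    # Step 3 (EX): effective address.
    alu_output = a + imm

    # Step 4 (MEM): read data memory.
    lmd = data_mem[alu_output]

    # Step 5 (WB): write the loaded value to the destination register.
    regs[(ir >> 16) & 0x1F] = lmd  # Regs[IR11..15]
    return npc


regs = [0] * 32
regs[2] = 1000
# Opcode 0x23 is an arbitrary placeholder; this sketch never inspects it.
instr_mem = {0: (0x23 << 26) | (2 << 21) | (1 << 16) | (-8 & 0xFFFF)}
data_mem = {992: 42}
next_pc = run_load_word(0, regs, instr_mem, data_mem)
print(next_pc, regs[1])  # 4 42
```

The local variables ir, npc, a, imm, alu_output, and lmd play the role of the temporary registers that carry values from one step to the next.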

5 Steps of DLX Datapath

[Figure: DLX datapath divided into IF, ID, EX, MEM, and WB stages, showing instruction memory, register file, sign extender, ALU (ALUOutput), and data memory]

A Simple Implementation

• A multi-cycle implementation

– needs temporary registers: NPC, IR, A, B, Imm, Cond, ALUOutput, LMD

– CPI improvements: branches complete in 4 cycles, ALU instructions in 4 cycles

• if branch frequency is 12% and ALU instruction frequency is 44%, with the remaining 44% memory-reference (MR) instructions taking 5 cycles:

CPI = 0.12 x 4 + 0.44 x 4 + 0.44 x 5 = 4.44

• A single-cycle implementation

– one long clock cycle

– very inefficient for most machines, because the amount of work varies considerably among instructions

– requires duplication of functional units (FUs) that could be shared in a multi-cycle implementation
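The CPI figure for the multi-cycle implementation is a frequency-weighted average of cycle counts. As a quick check, using the instruction mix stated above (the dictionary layout is just for illustration):

```python
# Instruction mix and cycle counts as given for the multi-cycle implementation.
mix = {
    "branch": (0.12, 4),            # (frequency, cycles)
    "alu": (0.44, 4),
    "memory_reference": (0.44, 5),
}

# The frequencies should cover the whole instruction stream.
assert abs(sum(freq for freq, _ in mix.values()) - 1.0) < 1e-9

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 2))  # 4.44
```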

Visualizing Pipeline

[Figure: pipeline diagram. Five instructions in program order (vertical axis) against time in clock cycles CC1 to CC9 (horizontal axis); each instruction uses IM, Reg, DM, and Reg in successive cycles. The first cycles fill the pipeline and the last cycles drain it.]
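A text version of such a diagram can be generated with a few lines of code. This sketch uses the stage names IF, ID, EX, MEM, WB from the earlier slides and assumes an ideal pipeline with no stalls; the function name is made up.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; column c corresponds to clock cycle c+1."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        cells = ["-"] * total_cycles
        for s, name in enumerate(stages):
            cells[i + s] = name  # instruction i is in stage s during cycle i+s+1
        rows.append(cells)
    return rows


rows = pipeline_diagram(5)
print(" ".join(f"CC{c + 1:<3}" for c in range(len(rows[0]))))
for row in rows:
    print(" ".join(f"{cell:<5}" for cell in row))
```

Five instructions in a five-stage pipeline occupy 5 + 5 - 1 = 9 clock cycles, matching CC1 to CC9 in the figure. All five stages are busy only in CC5.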

Saving Information Produced by Each Stage of Pipeline

• Information needs to be stored at the end of a clock cycle; otherwise it will be lost

• Each pipeline stage produces information (data, address, and control) at the end of the clock cycle

• Thus, we need storage (called an inter-stage buffer) at the end of each pipeline stage

Inter-Stage Buffer in DLX Pipeline

• F/D Buffer

– IR, NPC

• D/A Buffer

– A, B, Imm, b (destination register address to store result), OP (opcode), cond

– NPC

• A/M Buffer

– ALUout (arithmetic result or effective address)

– NPC, cond, b, OP

• M/W Buffer

– LMD (data for LD)

– ALUout (arithmetic result), b, OP
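The buffer contents listed for the DLX pipeline can be written down as plain records. The sketch below is one possible way to model them. The class and field names are illustrative, and the execute function covers only the memory-reference case (ALUout = A + Imm).

```python
from dataclasses import dataclass


@dataclass
class FDBuffer:  # between IF and ID
    ir: int = 0
    npc: int = 0


@dataclass
class DABuffer:  # between ID and EX
    a: int = 0
    b_value: int = 0  # B, the second register operand
    imm: int = 0
    dest: int = 0     # "b" in the list: destination register address
    op: int = 0
    cond: bool = False
    npc: int = 0


@dataclass
class AMBuffer:  # between EX and MEM
    alu_out: int = 0
    npc: int = 0
    cond: bool = False
    dest: int = 0
    op: int = 0


@dataclass
class MWBuffer:  # between MEM and WB
    lmd: int = 0
    alu_out: int = 0
    dest: int = 0
    op: int = 0


def execute(da: DABuffer) -> AMBuffer:
    """EX stage for a memory reference: compute the address, pass the rest along."""
    return AMBuffer(alu_out=da.a + da.imm, npc=da.npc, cond=da.cond,
                    dest=da.dest, op=da.op)


am = execute(DABuffer(a=1000, imm=-8, dest=1, op=0x23, npc=4))
print(am.alu_out, am.dest)  # 992 1
```

The destination register address and opcode are copied from buffer to buffer. This is why they appear in every buffer after decode: the WB stage still needs them several cycles later, when IR already holds a different instruction.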

Pipelined DLX Datapath (Multicycle)

[Figure: pipelined DLX datapath with IF, ID, EX, MEM, and WB stages, showing instruction memory, register file, sign extender, and data memory]

Reminder

• With a conventional single-port memory, instruction memory and data memory are the same memory

– Both the IF and MEM stages use memory

– One instruction uses the same hardware resource in two different cycles

– Two instructions in different pipeline stages try to use the same hardware resource at the same time

• For branch instructions, the branch target address is available in the MEM stage
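The memory conflict described in this reminder can be located mechanically. The sketch below assumes an ideal five-stage pipeline with no stalls, so instruction i (counting from 0) is in IF during cycle i+1 and in MEM during cycle i+4. The function name and the sample program are hypothetical.

```python
def memory_conflicts(uses_data_memory):
    """List (cycle, mem_instr, fetch_instr) clashes on a single-port memory.

    uses_data_memory[i] is True when instruction i is a load or store.
    Assumption: ideal 5-stage pipeline with no stalls, so instruction i
    (0-based) is in IF during cycle i+1 and in MEM during cycle i+4.
    """
    conflicts = []
    n = len(uses_data_memory)
    for i, is_mem in enumerate(uses_data_memory):
        fetcher = i + 3  # the instruction in IF while instruction i is in MEM
        if is_mem and fetcher < n:
            conflicts.append((i + 4, i, fetcher))
    return conflicts


# Hypothetical program: a load followed by four non-memory instructions.
print(memory_conflicts([True, False, False, False, False]))  # [(4, 0, 3)]
```

In cycle 4 the load (instruction 0) reads data in MEM while instruction 3 is being fetched in IF. With one memory port only one of the two accesses can proceed.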
