53
Computer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard University Instructor: Prof. David Brooks [email protected]

Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Embed Size (px)

Citation preview

Page 1: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Computer Science 246Advanced Computer Architecture

Spring 2010Harvard University

Instructor: Prof. David [email protected]

Page 2: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Lecture Outline

• Instruction-Level Parallelism• Scoreboarding (A.8)

Page 3: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction Level Parallelism (ILP)

• Quest for ILP drives significant research– Dynamic instruction scheduling

• Scoreboarding• Tomasulo’s Algorithm• Register Renaming (removing artificial dependencies

WAR/WAWs)

– Dynamic Branch Prediction– Superscalar/Multiple instruction Issue– Hardware Speculation

• All of these things fit well together

Page 4: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Dynamic Scheduling

• Scheduling tries to re-arrange instructions to improve performance

• Previously:– We assume when ID detects a hazard that cannot be

hidden by bypassing/forwarding pipeline stalls– AND, we assume the compiler tries to reduce this

• Now:– Re-arrange instructions at runtime to reduce stalls– Why hardware and not compiler?

Page 5: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Goals of Scheduling

• Goal of Static Scheduling– Compiler tries to avoids/reduce dependencies

• Goal of Dynamic Scheduling– Hardware tries to avoid stalling when present

• Why hardware and not compiler?– Code Portability– More information available dynamically (run-time)– ISA can limit registers ID space– Speculation sometimes needs hardware to work well– Not everyone uses “gcc –O3”

Page 6: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Dynamic Scheduling: Basic Idea

• ExampleDIVD F0, F2, F4ADDD F10, F0, F8SUBD F12, F8, F14

• Hazard detection during decode stalls whole pipeline• No reason to stall in these cases: Out-of-Order Execution

– Reduces stalls, improved FU utilization, more parallel execution

– To give the appearance of sequential execution: precise interrupts

– First, we will study without this, then see how to add them

Dependency

Independent

Page 7: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

How to implement?

• Previously:– Instruction Decode and Operation Fetch are a single cycle

• Now:– Split ID into two phases

• Issue – decode instructions, check for structural hazards• Read Operands – Wait until no data hazards, then read ops

– Dividing hazard checks into a two-step process

• Out-of-order execution => WAR/WAW hazardsDIVD F0, F2, F4 DIVD F0, F2, F4ADDD F10, F0, F8 ADDD F10,F0, F8SUBD F8, F8, F14 SUBD F10, F8, F14WAR WAW

Page 8: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding Basics

• Previously:– Instruction Decode and Operation Fetch are a single cycle

• Now:– To support multiple instructions in ID stage, we need two

things• Buffered storage (Instruction Buffer/Window/Queue)• Split ID into two phases

IF ID EX M WB

IF IS EX M WBRd

Standard Pipeline

New PipelineI-Buffer/Scoreboard

Page 9: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding

• Centralized scheme– No bypassing– WAR/WAW hazards

are a problem

• Originally proposed in CDC6600 (S. Cray, 1964)

Page 10: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding Stages – Issue(Or Dispatch)

• Fetch – Same as before• Issue (Check Structural Hazards)

– If FU is free an no other active instruction has same destination register (WAW), then issue instruction

– Do not issue until structural hazards cleared– Stalled instruction stay in I-Buffer– Size of buffer is also a structural Hazard

• May have to stall Fetch if buffer fills

– Note: Issue is In-Order, stalls stops younger instructions

Page 11: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding Stages –Read Operands (Or Issue!)

• Read Operands (Check Data Hazards)– Check scoreboard for whether source operands are

available– Available?

• No earlier issued active instructions will write register• No currently active FU is going to write it

– Dynamically avoids RAW hazards

Page 12: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding Stages –Execution/Write Result

• Execution– Execute/Update scoreboard

• Write Result– Scoreboard checks for WAR stalls and stalls completing

instruction, if necessary– Before, stalls only occur at the beginning of instructions,

now it can be at the end as well– Can happen if:

• Completing instruction destination register matches an older instruction that has not yet read its source operands

Page 13: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding Control Hardware

• Three main parts– Instruction Status Bits

• Indicate which of the four stages instruction is in

– Functional Unit Status Bits• Busy (In Use or not), Operation being Performed• Fi -- Destination Register, Fj, Fk, -- Source Registers• Qj, Qk – FU producing source regs Fj, Fk• Rj, Rk – Flags indicating when Fj, Fk are ready but not yet read

– Register Result Status• Which FU will write each register

Page 14: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboard ExampleInstruction status: Read Exec Write

Instruction j k Issue Oper Comp ResultLD F6 34+ R2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

FU

Example courtesy of Prof. Broderson, CS152, UCB, Copyright (C) 2001 UCB

Page 15: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

1 FU Integer

Scoreboard Example: Cycle 1

Page 16: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

2 FU Integer

• Issue 2nd LD?

Scoreboard Example: Cycle 2

Page 17: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F6 R2 NoMult1 NoMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

3 FU Integer

• Issue MULT?

Scoreboard Example: Cycle 3

Page 18: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

4 FU Integer

Scoreboard Example: Cycle 4

Page 19: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F2 R3 YesMult1 NoMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

5 FU Integer

Scoreboard Example: Cycle 5

Page 20: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6MULTD F0 F2 F4 6SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

6 FU Mult1 Integer

Scoreboard Example: Cycle 6

Page 21: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7MULTD F0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F2 R3 NoMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

7 FU Mult1 Integer Add

• Read multiply operands?

Scoreboard Example: Cycle 7

Page 22: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7MULTD F0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6 8ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F2 R3 NoMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

8 FU Mult1 Integer Add Divide

Scoreboard Example: Cycle 8a (First half of clock cycle)

Page 23: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6 8ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 Yes Mult F0 F2 F4 Yes YesMult2 NoAdd Yes Sub F8 F6 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

8 FU Mult1 Add Divide

Scoreboard Example: Cycle 8b (Second half of clock cycle)

Page 24: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No10 Mult1 Yes Mult F0 F2 F4 Yes Yes

Mult2 No2 Add Yes Sub F8 F6 F2 Yes Yes

Divide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

9 FU Mult1 Add Divide

• Read operands for MULT & SUB? Issue ADDD?

Note Remaining

Scoreboard Example: Cycle 9

Page 25: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No9 Mult1 Yes Mult F0 F2 F4 No No

Mult2 No1 Add Yes Sub F8 F6 F2 No No

Divide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3010 FU Mult1 Add Divide

Scoreboard Example: Cycle 10

Page 26: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11DIVD F10 F0 F6 8ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No8 Mult1 Yes Mult F0 F2 F4 No No

Mult2 No0 Add Yes Sub F8 F6 F2 No No

Divide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 Add Divide

Scoreboard Example: Cycle 11

Page 27: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No7 Mult1 Yes Mult F0 F2 F4 No No

Mult2 NoAdd NoDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3012 FU Mult1 Divide

• Read operands for DIVD?

Scoreboard Example: Cycle 12

Page 28: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No6 Mult1 Yes Mult F0 F2 F4 No No

Mult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3013 FU Mult1 Add Divide

Scoreboard Example: Cycle 13

Page 29: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No5 Mult1 Yes Mult F0 F2 F4 No No

Mult2 No2 Add Yes Add F6 F8 F2 Yes Yes

Divide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3014 FU Mult1 Add Divide

Scoreboard Example: Cycle 14

Page 30: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No4 Mult1 Yes Mult F0 F2 F4 No No

Mult2 No1 Add Yes Add F6 F8 F2 No No

Divide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3015 FU Mult1 Add Divide

Scoreboard Example: Cycle 15

Page 31: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No3 Mult1 Yes Mult F0 F2 F4 No No

Mult2 No0 Add Yes Add F6 F8 F2 No No

Divide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU Mult1 Add Divide

Scoreboard Example: Cycle 16

Page 32: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No2 Mult1 Yes Mult F0 F2 F4 No No

Mult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3017 FU Mult1 Add Divide

• Why not write result of ADD???

WAR Hazard!

Scoreboard Example: Cycle 17

Page 33: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No1 Mult1 Yes Mult F0 F2 F4 No No

Mult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3018 FU Mult1 Add Divide

Scoreboard Example: Cycle 18

Page 34: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No0 Mult1 Yes Mult F0 F2 F4 No No

Mult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3019 FU Mult1 Add Divide

Scoreboard Example: Cycle 19

Page 35: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Yes Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3020 FU Add Divide

Scoreboard Example: Cycle 20

Page 36: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21ADDD F6 F8 F2 13 14 16

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Yes Yes

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3021 FU Add Divide

• WAR Hazard is now gone...

Scoreboard Example: Cycle 21

Page 37: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21ADDD F6 F8 F2 13 14 16 22

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd No

39 Divide Yes Div F10 F0 F6 No No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3022 FU Divide

Scoreboard Example: Cycle 22

Page 38: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Faster than light computation

(skip a couple of cycles)

Page 39: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61ADDD F6 F8 F2 13 14 16 22

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd No

0 Divide Yes Div F10 F0 F6 No No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3061 FU Divide

Scoreboard Example: Cycle 61

Page 40: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61 62ADDD F6 F8 F2 13 14 16 22

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3062 FU

Scoreboard Example: Cycle 62

Page 41: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61 62ADDD F6 F8 F2 13 14 16 22

Functional unit status: dest S1 S2 FU FU Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd NoDivide No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3062 FU

• In-order issue; out-of-order execute & commit

Review: Scoreboard Example: Cycle 62

Page 42: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding ReviewLD F6 34+ R2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

1 2 3 4 5 6 7 8 9 10 11 12 13

LD F6, 34(R2) Iss Rd Ex Wb

LD F2, 45(R3) Iss Rd Ex Wb

MULTD F0, F2, F4 Iss Iss Iss Rd M1 M2 M3 M4

SUBD F8, F6, F2 Iss Iss Rd A1 A2 Wb

DIVD F10, F0, F6 Iss Iss Iss Iss Iss Iss

ADDD F6, F8, F2 Iss

Page 43: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding ReviewLD F6 34+ R2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

11 12 13 14 15 16 17 18 19 20 21 22 …. 62

LD F6, 34(R2)

LD F2, 45(R3)

MULTD F0, F2, F4 M2 M3 M4 M5 M6 M7 M8 M9 M10 Wb

SUBD F8, F6, F2 A2 Wb

DIVD F10, F0, F6 Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Rd D1 Wb

ADDD F6, F8, F2 Iss Rd A1 A2 A2 A2 A2 A2 A2 Wb

Page 44: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Scoreboarding Limitations

• Number and type of functional units• Number of instruction buffer entries (scoreboard

size)• Amount of application ILP (RAW hazards)• Presence of antidependencies (WAR) and output

dependencies (WAW)– Inorder issue for WAW/Structural Hazards limits

scheduler– WAR stalls are critical for loops (hardware loop

unrolling)

Page 45: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Tomasulo’s Approach

• Used in IBM 360/91 Machines (Late 60s)• Similar to scoreboarding, but added renaming• Key concept: Reservation Stations

• Very Important Topic– Scheduling ideas led to Alpha 21264, HP PA-8000,

MIPS R10K, Pentium III, Pentium 4, PowerPC 604, etc…

Page 46: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Reservation Stations (RS)

• Distributed (rather than centralized) control scheme– Bypassing is allowed via Common Data Bus (CDB) to RS– Register Renaming eliminates WAR/WAW hazards

• Scoreboard/Instruction Buffer => Reservation Stations– Fetch and Buffer operands as soon as available

• Eliminates need to always get values from registers at execute

– Pending instructions designate reservation stations that will provide their inputs

– Successive writes to a register cause only the last one to update the register

Page 47: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Register Renaming

• Compiler can eliminate some WAW/WAR “false”hazards, but not all– Not enough registers– Hazards across branches (common!) – can eliminate on

taken, or fall through but not both– Hazards with itself -- dynamic loops (example later)

• Example (spill code causing “false hazards”)ADD R3, R1, R2SW R3, 0(R4)SUB R3, R1, R2

C = A + BD = A - B

Page 48: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Register Renaming

• Dynamically change register names to eliminate “false dependencies” (WAR/WAW hazards)

• Architectural registers: Names not Locations– Many more locations (“reservation stations” or “physical

registers”) than names (“logical or architectural registers”)– Dynamically map names to locations

Page 49: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Register Renaming Example

DIV F0, F2, F4ADD F6, F0, F8SW F6, 0(R1)SUB F8, F10, F14MUL F6, F10, F8

DIV F0, F2, F4ADD S, F0, F8SW S, 0(R1)SUB T, F10, F14MUL F6, F10, T

Assume temporary registers S and T

Page 50: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Register Renaming with Tomasulo

• At instruction issue:– Register specifiers for source operands are renamed to the

names of the reservation stations– Values can exist in reservation station or register file

• To eliminate WARs, register file values are copied to reservation stations at issue

• Other methods example use pointer-based renaming (map-table)

• Technique used in Pentium III, PowerPC604

Page 51: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Reservation Station Components• Op: Operation to perform in the unit• Qj, Qk: Reservation stations producing source registers

(value to be written)– Note: No ready flags needed as in Scoreboard– Qj,Qk=0 => ready– Store buffers only have Qi for RS producing result

• Vj, Vk: Value of Source operands– Store buffers has V field, result to be stored

• Busy: Indicates reservation station or FU are occupied• Register Result Status: Indicates which functional unit will

write each register, if one exists. Blank when no pending instructions that will write that register.

Page 52: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

Three Stages of Tomasulo Algorithm

1. Issue—get instruction from FP Op QueueIf reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

2.Execution—operate on operands (EX)When both operands ready then execute;if not ready, watch Common Data Bus for result

3.Write result—finish execution (WB)Write on Common Data Bus to all awaiting units; mark reservation station available

Page 53: Computer Science 246 Advanced Computer …dbrooks/cs246/cs246-scoreboarding.pdfComputer Science 246 David Brooks Computer Science 246 Advanced Computer Architecture Spring 2010 Harvard

Computer Science 246David Brooks

For next time

• Tomasulo’s Algorithm– Section 3.2/3.3 of H&P

• Branch Prediction– Section 3.4/3.5 of H&P