Lecture 5: Interrupts, Superscalar Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008


© 2008 Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti, Katz


Admin

• Homework #1 due today
• Homework #2 assigned
• Reading
  – H&P Chapters 2 & 3 (suggested)
  – Research papers (not yet ready to read, but will be soon!):
    » Hinton et al.: “The Microarchitecture of the Pentium 4 Processor”
    » Palacharla, Jouppi, and Smith: “Complexity-Effective Superscalar Processors”
    » Akkary, Rajwar, and Srinivasan: “Checkpoint Processing and Recovery”


Review: Hazards

Data Hazards
• RAW
  – only one that can occur in simple 5-stage pipeline
• WAR, WAW
• Data Forwarding (Register Bypassing)
  – send data from one stage to another, bypassing the register file
• Still have load-use delay

Structural Hazards
• Replicate hardware, scheduling

Control Hazards
• Compute condition and target early (delayed branch)
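The three data-hazard types can be told apart by comparing the registers two instructions read and write. The following is a minimal sketch, not part of the original slides; the `Instr` tuple and the register names are assumptions made for illustration.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, a tuple of sources."""
    op: str
    dest: str
    srcs: tuple

def classify_hazards(older: Instr, younger: Instr) -> set:
    """Return the data hazards between an older and a younger instruction."""
    hazards = set()
    if older.dest in younger.srcs:
        hazards.add("RAW")   # younger reads what older writes
    if younger.dest in older.srcs:
        hazards.add("WAR")   # younger writes what older still has to read
    if younger.dest == older.dest:
        hazards.add("WAW")   # both write the same register
    return hazards

add = Instr("add", "r1", ("r2", "r3"))
sub = Instr("sub", "r4", ("r1", "r5"))   # reads r1 produced by add
orr = Instr("or", "r2", ("r6", "r7"))    # overwrites r2 read by add
mul = Instr("mul", "r1", ("r8", "r9"))   # overwrites r1 written by add

print(classify_hazards(add, sub))
print(classify_hazards(add, orr))
print(classify_hazards(add, mul))
```

In the simple in-order 5-stage pipeline only the RAW case can cause trouble, because reads happen early (ID) and writes happen late (WB) and in program order.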


Review: Dynamic Branch Prediction

• Solution: 2-bit counter where prediction changes only if it mispredicts twice
• Increment for taken, decrement for not-taken
  – 00, 01, 10, 11

• Helps when target is known before condition

[Figure: state diagram of the 2-bit counter. Two "Predict Taken" states and two "Predict Not Taken" states, connected by T (taken) and NT (not taken) transitions.]
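The 2-bit counter can be sketched in a few lines. This is an illustrative sketch rather than the slide's exact state machine: it assumes the common convention that states 10 and 11 predict taken and that the counter saturates at 00 and 11. The class and function names are made up for the example.

```python
class TwoBitCounter:
    """Saturating counter with states 0..3 (00, 01, 10, 11)."""

    def __init__(self, state: int = 0):
        self.state = state

    def predict(self) -> bool:
        # Assumed convention: upper two states predict taken.
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Increment for taken, decrement for not-taken, saturating at the ends.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes, start_state=3):
    """Run the counter over a list of branch outcomes (True = taken)."""
    ctr = TwoBitCounter(start_state)
    wrong = 0
    for taken in outcomes:
        if ctr.predict() != taken:
            wrong += 1
        ctr.update(taken)
    return wrong

# A loop branch: taken four times, not taken at loop exit, executed twice.
loop = [True] * 4 + [False] + [True] * 4 + [False]
mispredicted = count_mispredictions(loop)
print(mispredicted)
```

With this trace the counter mispredicts only the two loop exits. A 1-bit scheme would also mispredict the first iteration each time the loop is re-entered.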


Review: Correlating Branches

• Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)

• Tournament: choose between alternative predictors
• How do you choose?

[Figure: correlating predictor. Labels: branch address, 2-bit per-branch predictors, 2-bit global branch history, prediction.]
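A correlating predictor of the kind pictured can be sketched as a table of 2-bit counters selected by low bits of the branch address and by the 2-bit global history. The table size, the 4-byte instruction assumption, the initial counter value, and all names below are assumptions for illustration only.

```python
class CorrelatingPredictor:
    """2-bit counters indexed by (branch address bits, global history)."""

    def __init__(self, index_bits=4, history_bits=2):
        self.index_bits = index_bits
        self.history_bits = history_bits
        self.history = 0  # outcomes of the most recent branches, newest in bit 0
        rows, cols = 1 << index_bits, 1 << history_bits
        self.table = [[1] * cols for _ in range(rows)]  # start weakly not-taken

    def _row(self, pc):
        # Assumes 4-byte instructions: drop the low two address bits.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def predict(self, pc):
        return self.table[self._row(pc)][self.history] >= 2

    def update(self, pc, taken):
        row = self._row(pc)
        ctr = self.table[row][self.history]
        self.table[row][self.history] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

def mispredictions(predictor, pc, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong

# A branch that strictly alternates taken / not taken.
alternating = [True, False] * 10
wrong = mispredictions(CorrelatingPredictor(), 0x40, alternating)
print(wrong)
```

After a short warm-up the history bits distinguish the two phases of the alternating pattern, so each phase gets its own counter. A single 2-bit counter without history cannot learn this pattern.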


Review: Need Address @ Same Time as Prediction

• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  – Note: must check for a branch match now, since we can’t use the wrong branch’s address

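• Procedure return addresses predicted with a stack

[Figure: BTB with entries 0 to n-1. The PC of the instruction to fetch is compared (=) against the stored branch PCs. On a match ("Yes, use predicted PC") the entry supplies the predicted PC and the taken/not-taken prediction; otherwise "No, not branch".]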
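The BTB lookup, including the tag check that guards against using another branch's target, might look like the sketch below. It assumes a direct-mapped buffer, 4-byte instructions, and that only predicted-taken branches are stored; none of these details come from the slides.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry holds (branch_pc, predicted_pc)."""

    def __init__(self, n_entries=16):
        self.n = n_entries
        self.entries = [None] * n_entries

    def _index(self, pc):
        return (pc >> 2) % self.n

    def lookup(self, pc):
        """Return the predicted PC, or None if pc is not a known branch."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:  # the "=" comparison in the figure
            return entry[1]
        return None

    def insert(self, pc, target):
        self.entries[self._index(pc)] = (pc, target)

def next_fetch_pc(btb, pc):
    """Use the predicted PC on a hit, otherwise fall through sequentially."""
    predicted = btb.lookup(pc)
    return predicted if predicted is not None else pc + 4

btb = BranchTargetBuffer()
btb.insert(0x100, 0x200)           # branch at 0x100, predicted taken to 0x200
print(hex(next_fetch_pc(btb, 0x100)))
print(hex(next_fetch_pc(btb, 0x140)))  # same BTB index as 0x100, but the tag differs
```

Without the tag comparison, the instruction at 0x140 would be redirected to 0x200 even though it is not the branch that trained the entry.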


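Review: Multicycle Ops in Pipeline

[Figure: pipeline with IF and ID/RF feeding parallel execution paths: a single EX stage, a pipelined unit with stages M1 to M7, a pipelined unit with stages A1 to A4, and an FP/INT divide unit that is not pipelined (25 clocks); the paths rejoin at MEM and WB.]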


Interrupts and Exceptions

• Unnatural change in control flow
• Warning: varying terminology
  – “exception” sometimes refers to all cases
  – “trap”: software trap, hardware trap
• Exception is potential problem with program
  – condition occurs within the processor
  – segmentation fault
  – bus error
  – divide by 0
  – Don’t want my bug to crash the entire machine
  – page fault (virtual memory…)


Interrupts and Exceptions

• Interrupt is external event
  – devices: disk, network, keyboard, etc.
  – clock for timeslicing
  – These are useful events; must do something when they occur.
• Trap is user-requested exception
  – operating system call (syscall)


Handling an Exception/Interrupt

[Figure: control transfers from the user program's instruction stream (ld, add, st, div, beq, ld, sub, bne) to the interrupt handler, which returns with RETT.]

• Invoke specific kernel routine based on type of interrupt
  – interrupt/exception handler
• Must determine what caused interrupt
  – could use software to examine each device
  – PC = interrupt_handler
• Vectored interrupts
  – PC = interrupt_table[i]
• Kernel initializes table at boot time
• Clear the interrupt
• May return from interrupt (RETT) to different process (e.g., context switch)
• Similar mechanism is used to handle interrupts, exceptions, traps
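The difference between a single handler entry point and vectored interrupts can be sketched in software. The handlers, the table layout, and the `log` list are hypothetical; a real kernel would install addresses in a hardware-defined table rather than Python functions.

```python
# Hypothetical handlers; real ones would talk to device registers.
def timer_handler(log):
    log.append("timer: schedule next timeslice")

def disk_handler(log):
    log.append("disk: transfer complete")

def keyboard_handler(log):
    log.append("keyboard: read input")

# The kernel initializes the table at boot time.
interrupt_table = [timer_handler, disk_handler, keyboard_handler]

def vectored_dispatch(i, log):
    """PC = interrupt_table[i]: the hardware supplies the index i."""
    interrupt_table[i](log)

def polled_dispatch(pending, log):
    """PC = interrupt_handler: one entry point examines each device in turn."""
    for i, handler in enumerate(interrupt_table):
        if pending[i]:
            handler(log)
            pending[i] = False  # clear the interrupt

log = []
vectored_dispatch(1, log)                  # disk interrupt, dispatched directly
polled_dispatch([True, False, True], log)  # timer and keyboard found by polling
print(log)
```

The vectored form skips the search: the cost of finding the cause moves from software into the interrupt hardware.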


Execution Mode

• What if interrupt occurs while in interrupt handler?
  – Problem: could lose information for one interrupt: clearing interrupt #1 clears both #1 and #2
  – Solution: disable interrupts
• Disabling interrupts is a protected operation
  – Only the kernel can execute it
  – user vs. kernel mode
  – mode bit in CPU status register
• Other protected operations
  – installing interrupt handlers
  – manipulating CPU state (saving/restoring status registers)
• Changing modes
  – interrupts
  – system calls (syscall instruction)


A System Call (syscall)

[Figure: a user program (ld, add, st, TA 6, beq, ld, sub, bne) executes the trap instruction TA 6; control enters the kernel's trap handler, which invokes service routines and returns with RETT.]

• Special instruction to change modes and invoke service
  – read/write I/O device
  – create new process
• Invokes specific kernel routine based on argument
• Kernel-defined interface
• May return from trap to different process (e.g., context switch)
• RETT, instruction to return to user process


Interrupts/exceptions

• classifying interrupts
  – terminal (fatal) vs. restartable (control returned to program)
  – synchronous (internal) vs. asynchronous (external)
  – user vs. coerced
  – maskable (ignorable) vs. non-maskable
  – between instructions vs. within instruction


Precise Exceptions

“unobserved system can exist in any intermediate state, upon observation system collapses to well-defined state”

– 2nd postulate of quantum mechanics

• system = processor, observation = interrupt
• what is the “well-defined” state?
  – von Neumann: “sequential, instruction atomic execution”
  – precise state at interrupt
    » all instructions older than interrupt are complete
    » all instructions younger than interrupt haven’t started
• implies interrupts are taken in program order
• necessary for VM (why?), “highly recommended” by IEEE


Pipelining Complications

• Interrupts (Exceptions)
  – 5 instructions executing in 5-stage pipeline
  – How to stop the pipeline?
  – How to restart the pipeline?
  – Who caused the interrupt?

Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation


Pipelining Complications

• Simultaneous exceptions in > 1 pipeline stage
  – Load with data page fault in MEM stage
  – Add with instruction page fault in IF stage
• Solution #1
  – Interrupt status vector per instruction
  – Defer check until last stage, kill state update if exception
• Solution #2
  – Interrupt ASAP
  – Restart everything that is incomplete
• Another advantage for state update late in pipeline!
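Solution #1 can be sketched as follows: each in-flight instruction carries a status vector that stages append to, and the vector is examined only when the instruction reaches the last stage, oldest instruction first. The `InFlight` record and the exception strings are illustrative assumptions, not the slides' notation.

```python
from dataclasses import dataclass, field

@dataclass
class InFlight:
    """Hypothetical in-flight instruction with its interrupt status vector."""
    name: str
    status: list = field(default_factory=list)  # exceptions noted by earlier stages

def retire_in_order(instrs):
    """Check status vectors at the last stage, in program order.

    Returns (names that committed, (faulting name, cause) or None).
    Instructions younger than the faulting one never update state.
    """
    committed = []
    for ins in instrs:                 # oldest first
        if ins.status:
            return committed, (ins.name, ins.status[0])
        committed.append(ins.name)
    return committed, None

# The load is older, but its fault is detected late (MEM);
# the add is younger, but its fault is detected early (IF).
pipeline = [
    InFlight("sub"),
    InFlight("load", ["data page fault (MEM)"]),
    InFlight("add", ["instruction page fault (IF)"]),
]
committed, fault = retire_in_order(pipeline)
print(committed, fault)
```

Even though the add's fault may be detected in an earlier clock cycle, deferring the check means the load's fault is the one reported, which keeps exceptions in program order.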


Interrupts/Exceptions are Nasty

• odd bits of state must be precise (e.g., condition codes)
• delayed branches
  – what if instruction in delay slot takes an interrupt?
• out-of-order writes (e.g., autoinc, multicycle ops)
  – must undo write (e.g., future-file, history-file)
• some machines had precise interrupts only in integer pipe
  – sufficient for implementing VM (e.g., VAX/Alpha)
• Lucky for us, there’s a nice, clean way to handle precise state
  – We’ll see how this is done in a couple of lectures ...


Pipelining x86

• The x86 ISA has some really nasty instructions - how did Intel ever figure out how to build a pipelined x86 microprocessor?

• Solution: at runtime, “crack” x86 instructions (macro-ops) into RISC-like micro-ops
  – First used in P6 (Pentium Pro)
  – Used in all subsequent x86 processors, including those from AMD

• What are the potential challenges for implementing this solution?


Where are We

• principles of pipelining
  – pipeline depth: clock rate vs. number of stalls (CPI)
• hazards
  – structural
  – data (RAW, WAR, WAW)
  – control
• branch prediction
• multi-cycle operations
  – structural hazards, WAW hazards
• interrupts
  – precise state
• Next up: CPI < 1


Getting CPI < 1: Issuing Multiple Instructions/Cycle

• “Flynn bottleneck”
  – single issue performance limit is CPI = IPC = 1
  – hazards + overhead → CPI >= 1 (IPC <= 1)
• diminishing returns from deep pipelines
• solution: issue multiple instructions per cycle
• Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler (statically scheduled) or by HW (Tomasulo; dynamically scheduled)
  – First superscalar: IBM America → RS6000 → Power1
  – Pentium4, IBM PowerPC, Sun SuperSparc, DEC Alpha, HP PA-8000


Base Implementation

• statically scheduled (in-order) superscalar
  – executes unmodified sequential programs
  – figures out on its own what can be done in parallel
  – e.g., Sun UltraSPARC, Alpha 21164
  – we’ll start with this one
  – What has to change from single issue to multiple issue?


CPI < 1: Issuing Multiple Instructions/Cycle

• Ex. 2-way superscalar: 1 FP & 1 anything else
  – Fetch 64 bits/clock cycle; Int on left, FP on right
  – Can only issue 2nd instruction if 1st instruction issues
  – More ports for FP registers to do FP load & FP op in a pair

Type               Pipe stages
Int. instruction   IF   ID   EX   MEM  WB
FP instruction     IF   ID   EX   MEM  WB
Int. instruction        IF   ID   EX   MEM  WB
FP instruction          IF   ID   EX   MEM  WB
Int. instruction             IF   ID   EX   MEM  WB
FP instruction               IF   ID   EX   MEM  WB

• 1-cycle load delay expands to 3 instructions in SS
  – instruction in right half can’t use it, nor instructions in next slot


Implications of Superscalar

• what is involved in
  – fetching two instructions per cycle?
  – decoding two instructions per cycle?
  – executing two ALU operations per cycle?
  – accessing the data cache twice per cycle?
  – writing back two results per cycle?
• what about 4 or 8 instructions per cycle?

[Figure: pipeline datapath with PC, branch predictor (BP), I$, register file, and D$; stages F, D, X, M, W separated by latches F/D, D/X, X/M, M/W.]


Wide Fetch

• Fetch N instructions per cycle
• if instructions are sequential...
  – and on same cache line → nothing really
  – and on different cache lines → banked I$ + combining network
• if instructions are not sequential...
  – more difficult
  – two serial I$ accesses (access1 → predict target → access2)? no
• note: embedded branches OK as long as predicted NT
  – serial access + prediction in parallel
  – if prediction is T, discard serial part after branch
• Trace Cache…
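The sequential-fetch case can be sketched as a function that forms one fetch group per cycle. This sketch assumes the simplest hardware from the slide (a single I$ line per cycle, no banking), 4-byte instructions, a 32-byte line, and a hypothetical `predicted_target` map from the PC of each predicted-taken branch to its predicted target.

```python
def fetch_group(pc, width, predicted_target, line_bytes=32, instr_bytes=4):
    """Return (PCs fetched this cycle, PC to fetch from next cycle).

    Fetch stops at the end of the cache line, at `width` instructions,
    or right after the first branch that is predicted taken.
    """
    group = []
    line_end = (pc // line_bytes + 1) * line_bytes
    next_pc = pc
    while len(group) < width and next_pc < line_end:
        group.append(next_pc)
        if next_pc in predicted_target:
            # Predicted taken: discard the sequential part after the branch.
            return group, predicted_target[next_pc]
        next_pc += instr_bytes
    return group, next_pc

# No taken branches: a full group of four sequential instructions.
full = fetch_group(0x100, 4, {})
# Branch at 0x104 predicted taken: only two useful instructions this cycle.
cut = fetch_group(0x100, 4, {0x104: 0x200})
# Starting near the end of a line: the group is cut at the line boundary.
edge = fetch_group(0x118, 4, {})
print(full, cut, edge)
```

The short groups in the second and third cases are the lost fetch bandwidth that banked caches and trace caches try to recover.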


Wide Decode

• Decode N instructions per cycle
• actually decoding instructions?
  – easy if fixed-length instructions (multiple decoders)
  – harder (but possible) if variable length
• reading input register values?
  – 2N register read ports (register file read latency ~2N)
  – actually less than 2N, since most values come from bypasses
• what about the stall logic to enforce RAW dependences?


N² Dependence Check Logic

• remember stall logic for single issue pipeline
  – rs1(D) == rd(D/X) || rs1(D) == rd(X/M) || rs1(D) == rd(M/W)
  – same for rs2(D)
  – full bypassing reduces this to rs1(D) == rd(D/X) && op(D/X) == LOAD
• doubling issue width (N) quadruples stall logic!
  – not only 2 instructions in D, but two instructions in every stage
  – (rs1(D1) == rd(D/X1) && op(D/X1) == LOAD)
  – (rs1(D1) == rd(D/X2) && op(D/X2) == LOAD)
  – repeat for rs1(D2), rs2(D1), rs2(D2)
  – also check dependence of 2nd instruction on 1st: rs1(D2) == rd(D1)
• “N² dependence cross-check”
  – for N-wide pipeline, stall (and bypass) circuits grow as N²
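The stall conditions above generalize to any width by looping over every source in the decode group. The sketch below assumes full bypassing (so only load-use and intra-group dependences stall), two source registers per instruction, and tuple encodings invented for the example; `comparator_count` simply counts the equality checks these loops perform.

```python
def first_stalled(decode_group, dx_group):
    """Index of the oldest instruction in D that cannot issue, or None.

    decode_group: list of (rd, rs1, rs2), oldest first.
    dx_group:     list of (rd, op) for the instructions in D/X.
    """
    for i, (rd, rs1, rs2) in enumerate(decode_group):
        for src in (rs1, rs2):
            # Load-use check against every instruction in D/X.
            if any(op == "LOAD" and src == dx_rd for dx_rd, op in dx_group):
                return i
            # Cross-check against older instructions in the same group.
            if any(src == older[0] for older in decode_group[:i]):
                return i
    return None

def comparator_count(n):
    """Register comparisons needed for an n-wide pipeline under these assumptions."""
    load_use = 2 * n * n           # 2 sources x n in D x n in D/X
    cross = 2 * (n * (n - 1) // 2)  # 2 sources x ordered pairs within the group
    return load_use + cross

pair = [("r3", "r1", "r2"), ("r5", "r3", "r4")]   # 2nd instruction reads r3 from 1st
print(first_stalled(pair, []))
print(first_stalled(pair, [("r1", "LOAD"), ("r9", "ADD")]))
print([comparator_count(n) for n in (1, 2, 4)])
```

Each doubling of the width roughly quadruples the comparison count, which is the N² growth the slide refers to.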


Superscalar Stalls

• invariant: stalls propagate upstream to younger instructions
• what if older instruction in issue “pair” (inst0) stalls?
  – younger instruction (inst1) stalls too, cannot pass it
• what if younger instruction (inst1) stalls?
  – can older instruction from next group (inst2) move up?
• Rigid pipeline: No
• Fluid pipeline: Yes


Wide Execute

• What does it take to execute N instructions per cycle?
• multiple execution units... N of every kind?
  – N ALUs? OK, ALUs are small
  – N FP dividers? no, FP dividers are huge (and fdiv is uncommon)
• typically have some mix (proportional to instruction mix)
  – RS/6000: 1 ALU/memory/branch + 1 FP
  – Pentium: 1 any + 1 ALU
  – Pentium II: 1 ALU/FP + 1 ALU + 1 load + 1 store + 1 branch
  – Alpha 21164: 1 ALU/FP/branch + 2 ALU + 1 load/store


N² Bypass

• N² bypass logic... OK
  – only 5-bit quantities
  – compare to generate 1-bit outcomes
  – similar to stall logic
• N² bypass buses... not even close to OK
  – 32-bit or 64-bit quantities
  – broadcast, route, and multiplex (mux)
  – difficult to lay out and route all the wires
  – wide (SLOW) muxes
• big design problem today


One Solution to N² Bypass: Clustering

• group functional units into clusters
  – full bypass within cluster
  – no bypass between clusters
  – ~(N/k) inputs at each mux
  – ~(N/k)² routed buses in each cluster
• steer instructions to different clusters
  – dependent instructions to same cluster
  – exploit intra-cluster bypass
  – static or dynamic steering is possible
• e.g., Alpha 21264
  – 4-wide, 300MHz
  – full bypass didn’t fit into 1 clock cycle
  – 2 clusters with full intra-cluster bypass
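The savings from clustering follow directly from the (N/k) and (N/k)² estimates. The helper below just evaluates those approximations for N functional units split evenly into k clusters; the function name, the dictionary keys, and the total-bus figure (k times the per-cluster count) are assumptions for illustration.

```python
def bypass_cost(n, k=1):
    """Approximate bypass cost for n units split evenly into k clusters."""
    per_cluster = n // k
    return {
        "mux_inputs": per_cluster,                 # ~(N/k) inputs at each mux
        "buses_per_cluster": per_cluster ** 2,     # ~(N/k)^2 routed buses
        "total_buses": k * per_cluster ** 2,
    }

unclustered = bypass_cost(4)       # 4-wide, full bypass
clustered = bypass_cost(4, k=2)    # 4-wide, 2 clusters
print(unclustered)
print(clustered)
```

For a 4-wide machine, two clusters halve the mux width and cut the buses per cluster from about 16 to about 4, at the price of losing bypasses between clusters.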


Wide Memory Access

• what is involved in accessing memory for multiple instructions per cycle?

• multi-banked D$
  – requires bank assignment and conflict-detection logic
• (rough) instruction mix: 20% loads, 15% stores
  – for width N, we need about 0.2*N load ports, 0.15*N store ports
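As a quick worked example of the port estimate, the helper below applies the rough 20% / 15% mix to a given width. Rounding up to whole ports is an assumption added here (the slide only says "about"), and integer percentages are used to avoid floating-point rounding surprises.

```python
def ports_needed(width, load_pct=20, store_pct=15):
    """Whole load/store ports for a given issue width, rounding up."""
    def ceil_div(a, b):
        return -(-a // b)
    return ceil_div(load_pct * width, 100), ceil_div(store_pct * width, 100)

for n in (2, 4, 8):
    print(n, ports_needed(n))
```

Under these assumptions a 4-wide machine gets by with one load port and one store port, while an 8-wide machine needs two of each.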


Wide Writeback

• what is involved in writing back multiple instructions per cycle?

• nothing too special, just another port on the register file
  – everything else is taken care of earlier in pipeline
• adding ports isn’t free, though
  – increases area
  – increases access latency


Multiple Issue Summary

• superscalar problem spots
  – fetch, branch prediction → trace cache?
  – decode (N² dependence cross-check)
  – execute (N² bypass) → clustering?


Can we do better?

• Problem: Stall in ID stage if any data hazard.
• Your task: Teams of two, propose a design to eliminate these stalls.

MULD F2, F3, F4    Long latency…
ADDD F1, F2, F3
ADDD F3, F4, F5
ADDD F1, F4, F5


Next Time

• Dynamic Scheduling
• Read papers
• HW #2 assigned