
Computer Architecture

•  "Architecture"

- The art and science of designing and constructing buildings
- A style and method of design and construction
- Design, the way components fit together

•  Computer Architecture

- The overall design or structure of a computer system, including the hardware and the software required to run it, especially the internal structure of the microprocessor

Prerequisites

•  Computer organization

- Digital logic
- Memory chips, number representation
- Computer arithmetic, adders, ripple-carry...
- I/O organization
- Peripherals
- Pipelining, RISC

Course Contents

•  Performance and CPI, benchmarks, Amdahl's law
•  Pipelining, hazards
•  Instruction Level Parallelism: Scoreboarding, Tomasulo's algorithm
•  Dynamic branch prediction, VLIW, software pipelining
•  Cache and memory systems
•  I/O systems, RAID, benchmarks
•  Multiprocessors, cache consistency protocols
•  Processor networks
•  Vector processors

Course References 

•  "Computer Architecture: A Quantitative Approach" ,    2 nd edition, David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers•  CS252, Graduate Computer Architecture, U.C.Berkeley

Computer Architecture

•  Design aspects:


- Instruction set
- Cache and memory hierarchy
- I/O, storage, disk
- Multi-processors, networked systems

•  Criteria: performance, cost, end-applications, complexity

Technology Trends

•  Since the 1970s: microprocessor-based
•  Several PCs/workstations put together can buy more cycles for the same cost

- The Berkeley NOW project

•  Transistor density: 50% per year
•  DRAM density: 60% per year
•  Magnetic disk density: 50% per year
•  Software:

- More memory usage
- High-level language

•  Growth rate in CPU speed: 50% per year

- Architectural ideas: pipelining, caching, out of order execution, sophisticated compilers

•  Trends are important:

- Product cycle is 4 years!
- Also beware of technology thresholds

Cost Trends

•  Cost depends on various factors:

- Time, volume, competition

•  Cost of IC:

- Cost of die + Testing + Packaging

•  Cost of die: Wafer-cost / Dies-per-wafer
•  Yield is an important factor
•  Cost proportional to Die-area^4
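A rough C sketch of these cost relationships (the wafer cost, wafer area, and yield figures below are assumptions for illustration, not numbers from the course):

#include <stdio.h>

/* Hypothetical wafer parameters, for illustration only. */
#define WAFER_COST 2000.0   /* dollars per wafer (assumed) */
#define WAFER_AREA 70000.0  /* mm^2 (assumed)              */

/* Cost of die = wafer cost / (dies per wafer x yield). Since
   dies-per-wafer ~ 1/area and yield falls steeply with area,
   die cost grows roughly as die-area^4, as stated above. */
double die_cost(double die_area_mm2, double yield)
{
    double dies_per_wafer = WAFER_AREA / die_area_mm2; /* ignores edge loss */
    return WAFER_COST / (dies_per_wafer * yield);
}

int main(void)
{
    /* Doubling the die area with a correspondingly lower yield makes
       each die far more than twice as expensive. */
    printf("100 mm^2, 50%% yield: $%.2f\n", die_cost(100.0, 0.50));
    printf("200 mm^2, 20%% yield: $%.2f\n", die_cost(200.0, 0.20));
    return 0;
}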

Upcoming Topics


•  Performance metrics, CPI
•  Amdahl's law

Performance Comparison 

•  What performance metric to use?

- User cares about response time
- Performance is inversely proportional to execution time

•  What is execution time?

- Response time
- CPU time: User time + System time

•  System performance vs. CPU performance

- Throughput vs. response-time

•  We will focus on CPU performance

Which Program's Execution Time?

•  Real "workload" is ideal•  Practical options:

- Real programs: compilers, office-suite, scientific...- Kernels: key pieces of programs

•  Example: Livermore loops

- Toy benchmarks: small programs

•  Examples: Quick-sort, tower of Hanoi...

- Synthetic benchmarks: try to capture "average" frequency of instructions in real programs

•  Example: Whetstone, Dhrystone

More on Performance Comparisons... 

•  Caveat of benchmarks

- They are needed
- But manufacturers tend to optimize for benchmarks
- Need to be updated periodically


•  Benchmark suite: collection of programs

- E.g. SPEC92

•  Reporting performance

- Reproducibility: program version, compiler, flags
- SPEC specifies compiler flags for baseline comparison

Some Numerics...

                    Computer A   Computer B   Computer C
Program P1 (secs)        1           10           20
Program P2 (secs)     1000          100           20
Total (secs)          1001          110           40

- Total (or average) execution time is a possible metric
- Weighted execution time is better

Normalizing the Performance

             Normalized to A        Normalized to B        Normalized to C
              A     B      C         A     B     C          A      B     C
P1            1    10     20        0.1    1     2         0.05   0.5    1
P2            1    0.1    0.02      10     1     0.2       50     5      1
Arith. mean   1    5.05   10.01     5.05   1     1.1       25.03  2.75   1

•  Normalize such that all programs take the same time, on some machine
•  Arithmetic mean predicts performance
•  Geometric mean?
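To see why the "Geometric mean?" question matters, the table can be recomputed in a few lines of C, using the execution times from the earlier table:

#include <math.h>
#include <stdio.h>

/* Execution times (secs) of P1 and P2 on computers A, B, C. */
static const double t[3][2] = {
    {1.0, 1000.0},  /* A */
    {10.0, 100.0},  /* B */
    {20.0, 20.0},   /* C */
};

int main(void)
{
    const char *name = "ABC";
    for (int ref = 0; ref < 3; ref++) {       /* machine normalized to */
        for (int m = 0; m < 3; m++) {         /* machine being rated   */
            double r1 = t[m][0] / t[ref][0];
            double r2 = t[m][1] / t[ref][1];
            printf("norm to %c, machine %c: AM = %6.2f, GM = %6.2f\n",
                   name[ref], name[m], (r1 + r2) / 2.0, sqrt(r1 * r2));
        }
    }
    /* The arithmetic-mean ranking changes with the reference machine;
       the geometric-mean ranking does not. */
    return 0;
}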

Summary

•  Performance inversely proportional to execution-time

- We are concerned with CPU time of unloaded machine

•  Weighted execution time with weights from a real workload is ideal
•  Else, normalize w.r.t. one machine

Amdahl's Law 


•  Amdahl's law:

- Diminishing returns
- Limit on overall speedup

•  Corollary: make the common case fast

Next Lecture 

•  CPI as a measure of performance
•  Illustration of Amdahl's law

Amdahl's Law

•  Amdahl's law: if a fraction F of the execution time is sped up by a factor S,

Overall speedup = 1 / ((1 - F) + F/S)

- Diminishing returns
- Limit on overall speedup (at most 1/(1 - F))

•  Corollary: make the common case fast

Illustrating Amdahl's Law 

•  Example: implement cache, or faster ALU?

- Cache improves performance by 10x
- ALU improves performance by 3x

•  Depends on fraction of instructions

- Suppose F_mem = 0.2, F_alu = 0.5, F_other = 0.3

Speedup with cache = 1 / ((1 - 0.2) + 0.2/10) ≈ 1.22


Speedup with faster ALU = 1 / ((1 - 0.5) + 0.5/3) = 1.5

Example continued... 

•  Fixing F_alu = 0.5, for what value of F_mem is adding a cache better? (See the sketch below.)
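A quick way to answer this is to evaluate Amdahl's law directly; this minimal C sketch just codes the formula from the earlier slide and sweeps F_mem:

#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f is sped up by s. */
static double speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    printf("cache (F_mem = 0.2, 10x): %.3f\n", speedup(0.2, 10.0)); /* ~1.22 */
    printf("ALU   (F_alu = 0.5,  3x): %.3f\n", speedup(0.5, 3.0));  /*  1.50 */

    /* Sweep F_mem to find where the cache starts to win over the ALU. */
    for (double f = 0.0; f <= 1.0; f += 0.01)
        if (speedup(f, 10.0) > speedup(0.5, 3.0)) {
            printf("cache wins once F_mem > ~%.2f\n", f);  /* just over 0.37 */
            break;
        }
    return 0;
}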

The CPU Performance Equation 

CPU time = Num. clock cycles x Clock cycle time

or

CPU time = Num. clock cycles / Clock rate

For a program:

Num. clock cycles = Instruction Count x Cycles Per Instruction = IC x CPI

Putting these together:

CPU time = IC x CPI x Cycle time
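The equation drops straight into code. The numbers below are hypothetical, purely to show the units working out:

#include <stdio.h>

/* CPU performance equation: CPU time = IC x CPI x cycle time. */
double cpu_time_secs(double ic, double cpi, double cycle_ns)
{
    return ic * cpi * cycle_ns * 1e-9;
}

int main(void)
{
    /* Assumed: a 10^9-instruction program, CPI 2.0, 2 ns clock (500 MHz). */
    printf("CPU time = %.2f s\n", cpu_time_secs(1e9, 2.0, 2.0)); /* 4.00 s */
    return 0;
}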

More on the Equation 

•  This form is convenient

- Involves many relevant parameters

•  Remembering is easy


•  With CPI as the independent variable

Other Convenient Forms of the Equation

•  Number of clock cycles can be counted per instruction class: Num. clock cycles = sum over classes i of (IC_i x CPI_i)

•  Calculating in terms of instruction frequencies: CPI = sum over classes i of (F_i x CPI_i), where F_i = IC_i / IC

Usefulness of the Equation 

•  The per-class counts IC_i are easier to measure than total clock cycles

- Equivalently, overall CPI is measured through the class frequencies F_i and per-class CPI_i

•  Equation includes relevant parameters such as the cycle time 

Announcements 

•  Course web-page is up

   http://web.cse.iitk.ac.in/~cs422/index.html

•  Lecture scribe notes:

- HTML please
- Lec-notesXY-1.html or lec-notesXY-2.html
- Images in directory "images/"


•  lecXY-1-anything.ext or lecXY-2-anything.ext

- Please email to one of the TAs

Instruction Set

   • Interface design

           - Central part of any system design

           - Allows abstraction/independence

           - Challenges:

•  Should be easy to use by the layer above
•  Should allow efficient implementation by the layer below

Instruction Set Architecture (ISA) 

•  Main focus of early designs (1970s, 1980s)
•  Mutual dependence between ISA design and:

- Machine organization (example: caches)
- Higher-level languages and compilers (what instructions do they want?)
- Operating systems

•  Example: atomic instructions, paging...

The Design Space


      Other design choices: determining branch conditions, instruction encoding 

Classes of ISAs

 

GPR Advantages 

•  Registers faster than memory
•  Code density improves
•  Easier for compiler to use

- Hold variables
- Expression evaluation
- Passing arguments


Spectrum of GPR Choices

•  Choices based on

- How many memory operands allowed
- How many total operands

Number of memory addresses   Max. operands allowed   Examples
            0                          3             SPARC, MIPS, PowerPC
            1                          2             80x86, Motorola
            2                          2             VAX

Memory Addressing

•  Little-Endian versus Big-Endian

•  Aligned versus nonaligned access of memory units > 1 byte

- Misaligned ==> more memory cycles for the access

Addressing Modes

Addressing mode                Example              Meaning
Immediate                      Add R4, #3           R4 <-- R4 + 3
Register                       Add R4, R3           R4 <-- R4 + R3
Direct or absolute             Add R1, (1001)       R1 <-- R1 + M[1001]
Register deferred or indirect  Add R4, (R1)         R4 <-- R4 + M[R1]
Displacement                   Add R4, 100(R1)      R4 <-- R4 + M[100 + R1]
Indexed                        Add R3, (R1 + R2)    R3 <-- R3 + M[R1 + R2]
Auto-increment                 Add R1, (R2)+        R1 <-- R1 + M[R2]; R2 <-- R2 + d
Auto-decrement                 Add R1, -(R2)        R2 <-- R2 - d; R1 <-- R1 + M[R2]
Scaled                         Add R1, 100(R2)[R3]  R1 <-- R1 + M[100 + R2 + R3*d]


Usage of Addressing Modes

 

How many Bits for Displacement?

How many Bits for Immediate?


Type and Size of Operands

Summary so far 

•  GPR is better than stack/accumulator
•  Immediate and displacement are the most used memory addressing modes
•  Number of bits for displacement: 12-16 bits
•  Number of bits for immediate: 8-16 bits
•  Next: what operations in the instruction set?


Deciding the Set of Operations 

80x86 instruction    Integer average
Load                 22%
Conditional branch   20%
Compare              16%
Store                12%
Add                   8%
And                   6%
Sub                   5%
Move reg-reg          4%
Call                  1%

Simple instructions are used most!

Instructions for Control Flow

Design Issues for Control Flow Instructions

•  PC-relative addressing

- Useful since most jumps/branches are nearby
- Gives position independence (dynamic linking)

•  Register indirect jumps

- Useful for many programming language features
- Case statements, virtual functions, dynamic libraries


•  How many bits for PC displacement?

- 8-10 bits are enough

What is the Nature of Compares?

Compare and Branch: Single Instruction or Two? 

•  Condition Code: set by ALU

- Advantage: simple, may be free
- Disadvantage: extra state across instructions

•  Condition register: test any register with result of comparison

- Advantage: simple
- Disadvantage: uses up a register

•  Compare and branch:

- Advantage: fewer instructions
- Disadvantage: too much work in an instruction

Managing Register State during Call/Return 

•  Caller save, or callee save?

- Combination of the two is possible


•  Beware of global variables in registers!

Instruction Encoding Issues 

•  Need to encode: operation, and addressing mode of each operand

- Opcode is used for encoding the operation
- Simple set of addressing modes ==> can encode the addressing mode also in the opcode
- Else, need an address specifier per operand!

•  Challenges in encoding:

- Many registers and addressing modes
- But, also minimize average instruction size
- Encoding should be easy to handle in implementation (e.g. multiple of bytes)

Styles of Encoding

Fixed (e.g. DLX, MIPS, PowerPC):

Opcode | Address-1 | Address-2 | Address-3

Variable (e.g. VAX):

Opcode, #operands | Addr. Spec-1 | Address-1 | Addr. Spec-2 | Address-2 | ...

Fixed: (+) ease of decoding, (-) more instructions
Variable: (+) fewer instructions, (-) variance in amount of work per instruction (example: Intel 80x86)
Hybrid approach: variable in size, but provide multiple fixed encoding lengths, to reduce the variance

The Role of the Compiler

•  Compilers are central to ISA design

DLX  

•  DLX pronounced "Deluxe"
•  Has the features of many recent experimental and commercial machines


•  (AMD 29K + DECstation 3100 + HP 850 + IBM 801 + Intel i860 + MIPS M/120A + MIPS M/1000 + Motorola 88K + RISC I + SGI 4D/60 + SPARCstation-1 + Sun-4/110 + Sun-4/260) / 13 = 560 = DLX (in Roman numerals)
•  Good architectural features (e.g. simplicity), easy to understand

DLX Architecture: Registers and Data Types

•  Has 32 32-bit GPRs: R0...R31
•  Also, FP registers:

- 32 single precision: F0...F31
- Or, 16 double precision: F0, F2, ..., F30

•  Value of R0 is always ZERO!
•  Data types:

- Integer: bytes, half-words, words
- FP: single/double precision

DLX Memory Addressing 

•  Uses 32-bit, big-endian mode
•  Addressing modes:

- Only immediate and displacement, with 16-bit fields

•  Register deferred?

- Place zero in the displacement field

•  Absolute?

- Use R0 for the register

DLX Instruction Format 

I-type instruction: loads, stores, all immediates, conditional branch, jump register, jump and link register

Opcode (6) | RS1 (5) | RD (5) | Immediate (16)

R-type instruction: register-register ALU operations

Opcode (6) | RS1 (5) | RS2 (5) | RD (5) | Func (11)

J-type instruction: jump, jump and link, trap and return

Opcode (6) | Offset relative to PC (26)
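Since the fields sit at fixed positions, decoding is just shifts and masks. A minimal C sketch of I-type field extraction; the exact bit ordering (opcode in the top 6 bits) is an assumption of this sketch, only the field widths come from the slide:

#include <stdint.h>
#include <stdio.h>

/* Extract bits hi..lo (inclusive) of a 32-bit instruction word. */
static uint32_t bits(uint32_t w, int hi, int lo)
{
    return (w >> lo) & ((1u << (hi - lo + 1)) - 1);
}

int main(void)
{
    uint32_t instr = 0x8C220004u;  /* made-up example word */

    uint32_t opcode = bits(instr, 31, 26);          /* 6 bits           */
    uint32_t rs1    = bits(instr, 25, 21);          /* 5 bits           */
    uint32_t rd     = bits(instr, 20, 16);          /* 5 bits (I-type)  */
    int32_t  imm    = (int16_t)bits(instr, 15, 0);  /* sign-extended 16 */

    printf("opcode=%u rs1=%u rd=%u imm=%d\n", opcode, rs1, rd, imm);
    return 0;
}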

DLX Operations 

•  Four classes: load/store, ALU, branch, FP
•  ALU instructions are register-register
•  R0 used to synthesize some operations:

- Examples: loading a constant, reg-reg move

•  Compares "set" a register
•  Jump and link places the next PC in R31
•  FP operations in single/double precision
•  FP compares set a bit in a special status register
•  FP unit also used for integer multiply/divide!

DLX Performance: MIPS vs VAX

Pipelining 

•  It's natural!
•  Laundry example... (Randy Katz's slides)
•  DLX has a simple architecture

- Easy to pipeline

•  Pipelining speedup:

- Can be viewed as reduction in CPI
- Or, reduction in clock cycle


•  Defining clock cycle as the amount of time between two successive instruction     completions 

A Simple DLX Implementation 

•  Instruction Fetch (IF) cycle:

- IR <-- M[PC]
- NPC <-- PC + 4

•  Instruction Decode (ID) cycle:

- Done in parallel with register read (fixed-field decode)
- Register/Immediate read:

•  A <-- R[IR6..10]
•  B <-- R[IR11..15]
•  Imm <-- sign-extend(IR16..31)

•  Execution/effective address (EX) cycle:

- Memory reference:

•  ALUOutput <-- A + Imm

- Register-register ALU instruction:

•  ALUOutput <-- A func B

- Register-immediate ALU instruction:

•  ALUOutput <-- A op Imm

- Branch:

•  ALUOutput <-- NPC + Imm
•  Cond <-- A op 0 [op is one of == or !=]

•  Memory access/branch completion (MEM) cycle:

- Memory access:

•  LMD <-- M[ALUOutput]
•  Or, M[ALUOutput] <-- B

- Branch: PC = (cond)? ALUOutput : NPC


•  Write-back (WB) cycle:

- Reg-reg ALU opn: R[IR16..20] <-- ALUOutput
- Reg-imm ALU opn: R[IR11..15] <-- ALUOutput
- Load instruction: R[IR11..15] <-- LMD

The DLX Data-path

Further lectures... 

•  Pipelining this data-path
•  Pipelining issues


 

ISA Design to Help the Compiler 

•  Regularity: operations, data-types, and addressing modes should be orthogonal; no   special registers/operands for some instructions

•  Provide simple primitives: do not optimize for a particular compiler of a particular   language

•  Clear trade-offs among alternatives: how to allocate registers, when to unroll a loop...

What lies ahead...

•  The DLX architecture
•  DLX: simple data-path
•  DLX: pipelined data-path
•  Pipelining hazards, and how to handle them

 DLX Unpipelined Implementation 

•  Five cycles: IF, ID, EX, MEM, WB

- Branch and store instructions: 4 cycles only
- What is the CPI?

F_branch = 0.12, F_store = 0.05

CPI = 0.17 x 4 + 0.83 x 5 = 5 - 0.17 = 4.83

•  Further reduction in CPI (without pipelining)
•  ALU instructions can finish in 4 cycles too


F_ALU = 0.47; CPI = 4.83 - 0.47 = 4.36

Speedup = 4.83 / 4.36 = 1.1
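The same weighted-CPI arithmetic, in C, for checking:

#include <stdio.h>

int main(void)
{
    /* Branches (0.12) and stores (0.05) take 4 cycles; the rest take 5. */
    double f4 = 0.12 + 0.05;
    double cpi = f4 * 4.0 + (1.0 - f4) * 5.0;
    printf("CPI (branch/store in 4 cycles): %.2f\n", cpi);        /* 4.83 */

    /* Letting ALU instructions (0.47) finish in 4 cycles saves one
       cycle on each of them. */
    double cpi2 = cpi - 0.47;
    printf("CPI (ALU also in 4 cycles):     %.2f\n", cpi2);       /* 4.36 */
    printf("Speedup:                        %.2f\n", cpi / cpi2); /* ~1.1 */
    return 0;
}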

Some Remarks 

•  Any further reduction in CPI will likely increase cycle time
•  Some hardware redundancies can be eliminated:

- Use the ALU for the (PC+4) addition also
- Same I-cache and D-cache

•  These are minor improvements...

- An alternative single-cycle implementation:

•  Variation in amount of work ==> higher cycle time
•  Hardware unit reuse is not possible

The Basic Pipeline for DLX

       CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9
I      IF  ID  EX  MEM WB
I+1        IF  ID  EX  MEM WB
I+2            IF  ID  EX  MEM WB
I+3                IF  ID  EX  MEM WB
I+4                    IF  ID  EX  MEM WB

•  That is it?
•  Complications:

- Resource conflicts, register conflicts, branch instructions
- Exceptions, instruction set issues

The Pipelined Data-path


Some Performance Numerics...  

Unpipelined (multi-cycle) clock cycle = 10 ns

CPI_ALU = CPI_Branch = 4, CPI_Other = 5

F_ALU = 0.4, F_Branch = 0.2, F_Other = 0.4

Pipelined clock cycle = 11 ns

Speedup = (4.4 x 10 ns) / (1 x 11 ns) = 4

For a single-cycle impl.: T_IF = 10 ns, T_ID = 8 ns, T_EX = 10 ns, T_MEM = 10 ns, T_WB = 7 ns (so cycle time = 45 ns)

Speedup from the multi-cycle implementation = 45 / 44 ≈ 1.02
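In C, with the single-cycle comparison interpreted as "the clock must fit the slowest complete instruction" (my reading of the slide):

#include <stdio.h>

int main(void)
{
    /* Multi-cycle (unpipelined) implementation, 10 ns clock. */
    double cpi = 0.4 * 4 + 0.2 * 4 + 0.4 * 5;         /* 4.4            */
    double t_multi = cpi * 10.0;                      /* 44 ns / instr  */

    /* Pipelined implementation: CPI 1, 11 ns clock.  */
    double t_pipe = 1.0 * 11.0;

    /* Single-cycle implementation: cycle = sum of stage times. */
    double t_single = 10.0 + 8.0 + 10.0 + 10.0 + 7.0; /* 45 ns / instr  */

    printf("pipelined vs multi-cycle:    %.2f\n", t_multi / t_pipe);   /* 4.00 */
    printf("multi-cycle vs single-cycle: %.2f\n", t_single / t_multi); /* 1.02 */
    return 0;
}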

Pipeline Hazards 

•  Structural Hazards: resource conflict

- Example: same cache/memory for instruction and data


•  Data Hazards: same data item being accessed/written in nearby instructions

- Example:

•  ADD R1, R2, R3
•  SUB R4, R1, R5

•  Control Hazards: branch instructions 

Structural Hazards 

•  Usually happen when a unit is not fully pipelined

- That unit cannot churn out one instruction per cycle

•  Or, when a resource has not been duplicated enough

- Example: same I-cache and D-cache

- Example: single write-port for register-file

•  Usual solution: stall

- Also called pipeline bubble, or simply bubble

Stalling the Pipeline

       CC1 CC2 CC3 CC4   CC5 CC6 CC7 CC8 CC9 CC10
Load   IF  ID  EX  MEM   WB
I+1        IF  ID  EX    MEM WB
I+2            IF  ID    EX  MEM WB
I+3            stall     IF  ID  EX  MEM WB
I+4                          IF  ID  EX  MEM WB

•  What is the slowdown due to stalls caused by such load instructions?

CPI without stalls = 1

CPI with stalls = 1 + F_load

Slowdown = 1 + F_load

Why Allow Structural Hazards? 

•  Lower Cost:


- Less hardware ==> less cost

•  Shorter latency of unpipelined unit

- May have other performance benefits
- Data hazards may introduce stalls anyway!

•  Suppose the FP unit is unpipelined, and the other instructions have a 5-stage pipeline. What percentage of instructions can be FP, so that the CPI does not increase?

- 20% can be FP, assuming no clustering of FP instructions
- Even if clustered, data hazards may introduce stalls anyway

Data Hazards 

•  Example:

• ADD R1, R2, R3
• SUB R4, R1, R5
• AND R6, R1, R7
• OR  R8, R1, R9
• XOR R10, R1, R11

•  All instructions after ADD depend on R1
•  Stalling is a possibility

- Can we do better?

Register File: Reads after Writes


Minimizing Stalls via Forwarding

Data Forwarding for Stores

Data Hazard Classification 

•  Read after Write (RAW): use data forwarding to overcome

•  Write after Write (WAW): arises only when writes can happen in different pipeline stages

                CC1 CC2 CC3 CC4  CC5  CC6
LW R1, 0(R2)    IF  ID  EX  MEM1 MEM2 WB
ADD R1, R2, R3      IF  ID  EX   WB

- Has other problems as well: structural hazards

•  Write after Read (WAR): rare

                CC1 CC2 CC3 CC4  CC5  CC6
SW 0(R1), R2    IF  ID  EX  MEM1 MEM2 WB
ADD R2, R3, R4      IF  ID  EX   WB

Stalls due to Data Hazard


Avoiding such Stalls 

•  Compiler scheduling:

- Example: a = b + c ; d = e + f ;

LW  R1, b
LW  R2, c
LW  R10, e
ADD R4, R1, R2
LW  R11, f
SW  a, R4
ADD R12, R10, R11
SW  d, R12

•  Without such scheduling, what is the slowdown?

- 1 + F_(loads causing stalls)

Topics for Next Lecture 

•  Control hazards
•  Exceptions during the pipeline

- More difficult to deal with
- Cause more damage

Recall: Data Hazards 


•  Have to be detected dynamically, and the pipeline stalled if necessary
•  Instruction issue: the process of moving an instruction from the ID stage to EX
•  For DLX, all data hazards can be checked before instruction issue

- Also, control for data forwarding can be determined

- This is good since instruction is suspended before any machine state is updated

Opcode of ID/EX (ID/EX.IR0..5)   Opcode of IF/ID (IF/ID.IR0..5)          Check for interlock
Load                             Reg-reg ALU                             ID/EX.IR11..15 == IF/ID.IR6..10
Load                             Reg-reg ALU                             ID/EX.IR11..15 == IF/ID.IR11..15
Load                             Load, store, ALU immediate, or branch   ID/EX.IR11..15 == IF/ID.IR6..10
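The table translates into a small piece of comparison logic. A minimal C sketch; the struct layout and enum names are assumptions, only the register-field comparisons come from the table:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum op_class { OP_LOAD, OP_ALU_RR, OP_ALU_IMM, OP_STORE, OP_BRANCH };

/* Just the IR fields the table compares. */
struct latch { enum op_class op; uint8_t rs1, rs2, rd; };

/* True if the instruction in IF/ID must stall: the load in ID/EX is
   producing a register that it reads (the table's three cases). */
bool load_interlock(struct latch idex, struct latch ifid)
{
    if (idex.op != OP_LOAD)
        return false;
    if (ifid.op == OP_ALU_RR)                   /* reads rs1 and rs2    */
        return idex.rd == ifid.rs1 || idex.rd == ifid.rs2;
    return idex.rd == ifid.rs1;                 /* others read rs1 only */
}

int main(void)
{
    struct latch lw  = { OP_LOAD,   2, 0, 1 };  /* LW  R1, 0(R2)  */
    struct latch add = { OP_ALU_RR, 1, 5, 4 };  /* ADD R4, R1, R5 */
    printf("stall needed: %s\n", load_interlock(lw, add) ? "yes" : "no");
    return 0;
}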

Control Logic for Data-Forwarding 

•  Data forwarding always happens

- From ALU or data-memory output
- To ALU input, data-memory input, or zero-detection unit

•  Which registers to compare?

•  Compare the destination register field in EX/MEM and MEM/WB latches with the source register fields of IR in ID/EX and EX/MEM stages 

Control Hazard 

•  Result of branch instruction not known until end of MEM stage
•  Naïve solution: stall until the result of the branch instruction is known

- That an instruction is a branch is known at the end of its ID cycle
- Note: "IF" may have to be repeated

               CC1 CC2 CC3   CC4   CC5 CC6 CC7 CC8 CC9
Branch         IF  ID  EX    MEM   WB
Branch succ        IF  stall stall IF  ID  EX  MEM WB
Branch succ+1                          IF  ID  EX  MEM

Reducing the Branch Delay 


•  Three clock cycles wasted for every branch ==> significantly bad performance
•  Two things to speed up:

- Determine earlier whether the branch is taken
- Compute the target PC earlier

•  Both can be done one cycle earlier
•  But, beware of data hazards

Branch Behaviour of Programs 

•  Integer programs: 13% forward conditional, 3% backward conditional, 4% unconditional
•  FP programs: 7%, 2%, and 1% respectively
•  67% of branches are taken

- 60% of forward branches are taken
- 85% of backward branches are taken

Handling Control Hazards 

•  Stall: naïve solution
•  Predict untaken, or predict not-taken:

- Treat every branch as not taken
- Only slightly more complex
- Do not update machine state until the branch outcome is known
- Done by clearing the IF/ID register of the fetched instruction

Predict Untaken Scheme

                    CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8
I (untaken branch)  IF  ID  EX  MEM WB
I+1                     IF  ID  EX  MEM WB
I+2                         IF  ID  EX  MEM WB
I+3                             IF  ID  EX  MEM WB

                    CC1 CC2  CC3  CC4  CC5  CC6 CC7 CC8
I (taken branch)    IF  ID   EX   MEM  WB
I+1                     IF   noop noop noop noop
Target                       IF   ID   EX   MEM WB
Target+1                          IF   ID   EX  MEM WB
Target+2                               IF   ID  EX  MEM

More Ways to Reduce Control Hazard Delays 


•  Predict taken:

- Treat every branch as taken
- Not of any use in DLX, since the branch target is not known before the branch condition anyway

•  May be of use in other architectures

•  Delayed branch:

- Instruction(s) after the branch are executed anyway!
- Sequential successors are called branch-delay slots

Delayed Branch

EITHER               OR                 CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8
I (untaken branch)   I (taken branch)   IF  ID  EX  MEM WB
I+1 (branch delay)   I+1 (branch delay)     IF  ID  EX  MEM WB
I+2                  Target                     IF  ID  EX  MEM WB
I+3                  Target+1                       IF  ID  EX  MEM WB
I+4                  Target+2                           IF  ID  EX  MEM

•  DLX has one delay slot
•  Note: another branch instruction cannot be put in the delay slot
•  Compiler has to fill the delay slots

Filling the Delay-Slot: Option 1 of 3

•  Fill the slot from before the branch instruction
•  Restriction: the branch must not depend on the result of the filled instruction
•  Improves performance: always

Filling the Delay-Slot: Option 2 of 3


•  Fill the slot from the target of the branch instruction
•  Restriction: should be OK to execute the instruction even if not taken
•  Improves performance: when the branch is taken

Filling the Delay-Slot: Option 3 of 3

•  Fill the slot from the fall-through of the branch
•  Restriction: should be OK to execute the instruction even if taken
•  Improves performance: when the branch is not taken

Helping the Compiler

•  Encode the compiler prediction in the branch instruction

- CPU knows whether the branch was predicted taken or not taken by the compiler
- Cancel or nullify if the prediction is incorrect
- Known as a canceling or nullifying branch

•  Options 2 and 3 can now be used without restrictions

Static Branch Prediction 

•  Predict-taken
•  Predict-untaken


•  Prediction based on direction (forward/backward)
•  Profile-based prediction

Static Misprediction Rates

Some Remarks 

•  Delayed branches are architecturally visible

- Strength as well as weakness
- Advantage: better performance
- Disadvantage: what if the implementation changes?

•  Deeper pipeline ==> more branch delays ==> delay-slots may no longer be useful

- More powerful dynamic branch prediction

•  Note: need to remember an extra PC while taking exceptions/interrupts
•  Slowdown due to mispredictions: 1 + Branch frequency x Misprediction rate x Penalty

Further Issues in Pipelining 

•  Exceptions
•  Instruction set issues
•  Multi-cycle operations

Exceptions and Pipelining 


•  What are exceptions?

•  I/O interrupt
•  System call
•  Tracing instruction execution, breakpoint
•  Integer/FP anomaly
•  Page fault
•  Misaligned memory access
•  Memory protection violation
•  Undefined instruction
•  Hardware malfunction/power failure

•  Also called interrupts or faults

Exceptions: The Nemesis of Pipelining 

•  While taking exceptions, ensure that the machine is in a "consistent" state
•  Exceptions can occur:

- In many pipeline stages
- Out of order

      CC1 CC2 CC3 CC4 CC5 CC6
LW    IF  ID  EX  MEM WB
ADD       IF  ID  EX  MEM WB

 Classification of Exceptions 

•  Synchronous vs. Asynchronous

- Asynchronous usually caused by devices external to the processor
- Asynchronous ==> can be handled after the current instruction (easier)

•  User requested vs. Coerced

- User requested ==> can be handled after the current instruction
- Coerced ==> unpredictable

•  User maskable vs. non-maskable
•  Within vs. between instructions

- Within ==> instruction cannot be completed, usually synchronous (harder)

•  Resume vs. Terminate

- Terminate process ==> easier


Exception Classification

Exception type            Synchronous?  Coerced?  Maskable?  Within instn.?  Resume?
I/O request               No            Yes       No         No              Yes
Sys. call                 Yes           No        No         No              Yes
Tracing/brk. pt.          Yes           No        Yes        No              Yes
ALU excpn.                Yes           Yes       Yes        Yes             Yes
Page fault                Yes           Yes       No         Yes             Yes
Misaligned mem. access    Yes           Yes       Yes        Yes             Yes
Protecn. violn.           Yes           Yes       No         Yes             Yes
Undefined instns.         Yes           Yes       No         Yes             No
H/W malfn./power failure  No            Yes       No         Yes             No

Restarting Execution 

•  Restartable: take the exception, save state, restart without affecting execution
•  Restarting:

- Force a trap instruction into the pipeline
- Until the trap, disable all writes for the faulting instruction and all subsequent ones
- Trap into the exception handling routine (OS)
- Need to save more than one PC for delayed branches

•  Precise Exceptions: all instructions prior to faulting one completed, but not any other

Exceptions in DLX

      CC1 CC2 CC3 CC4 CC5 CC6
LW    IF  ID  EX  MEM WB
ADD       IF  ID  EX  MEM WB

•  Exceptions can occur:

- In same cycle, or even out-of-order

•  Cannot always handle an exception at the moment it occurs

- Carry the instruction status in the pipeline latches
- In the WB stage, the exception corresponding to the earliest instruction is handled

More Complications in Pipelining 


•  Multiple write stages
•  Or, changing processor state in the middle of an instruction

- E.g., Auto-increment addressing mode in VAX

•  Updating memory state during instruction

- E.g., String copy instruction in VAX

 •  Implicitly set condition codes

- Problems in scheduling the delay slot, and during exceptions

•  Self-modifying code in 80x86!
•  Multi-cycle operations

MOVL  R1, R2
ADDL3 42(R1), 56(R1)+, @(R1)
SUBL2 R2, R3
MOVC3 @(R1)(R2), 74(R2), R3

Data hazards very complicated to determine!
VAX pipelines micro-instructions

Pipelining Multi-cycle Opns. 

•  Some operations take > 1 cycle (e.g. FP)
•  Handling multi-cycle operations in the pipeline:

- Multiple EX stages
- Multiple functional units


•  Two things to consider:

- Different units may take different # cycles
- Some units may not be pipelined

•  Corresponding definitions:

- Latency: # cycles between an instruction and another which can use its result
- Initiation/repeat interval: # cycles between issue of two operations of the same type

The Multi-cycle Pipeline

Pipeline Timing: An Example

        CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11
MULTD   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM  WB
ADDD        IF  ID  A1  A2  A3  A4  MEM WB
LD              IF  ID  EX  MEM WB

•  Additional details:

- We require more latches
- The ID/EX register must be expanded

More Hazards! 

•  Structural hazards:


- Divide unit is not pipelined
- Multiple writes possible in the same cycle

•  Data hazards:

- RAW is more frequent
- WAW is possible

•  Control hazards:

- Out-of-order completion ==> difficulty in handling exceptions

Multiple Writes/Cycle: An Example

                 CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11
MULTD F0,F4,F6   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM  WB
...                  IF  ID  EX  MEM WB
...                      IF  ID  EX  MEM WB
ADDD F2,F4,F6                IF  ID  A1  A2  A3  A4  MEM  WB
...                              IF  ID  EX  MEM WB

Multiple Writes/Cycle: Solution 

•  Provide multiple write ports
•  Or, detect and stall; two possibilities:

- Detect in the ID stage:

•  Instruction reserves the write port using a reservation register
•  Reservation register is shifted one bit each clock

- Detect at entry to the MEM or WB stage:

•  Easier to check
•  Can also give priority to the longer-latency operation
•  But, the stall can now be in two places
•  Stall may trickle back

 Handling WAW Hazards 

•  Occurs only when the result of ADDD is overwritten without any instruction using it!

- Otherwise, RAW hazard stall would have occurred


•  Hazard can be detected in the ID stage of the latter instruction
•  Two ways to handle:

- Delay issue of the load until ADDD enters MEM
- Stamp out the result of ADDD

Control Hazard Complications

•  An example:

- DIVF F0, F2, F4   // Finishes last; causes exception
- ADDF F10, F10, F8 // Finishes first
- SUBF F12, F12, F14 // Finishes second

•  Out-of-order completion causes problems!

- Precise exceptions are difficult to implement

Achieving Precise Exceptions  

•  Approach 1: Ostrich algorithm

- Don't care
- Maybe provide a slower precise mode

•  Example: special instructions to check for FP exceptions

•  Approach 2: allow instruction issue to continue only if previous instructions will complete without exception

-  Stall to maintain precise exceptions

•  Approach 3: save state to undo

- Two possibilities

•  History file: keep track of the original values of registers
•  Future file: keep track of the current value; main register file updated after all previous instructions are done

- More buffer space required
- Hazard checks and control become very complex

•  Approach 4: imprecise, but keep enough state for OS to recover


- Keep track of incomplete instructions
- OS then runs those instructions before returning control
- Complicated to execute these instructions properly!

Next Topic...

  •  Instruction Level Parallelism (ILP)

 Instruction Level Parallelism 

•  Pipelining achieves Instruction Level Parallelism (ILP)

- Multiple instructions in parallel

•  But, problems with pipeline hazards

- CPI = Ideal CPI + stalls/instruction
- Stalls = Structural + Data (RAW/WAW/WAR) + Control

•  How to reduce stalls?

- That is, how to increase ILP?

Techniques for Improving ILP 

•  Loop unrolling
•  Basic pipeline scheduling
•  Dynamic scheduling, scoreboarding, register renaming
•  Dynamic memory disambiguation
•  Dynamic branch prediction
•  Multiple instruction issue per cycle

-  Software and hardware techniques

Loop-Level Parallelism 

•  Basic block: straight-line code w/o branches
•  Fraction of branches: 0.15
•  ILP is limited!

- Average basic-block size is 6-7 instructions
- And, these may be dependent

•  Hence, look for parallelism beyond a basic block
•  Loop-level parallelism is a simple example of this


Loop-Level Parallelism: An Example 

•  Consider the loop:

for(int i = 1000; i >= 1; i = i-1) {

x[i] = x[i] + C; // FP

}

- Each iteration of the loop is independent of other iterations
- Loop-level parallelism

•  To convert it into ILP:

- Loop unrolling (static, dynamic)
- Vector instructions

The Loop, in DLX 

•  In DLX, the loop looks like:

Loop: LD   F0, 0(R1)   // F0 is array element
      ADDD F4, F0, F2  // F2 has the scalar 'C'
      SD   0(R1), F4   // Store result
      SUBI R1, R1, 8   // For next iteration
      BNEZ R1, Loop    // More iterations?

•  Assume:

- R1 is the initial address
- F2 has the scalar value 'C'
- Lowest address in the array is '8'

How Many Cycles per Loop?

CC1 Loop: LD F0, 0(R1)

CC2 stall

CC3 ADDD F4, F0, F2

CC4 stall
CC5 stall

CC6 SD 0(R1), F4


CC7 SUBI R1, R1, 8

CC8 stall

CC9 BNEZ R1, Loop

CC10 stall

 

Reducing Stalls by Scheduling

CC1 Loop: LD   F0, 0(R1)
CC2       SUBI R1, R1, 8
CC3       ADDD F4, F0, F2
CC4       stall
CC5       BNEZ R1, Loop
CC6       SD   8(R1), F4

•  Realizing that SUBI and SD can be swapped is non-trivial!
•  Overhead versus actual work:

- 3 cycles of work, 3 cycles of overhead

Unrolling the Loop 

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4    // No SUBI, BNEZ
      LD   F6, -8(R1)   // Note diff FP reg, new offset
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1) // Note diff FP reg, new offset
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1) // Note diff FP reg, new offset
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop

How Many Cycles per Loop?

Loop: LD   F0, 0(R1)    // 1 stall
      ADDD F4, F0, F2   // 2 stalls
      SD   0(R1), F4
      LD   F6, -8(R1)   // 1 stall
      ADDD F8, F6, F2   // 2 stalls
      SD   -8(R1), F8
      LD   F10, -16(R1) // 1 stall
      ADDD F12, F10, F2 // 2 stalls
      SD   -16(R1), F12
      LD   F14, -24(R1) // 1 stall
      ADDD F16, F14, F2 // 2 stalls
      SD   -24(R1), F16
      SUBI R1, R1, 32   // 1 stall

28 cycles per unrolled loop == 7 cycles per original loop iteration

 Scheduling the Unrolled Loop

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12  // was -16(R1), before SUBI moved up
      BNEZ R1, Loop
      SD   8(R1), F16   // in the branch delay slot

14 cycles per unrolled loop == 3.5 cycles per original loop iteration

Observations and Requirements  

•  Gain from scheduling is even higher for the unrolled loop!

- More parallelism is exposed on unrolling

•  Need to know that 1000 is a multiple of 4
•  Requirements:

- Determine that the loop can be unrolled
- Use different registers to avoid conflicts
- Determine that SD can be moved after SUBI, and find the offset adjustment

•  Understand dependences

Dependences 


•  Dependent instructions ==> cannot be in parallel
•  Three kinds of dependences:

- Data dependence (RAW)
- Name dependence (WAW and WAR)
- Control dependence

•  Dependences are properties of programs
•  Stalls are properties of the pipeline
•  Two possibilities:

- Maintain dependence, but avoid stalls
- Eliminate dependence by code transformation

Data Dependence

Name Dependence 

•  Two instructions use the same register/memory (name), but there is no flow of data

- Anti-dependence: WAR hazard

- Output dependence: WAW hazard

•  Can do register renaming - statically, or dynamically


Name Dependence in our Example

ILP: Recall 

•  Improving ILP == reducing stalls
•  Loop unrolling enlarges the basic block

- More parallelism
- More opportunity for better scheduling

•  Dependences:

- Data dependence
- Name dependence
- Control dependence

Handling Control Dependence 

•  Control dependence need not be maintained
•  We need to maintain:

- Exception behaviour: do not cause new exceptions
- Data flow: ensure the right data item is used

•  Speculation and conditional instructions are techniques to get around control dependence

Loop Unrolling: a Relook 

•  Our example:

for(int i = 1000; i >= 1; i = i-1) {

x[i] = x[i] + C; // FP

}

•  Consider:

for(int i = 1000; i >= 1; i = i-1) {

A[i-1] = A[i] + C[i]; // S1

B[i-1] = B[i] + A[i-1]; // S2

}


- S2 is dependent on S1
- S1 is dependent on its previous iteration; same with S2

•  Loop-carried dependence ==> loop iterations have to be executed in order

Removing Loop-Carried Dependence 

•  Another example:

for (int i = 1000; i >= 1; i = i-1) {

A[i] = A[i] + B[i]; // S1

B[i-1] = C[i] + D[i]; // S2

}

•  S1 depends on the prior iteration of S2

- Can be removed (no cyclic dependence)

A[1000] = A[1000] + B[1000];

for(int i = 1000; i >= 2; i = i-1) {

B[i-1] = C[i] + D[i]; // S2

A[i-1] = A[i-1] + B[i-1]; // S1

}

B[0] = C[1] + D[1];

Static vs. Dynamic Scheduling 

•  Static scheduling: limitations

- Dependences may not be known at compile time
- Even if known, the compiler becomes complex
- Compiler has to have knowledge of the pipeline

•  Dynamic scheduling:

- Handles dynamic dependences
- Simpler compiler
- Efficient even if the code was compiled for a different pipeline


Dynamic Scheduling 

•  For now, we will focus on overcoming data hazards
•  The idea:

- DIVD F0, F2, F4
- ADDD F10, F0, F8
- SUBD F12, F8, F14

•  SUBD can proceed without waiting for DIVD 

CDC 6600: A Case Study 

•  IF stage: fetch instructions onto a queue
•  ID stage is split into two stages:

- Issue: decode and check for structural hazards
- Read operands: check for data hazards

•  Execution may begin, and may complete, out of order

- Complications in exception handling
- Ignore for now

•  What is the logic for data hazard checks? 

The CDC Scoreboard 

•  Out-of-order completion ==> WAR and WAW hazards possible
•  Scoreboard: a data structure for all hazard detection in the presence of out-of-order execution/completion
•  All instructions "consult" the scoreboard to detect hazards

The Scoreboard Solution 

•  Three components:

- Stages of the pipeline:

•  Issue (ID1), Read-operands (ID2), EX, WB

- Data structure (in hardware)
- Logic for hazard detection, stalling

Scoreboard Control & the Pipeline Stages 


•  Issue (ID1): decode, check if the functional unit is free, and if a previous instruction has the same destination register

- No such hazard ==> scoreboard issues to the appropriate functional unit

•  Note: structural/WAW hazards prevented by stalling here
•  Note: stall here ==> IF queue will grow

•  Read operands (ID2):

- An operand is available if no earlier instruction is going to write it, or if the register is being written currently
- RAW hazards are resolved here

•  Execute (EX):

- Functional units perform execution
- Scoreboard is notified on completion

•  Write-Back (WB):

- Check for WAR hazards

•  Stall on detection
•  Write back otherwise

Some Remarks 

•  WAW causes a stall in ID1; WAR causes a stall in WB
•  No forwarding logic

- Output written as soon as it is available (and no WAR hazard)

•  Structural hazard possible in register read/write

-  CDC has 16 functional units, and 4 buses

The Scoreboard Data-Structures 

•  Instruction status
•  Functional unit status
•  Register result status
•  Randy Katz's CS252 slides... (Lecture 10, Spring 1996)

- Scoreboard pipeline control
- A detailed example


Limitations of the Scoreboard 

•  Speedup of 1.7 for (compiled) FORTRAN, speedup of 2.5 for hand-coded assembly
•  Scoreboard works only within a basic block!
•  Some hazards still cause stalls:

- Structural
- WAR, WAW

 

Control Dependence 

•  An example:

T1;

if p1 {

S1;

}

•  Statement S1 is control-dependent on p1, but T1 is not
•  What this means for execution:


- S1 cannot be moved before p1
- T1 cannot be moved after p1

Control Dependence in our Example

Dynamic Scheduling 

•  Better than static scheduling
•  Scoreboarding:

- Used by the CDC 6600
- Useful only within a basic block
- WAW and WAR stalls

•  Tomasulo's algorithm:

- Used in the IBM 360/91 for the FP unit
- Main additional feature: register renaming, to avoid WAR and WAW stalls

Register Renaming: Basic Idea 

•  Compiler maps memory --> registers, statically
•  Register renaming maps registers --> virtual registers, in hardware, dynamically
•  Should keep track of this mapping

- Make sure to read the current value

•  Num. virtual registers > num. ISA registers, usually
•  Virtual registers are known as reservation stations in the IBM 360/91

Tomasulo: Main Architectural Features 

•  Reservation stations: fetch and buffer operands as soon as they are available
•  Load/store buffers: hold the address (and data, for stores) to be loaded/stored
•  Distributed hazard detection and execution control
•  Common Data Bus (CDB): results passed from where generated to where needed
•  Note: the IBM 360/91 also had reg-mem instructions

The Tomasulo Architecture


Pipeline Stages 

•  Issue:

- Wait for a free Reservation Station (RS) or load/store buffer, and place the instruction there
- Rename registers in the process (WAR and WAW handled here)

•  Execute (EX):

- Monitor the CDB for the required operand
- Checks for RAW hazards in this process

•  Write Result (WB):

- Write to the CDB
- Picked up by any RS, store buffer, or register

 Register Renaming 

•  In an RS, operands are referred to by a tag (if the operand is not already in a register)
•  The tag refers to the RS (which contains the instruction) which will produce the required operand
•  Thus each RS acts as a virtual register

The Data Structure  

•  Three parts, like in the scoreboard:


- Instruction status
- Reservation stations, load/store buffers, register file
- Register status: which unit is going to produce the register value

•  This is the register --> virtual register mapping

Components of RS, Reg. File, Load/Store Buffers 

•  Each RS has:

- Op: the operation (+, -, x, /)
- Vj, Vk: the operands (if available)
- Qj, Qk: the tags of the RSs producing Vj/Vk (0 if Vj/Vk known)
- Busy: is the RS busy?

•  Each reg. in the reg. file and store buffer has:

- Qi: tag of the RS whose result should go to the reg. or the mem. locn. (blank ==> no such active RS)

•  Load and store buffers have:

- Busy field; a store buffer has the value V to be stored
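One way to picture this bookkeeping in C. The field names (Op, Vj/Vk, Qj/Qk, Busy, Qi) follow the slide; the types, sizes, and the broadcast helper are assumptions of this sketch:

#include <stdint.h>

enum fu_op { FU_ADD, FU_SUB, FU_MUL, FU_DIV };

struct reservation_station {
    enum fu_op op;   /* the operation to perform                           */
    double vj, vk;   /* operand values, once known                         */
    uint8_t qj, qk;  /* tags of RSs producing vj/vk; 0 = value present     */
    int busy;        /* is this RS in use?                                 */
};

struct register_status {
    uint8_t qi;      /* tag of the RS that will write this register; 0 = none */
};

struct store_buffer {
    uint32_t addr;   /* effective address                  */
    double v;        /* value to store, once known         */
    uint8_t qi;      /* tag of the RS producing the value  */
    int busy;
};

/* Write Result stage: when a result with tag t appears on the CDB,
   every waiting RS picks it up and clears the corresponding tag. */
void cdb_broadcast(struct reservation_station *rs, int n,
                   uint8_t t, double value)
{
    for (int i = 0; i < n; i++) {
        if (rs[i].qj == t) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == t) { rs[i].vk = value; rs[i].qk = 0; }
    }
}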

Maintaining the Data Structure 

•  Issue:

- Wait until: RS or buffer empty
- Updates: Qj, Qk, Vj, Vk, Busy of the RS/buffer; maintain the register mapping (register status)

•  Execute:

- Wait until: Qj=0 and Qk=0 (operands available)

•  Write result:

- CDB result picked up by RSs (update Qj, Qk, Vj, Vk), store buffers (update Qi, V), register file (update register status)
- Update Busy of the RS which finished

Some Examples 

•  Randy Katz's CS252 slides... (Lecture 11, Spring 1996)
•  Dynamic loop unrolling example from the text


Dynamic Loop Unrolling

Loop: LD   F0, 0(R1)   // F0 is array element
      ADDD F4, F0, F2  // F2 has the scalar 'C'
      SD   0(R1), F4   // Store result
      SUBI R1, R1, 8   // For next iteration
      BNEZ R1, Loop    // More iterations?

•  Assume the branch is predicted to be taken
•  Denote: load buffers as L1, L2...; ADDD RSs as A1, A2...
•  First loop: F0 --> L1, F4 --> A1
•  Second loop: F0 --> L2, F4 --> A2

Summary Remarks 

•  Memory disambiguation required
•  Drawbacks of Tomasulo:

- Large amount of hardware
- Complex control logic
- CDB is a performance bottleneck

•  But:

- Required if designing for an old ISA
- Multiple issue ==> register renaming and dynamic scheduling required

•  Next class: branch prediction

 


 Dealing with Control Hazards 

•  Software techniques:

- Branch delay slots
- Software branch prediction

•  Canceling or nullifying branches

- Misprediction rates can be high
- Worse if multiple issue per cycle

•  Hence, hardware/dynamic branch prediction

Branch Prediction Buffer 

•  PC --> Taken/Not-Taken (T/NT) mapping
•  Can use just the last few bits of the PC

- Prediction may be that of some other branch
- OK, since correctness is not affected

•  Shortcoming of this prediction scheme:

- Branch mispredicted twice for each execution of a loop
- Bad if the loop is small


for(int i = 0; i < 10; i++) {

x[i] = x[i] + C;

}

Two-Bit Predictor 

•  Have to mispredict twice before changing the prediction

- Built-in hysteresis

•  General case is an n-bit predictor

- 0 to (2^n)-1 saturating counter
- 0 to 2^(n-1) - 1 ==> predict not-taken
- 2^(n-1) to (2^n)-1 ==> predict taken

•  Experimental studies: 2-bit as good as n-bit 
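A minimal C sketch of such a predictor; the table size and PC indexing are assumptions, only the 2-bit saturating behaviour comes from the slides:

#include <stdio.h>

#define INDEX_BITS 12
#define TABLE_SIZE (1 << INDEX_BITS)

/* 2-bit saturating counters: 0,1 predict not-taken; 2,3 predict taken. */
static unsigned char counter[TABLE_SIZE];

static int predict(unsigned pc)            /* 1 = taken */
{
    return counter[(pc >> 2) & (TABLE_SIZE - 1)] >= 2;
}

static void update(unsigned pc, int taken)
{
    unsigned char *c = &counter[(pc >> 2) & (TABLE_SIZE - 1)];
    if (taken  && *c < 3) (*c)++;          /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;          /* saturate at 0 */
}

int main(void)
{
    /* A loop branch taken 9 times then not taken, repeated: after
       warm-up, the 2-bit counter mispredicts only the loop exit,
       not twice per loop as a 1-bit scheme would. */
    int miss = 0;
    for (int rep = 0; rep < 10; rep++)
        for (int i = 0; i < 10; i++) {
            int taken = (i != 9);
            miss += (predict(0x400) != taken);
            update(0x400, taken);
        }
    printf("mispredictions: %d / 100\n", miss);   /* 12 */
    return 0;
}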

Implementing Branch Prediction Buffers 

•  Implementing branch prediction buffers:

- Small cache accessed along with the instruction in IF
- Or, additional 2 bits in the instruction cache

•  Note: a branch prediction buffer is not useful for the DLX pipeline

- Branch target not known earlier than branch condition

Prediction Performance


 

•  4096 entries in the prediction buffer
•  SPEC89, IBM Power architecture

Improving Branch Prediction 

•  Two ways: increase buffer size, improve accuracy

 Improving Prediction Accuracy 


•  Predict branches based on outcomes of recent other branches

  if(aa == 2) {

aa = 0;

}

if(bb == 2) {

bb = 0;

}

if(aa == bb) {

// Do something

}  

•  Correlating, or two-level predictor

Two-Level Predictor 

•  There are effectively two predictors for each branch:

- Depending on whether the previous branch was T/NT

Prediction bits   Prediction if last branch NT   Prediction if last branch T
NT/NT             NT                             NT
NT/T              NT                             T
T/NT              T                              NT
T/T               T                              T

•  The last predictor was a (1,1) predictor

- One bit each of history, and prediction

•  General case is an (m,n) predictor

- m bits of history, n bits of prediction

•  How to implement?

- Have an m-bit shift register

Cost of Two-Level Predictor 

•  Number of bits required:

- Num. branch entries x 2^m x n

•  How many bits in a 4096-entry (0,2) predictor?

- 8K

•  How many branch entries in an 8K-bit (2,2) predictor?

- 1K
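The storage-cost arithmetic, coded for checking:

#include <stdio.h>

/* Storage for an (m,n) predictor: entries x 2^m x n bits. */
static long predictor_bits(long entries, int m, int n)
{
    return entries * (1L << m) * n;
}

int main(void)
{
    printf("4096-entry (0,2): %ld bits\n", predictor_bits(4096, 0, 2)); /* 8192 = 8K */
    /* Inverted: entries that fit in an 8K-bit (2,2) predictor. */
    printf("8K-bit (2,2):     %ld entries\n", 8192L / ((1L << 2) * 2)); /* 1024 = 1K */
    return 0;
}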

Performance of (2, 2) Predictor

 

Branch Target Buffer 

•  Branch prediction buffer is not useful for DLX

- Need to know the target address by the end of IF

•  Store the branch target address also

- Branch target buffer, or cache


•  Access branch target buffer in IF cycle

     - Hit ==> predicted branch target known at the end of IF

     - We also need to know if the branch is predicted T/NT

 

Lookup based on PC ==> predicted target

•  No entry found ==> (Target = PC + 4)
•  Exact match of PC is important

- Since we are predicting even before knowing that it is a branch instruction
- Hardware is similar to a cache

•  Need to store predicted PC only for taken predictions

Steps in Using a Target Buffer

 

Penalties in Branch Prediction

Buffer hit?   Branch taken?   Penalty
Yes           Yes             0
Yes           No              2
No            -               2

•  Given a prediction accuracy of p, a buffer hit rate of h, and a taken-branch frequency of f, what is the branch penalty?

- h x (1-p) x 2 + (1-h) x f x 2
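The penalty expression in C; the sample values of p, h, and f are hypothetical, just for illustration:

#include <stdio.h>

/* Average branch penalty (cycles/branch) per the table above:
   2 cycles for a buffer hit that is mispredicted,
   2 cycles for a buffer miss on a taken branch. */
static double branch_penalty(double p, double h, double f)
{
    return h * (1.0 - p) * 2.0 + (1.0 - h) * f * 2.0;
}

int main(void)
{
    /* Assumed: 90% accuracy, 90% hit rate, 60% of branches taken. */
    printf("penalty = %.2f cycles/branch\n", branch_penalty(0.9, 0.9, 0.6));
    return 0;
}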

Storing Target Instructions 

•  Directly store instructions instead of target address

- Target buffer access is now allowed to take longer
- Or, branch folding can be achieved

•  Replace the fetched instruction with that found in the target buffer entry
•  Zero-cycle unconditional branch; may be conditional as well

Increasing ILP through Multiple Issue 

•  With at most one issue per cycle, min CPI possible is 1

- But there are multiple functional units
- Hence, use multiple issue

•  Two ways to do multiple issue

- Superscalar processor

•  Issue a varying number of instructions per cycle
•  Static or dynamic scheduling

- Very Long Instruction Word (VLIW)

•  Issue a fixed number of instructions

Superscalar DLX 

•  Simple version: two instructions issued per cycle

- One integer (load, store, branch, integer ALU) and one FP
- Instructions paired and aligned on 64-bit boundaries - int first, FP next

            CC1   CC2   CC3   CC4   CC5   CC6
Integer     IF    ID    EX    MEM   WB
FP          IF    ID    EX    MEM   WB
Integer           IF    ID    EX    MEM   WB
FP                IF    ID    EX    MEM   WB

•  No conflicts, almost...

- Assuming separate register sets, only FP load, store, move cause problems

•  Structural hazard on a register port
•  New RAW hazard between a pair of instructions

- Structural hazard:

•  Detect, and do not issue the FP operation
•  Or, provide additional register ports

- RAW hazard:

•  Detect, and do not issue the FP operation
•  Also, result of LD cannot be used for 3 instns.

Static Scheduling in the Superscalar DLX: An Example 

Loop: LD   F0, 0(R1)     // F0 is array element
      ADDD F4, F0, F2    // F2 has the scalar 'C'
      SD   0(R1), F4     // Stored result
      SUBI R1, R1, 8     // For next iteration
      BNEZ R1, Loop      // More iterations?

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)    ADDD F4, F0, F2
      LD   F14, -24(R1)    ADDD F8, F6, F2
      LD   F18, -32(R1)    ADDD F12, F10, F2
      SD   0(R1), F4       ADDD F16, F14, F2
      SD   -8(R1), F8      ADDD F20, F18, F2
      SD   -16(R1), F12
      SUBI R1, R1, #40
      SD   16(R1), F16     // 16 - 40 = -24
      BNEZ R1, Loop
      SD   8(R1), F20      // 8 - 40 = -32; fills the branch delay slot

Dynamic Scheduling in the Superscalar DLX 

•  Scoreboard or Tomasulo can be applied
•  Should preserve in-order issue!

- Use separate data structures for Int and FP

•  When the instruction pair has a dependence

- We wish to issue both in the same cycle
- Two approaches:

•  Pipeline the issue stage, so that it runs twice as fast
•  Exclude load/store buffers from the set of RSs

Multiple Issue using VLIW 

•  Superscalar ==> too much hardware

- For hazard detection, scheduling

•  Alternative: let compiler do all the scheduling

- VLIW (Very Long Instruction Word)
- E.g., a VLIW may include 2 Int, 2 FP, 2 mem, and a branch

Limitations to Multiple Issue 

•  Why not 10 issues per cycle? Why not 20?
•  Three limitations:

- Inherent ILP limitations in programs
- Hardware costs (even for VLIW)

•  Memory/register bandwidth

- Implementation issues:


•  Superscalar: complexity of hardware logic
•  VLIW: increased code size, binary compatibility problems

Support for ILP 

•  Software (compiler) support
•  Hardware support
•  Combination of both

Compiler Support for ILP 

•  Loop unrolling:

- Dependence analysis is a major component
- Analysis is simple when array indices are linear in the loop variable (called affine indices)

•  Limitations to dependence analysis:

- Pointers
- Indirect indexing
- Analysis has to consider corner cases too

•  Two important techniques:

- Software pipelining
- Trace scheduling

•  Software pipelining: reorganize a loop such that each iteration is made from instructions chosen from different iterations of the original loop

Software Pipelining


 

Software Pipelining in Our Example

Loop: LD   F0, 0(R1)     // F0 is array element
      ADDD F4, F0, F2    // F2 has the scalar 'C'
      SD   0(R1), F4     // Stored result
      SUBI R1, R1, 8     // For next iteration
      BNEZ R1, Loop      // More iterations?

Original iterations:                Software-pipelined loop:

Iter i:   LD   F0, 0(R1)
          ADDD F4, F0, F2           Loop: SD   16(R1), F4
          SD   0(R1), F4                  ADDD F4, F0, F2
Iter i+1: LD   F0, 0(R1)                  LD   F0, 0(R1)
          ADDD F4, F0, F2                 SUBI R1, R1, 8
          SD   0(R1), F4                  BNEZ R1, Loop
Iter i+2: LD   F0, 0(R1)
          ADDD F4, F0, F2

Trace Scheduling 


 Hardware Support for Speculation 

•  Conditional or predicated instructions
•  Execute on condition, annul otherwise
•  Example: conditional move

if (A == 0) { S = T; }

// Without a conditional move (R1 holds A, R2 holds S, R3 holds T):
      BNEZ  R1, L
      MOV   R2, R3
L:

// With a conditional move:
      CMOVZ R2, R3, R1

•  Control dependence has been eliminated

-  Dependence resolution now moves to WB stage

Scheduling Using Conditional Instructions 

Before scheduling:                  After scheduling with a conditional load:

LW   R1, 40(R2)   ADD R3, R4, R5   LW   R1, 40(R2)         ADD R3, R4, R5
                  ADD R6, R3, R7   LWC  R8, 20(R12), R10   ADD R6, R3, R7
BEQZ R10, L                        BEQZ R10, L
LW   R8, 20(R12)                   LW   R9, 0(R8)
LW   R9, 0(R8)

Empty slot filled, stall for the last load eliminated

Limitations of Conditional Instructions

•  Usefulness limitations:

- Condition must not be delayed due to data dependence
- Useful only for simple alternative sequences

•  Performance limitations:

- An annulled conditional instruction is equivalent to a no-op/stall

•  Except when filling an anyway empty slot

- Speed penalty in terms of higher clock cycle time

Speculation 

•  Wish to move instructions across branches

- To eliminate possible stalls
- For better scheduling
- Appropriate conditional instructions may not always exist

•  Example:

if (N == 0) {
    A = *X;
} else {
    A++;
}

Speculation: An Example  

Before speculation:                     After speculation:

      LW   R1, 0(R2)   // Load N              LW   R1, 0(R2)    // Load N
      BNEZ R1, L1      // Test N              LW   R3, 0(R4)    // Load X
      LW   R3, 0(R4)   // Load X              LW   R5, 0(R3)    // Load *X
      LW   R5, 0(R3)   // Load *X             BEQZ R1, L3       // Test N
      JMP  L2          // Skip else           LW   R5, 0(R6)    // Load A
L1:   LW   R5, 0(R6)   // Load A              ADDI R5, R5, #1   // Increment
      ADDI R5, R5, #1  // Increment     L3:   SW   0(R6), R5    // Store A
L2:   SW   0(R6), R5   // Store A

•  Compiler predicts that the "then" clause is most likely
•  Speculatively schedules the "then" clause

- Eliminates 2 stalls, and the JMP instruction 

Exception Behaviour

LW   R1, 0(R2)    // Load N
LW   R3, 0(R4)    // Load X
LW   R5, 0(R3)    // Load *X
BEQZ R1, L3       // Test N
LW   R5, 0(R6)    // Load A
ADDI R5, R5, #1   // Increment
L3:  SW   0(R6), R5    // Store A

•  Terminating vs. non-terminating exceptions
•  While doing such scheduling:

- Correct program ==> no extra terminating exceptions
- Incorrect program ==> should preserve any terminating exceptions

Preserving Exception Behaviour 

•  Approach 1: ignore terminating exceptions for speculated instructions

- Incorrect programs may not be terminated

LW   R1, 0(R2)    // Load N
LW*  R3, 0(R4)    // Load X, speculated
LW*  R5, 0(R3)    // Load *X, speculated
BEQZ R1, L3       // Test N
LW   R5, 0(R6)    // Load A
ADDI R5, R5, #1   // Increment
L3:  SW   0(R6), R5    // Store A

•  Approach 2: poison bits

- Set the poison bit in the result register of a speculated instruction, if an exception occurs
- Raise an exception if any other instruction uses that register

LW   R1, 0(R2)    // Load N
LW*  R3, 0(R4)    // Load X, set poison bit on exception
LW*  R5, 0(R3)    // Load *X, set poison bit on exception
BEQZ R1, L3       // Test N
LW   R5, 0(R6)    // Load A
ADDI R5, R5, #1   // Increment
L3:  SW   0(R6), R5    // Store A

•  Extra register R10 gets used up
•  Extra instruction in "else" clause

•  Approach 3: buffer results

- Instructions boosted past branches, flagged as boosted (in the opcode)
- Results of boosted instructions forwarded and used, like in Tomasulo
- When the branch is reached, the result of speculation is checked

•  Result committed if prediction correct
•  Result discarded otherwise

- Solution close to fully hardware-based speculation

Boosted Instructions: An Example

C code:                     Boosted schedule:

if (N == 0) {                     LW    R1, 0(R2)   // Load N
    A = *X;                       LW+   R3, 0(R4)   // Load X, boosted
    N++;                          ADDI+ R1, R1, #1  // N++, boosted
} else {                          LW+   R5, 0(R3)   // Load *X, boosted
    A++;                          BEQZ  R1, L3      // Test N
}                                 LW    R5, 0(R6)   // Load A
                                  ADDI  R5, R5, #1  // Increment
                            L3:   SW    0(R6), R5   // Store A

•  The "+" denotes a boosted instruction, which is boosted across the next branch, which is predicted taken

Hardware-Based Speculation 

•  Combination of branch prediction, speculation, and dynamic scheduling
•  Data flow execution: an instruction executes as soon as the data it requires is ready
•  Advantages over the software approach:

- Memory disambiguation is better
- Better branch prediction
- Precise exception model
- No book-keeping code
- Works for "old" software too

•  Disadvantage: hardware cost and complexity

Speculation in Tomasulo 

•  Speculate using branch prediction
•  Go ahead and execute based on speculation
•  Use results of speculated instructions for other instructions, just as in Tomasulo
•  But, commit a result only after knowing if the speculation was correct

- In-order commit
- Using a reorder buffer
- Also achieves precise exceptions

The Reorder Buffer 

•  Similar to the store buffer in functionality
•  Replaces the store and load buffers
•  Virtual registers are the reorder buffer entries

- The reservation stations are not virtual registers anymore

•  Reorder Buffer Data Structure (a sketch follows below)

- Instruction type: branch, store, or ALU/load
- Destination: register or memory location
- Value: which has to be committed
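A minimal C sketch of one such entry (all field and type names are illustrative assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    enum rob_type { ROB_BRANCH, ROB_STORE, ROB_ALU_OR_LOAD };

    /* One reorder-buffer entry; entries are allocated and committed
     * in circular (FIFO) order. */
    struct rob_entry {
        enum rob_type type;     /* branch, store, or ALU/load        */
        uint32_t      dest;     /* register number or memory address */
        uint64_t      value;    /* result waiting to be committed    */
        bool          ready;    /* has the value been produced yet?  */
    };

    struct rob {
        struct rob_entry entry[64];
        unsigned head, tail;    /* commit from head, issue at tail   */
    };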


Tomasulo Using the Reorder Buffer 

 Pipeline Stages 

•  Issue, EX, WB, Commit
•  Issue allocates a reorder buffer entry

- Entries allocated in circular fashion

•  Commit writes result back to destination

- Frees up the reorder buffer entry
- For a branch instruction:

•  Prediction correct ==> commit
•  Else, flush the reorder buffer

Summary of ILP Techniques 

•  Software techniques

- Compiler scheduling, Loop unrolling, Software pipelining, Trace scheduling (VLIW), Static branch prediction, Speculation

•  Hardware support for software

- Conditional instructions, poison bits


•  Hardware techniques

- Hardware scheduling, Dynamic branch prediction, Hardware speculation

•  Which hardware technique(s) to use? 

How Much ILP is Available? 

•  Assume infinite hardware resources

- Infinite virtual registers
- Perfect branch prediction, jump prediction
- Perfect memory disambiguation

•  Every instruction is scheduled as early as possible

-  Restricted only by data flow

Available ILP in Programs 

 

 Window Size Limitation 


 

Effect of Imperfect Branch Predictions 

 

Effect of Finite Virtual Register Set


   

A Realizable Processor 

•  Up to 64 instruction issues per cycle
•  Selective predictor with 1K entries, and a 16-entry return predictor
•  Perfect memory disambiguation
•  Register renaming with 64 integer virtual registers, and 64 FP virtual registers

ILP for a Realizable Processor 

 


Memory Hierarchy

•  Two principles:

- Smaller is faster
- Principle of locality

•  Processor speed grows much faster than memory speed
•  Registers - Cache - Memory - Disk

- Upper level vs. lower level

•  Cache design

Cache Design Questions  

•  Cache is arranged in terms of blocks

- To take advantage of spatial locality

•  Design choices:

- Q1: block placement - where to place a block in the upper level?
- Q2: block identification - how to find a block in the upper level?
- Q3: block replacement - which block to replace on a miss?
- Q4: write strategy - what happens on a write?

Block Placement: Fully Associative  


  Block Placement: Direct

  

  Block Placement: Set Associative

  

  Continuum of Choices

•  Memory has n blocks, cache has m blocks
•  Fully associative is the same as set associative with one set (m-way set associative)
•  Direct placement is the same as 1-way set associative (with m sets)
•  Most processors use direct, 2-way/4-way set associative

Block Identification  

•  How many different blocks of memory can be mapped (at different times) to a cache block?

- Fully associative: n
- Direct: n/m
- k-way set associative: k*n/m

•  Each cache block has a tag saying which block of memory is currently present in it

- A valid bit is set to 0 if no memory block is in the cache block currently

•  How many sets in the cache?

     - m/k

•  How many bits to identify the correct set?

     - log2(m/k) bits of index

•  How many blocks in memory?

     - n; log2(n) bits represent the block number in memory

•  How many bits for the tag?

     - log2(n) - log2(m/k) = log2(k*n/m)

•  Given a memory address:

- Select the set using the index, the block from the set using the tag
- Select the location from the block using the block offset
- tag + index = block address (see the address-decomposition sketch below)
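A small C sketch of the address split (the bit widths are illustrative; the Alpha 21064 cache described later happens to use a 5-bit offset and an 8-bit index):

    #include <stdint.h>

    #define OFFSET_BITS 5    /* 32-byte blocks */
    #define INDEX_BITS  8    /* 256 sets */

    uint32_t block_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
    uint32_t set_index(uint32_t addr)    { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    uint32_t tag(uint32_t addr)          { return addr >> (OFFSET_BITS + INDEX_BITS); }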

  Block Replacement Policy

•  Cache miss ==> bring the block into the cache

- What if there is no free block in the set?
- Need to replace a block

•  Possible policies:

- Random
- Least-Recently Used (LRU)

•  Lower miss rate, but harder to implement

Replacement Policy Performance

  

Write Strategy

  •  Reads are dominant

- All instructions are read
- Even for data, loads dominate over stores

•  Reads can be fast

- Can read from multiple blocks while performing tag comparison
- Cannot do the same with writes

•  Should pay attention to write performance too!

  When do Writes go to Memory?  


•  Write through: each write is mirrored to memory also

- Easier to implement

•  Write back: write to memory only when block is replaced

- Faster writes
- Some writes do not go to memory at all!
- But, a read miss may cause more delay

•  The block being replaced has to be written back
•  Optimize using a dirty bit

- Also, bad for multiprocessors and I/O

Write Stalls

•  In write through, may have to stall waiting for write to complete

- Called a write stall
- Can employ a write buffer to enable the processor to proceed during the write-through

What to do on a Write Miss?  

•  Write-allocate (or, fetch-on-write): load the block into the cache on a write miss
•  No-write-allocate (or, write-around): just write directly to main memory
•  Write-allocate usually goes with write-back, and no-write-allocate goes with write-through

The Alpha AXP 21064 Cache  

•  34-bit physical address

- 29 bits for block address
- 5 bits for block offset

•  8 KB cache, direct-mapped

- 8 bits for index
- 29 - 8 = 21 bits for tag

Steps in Memory Read  

•  Four steps:

- Step-1: CPU puts out the address
- Step-2: Index selection
- Step-3: Tag comparison, read from data
- Step-4: Data returned to CPU (assuming a hit)

•  This takes two cycles

Steps in Memory Write

•  Write-through policy is used
•  Write buffer with four entries

- Each entry can have up to 4 words from the same block
- Write merging: successive writes to the same block use the same write-buffer entry

  Some More Details  

•  What happens on a miss?

- Cache sends a signal to the CPU asking it to wait
- No replacement policy required (direct mapped)
- Write miss ==> write-around

•  8KB separate instruction cache

Separate versus Unified Cache

•  Direct-mapped cache, 32-byte blocks, SPEC92, on DECstation 5000
•  Unified cache has twice the size of the I-cache or D-cache
•  75% instruction references

Miss rates:

Size    I-Cache   D-Cache   U-Cache
1KB     3.06%     24.61%    13.34%
2KB     2.26%     20.57%    9.78%
4KB     1.78%     15.94%    7.24%
8KB     1.10%     10.19%    4.57%
16KB    0.64%     6.47%     2.87%
32KB    0.39%     4.82%     1.99%
64KB    0.15%     3.77%     1.35%
128KB   0.02%     2.88%     0.95%

 Cache Performance

•  Miss rate is an important metric

- But not the only one

Avg. mem. access time = Hit time + Miss rate X Miss penalty

•  Hit time, Miss penalty can be expressed

- In absolute terms,
- Or, in terms of number of clock cycles

•  Miss rate decrease may imply reduced performance

- Example: unified vs. split cache

  CPU Performance, with Cache

CPU time = (CPU cycles + Mem. stall cycles) x Cycle time

Mem. stall cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty

CPU time = IC x Cycle time x (CPI + Mem. accesses per instn. x Miss rate x Miss penalty)

 

Effect of Cache on Performance


•  Some typical values:

- CPI = 1
- Mem. accesses per instn. = 1.35
- Miss rate = 2%
- Miss penalty = 50 cycles

•  Mem. stalls per instn. = 1.35 x 0.02 x 50 = 1.35, comparable to the CPI!

- Cache behaviour is an important component of performance
- More important for lower CPI

Improving Cache Performance

  Avg. mem. access time = Hit time + Miss rate X Miss penalty

  •  Three possibilities:

- Reduce miss rate
- Reduce miss penalty
- Reduce hit time

•  Beware of slowing down the CPU!
•  Example:

- Set associative ==> potentially higher cycle time

Cache Misses: The Three C's  

•  Compulsory: first access to a block

- Also called cold start, or first reference misses

•  Capacity: misses due to the cache being small
•  Conflict: two memory blocks mapping onto the same cache block

- Also called collision, or interference misses

The Three C's

Cache size  Assoc.  Compulsory  Capacity  Conflict  Total    Frac.Comp  Frac.Cap  Frac.Conf
1KB         1-way   0.20%       8.00%     5.20%     13.40%   0.01       0.60      0.39
1KB         2-way   0.20%       8.00%     2.30%     10.50%   0.02       0.76      0.22
1KB         4-way   0.20%       8.00%     1.30%     9.50%    0.02       0.84      0.14
1KB         8-way   0.20%       8.00%     0.50%     8.70%    0.02       0.92      0.06
2KB         1-way   0.20%       4.40%     5.20%     9.80%    0.02       0.45      0.53
2KB         2-way   0.20%       4.40%     3.00%     7.60%    0.03       0.58      0.39
2KB         4-way   0.20%       4.40%     1.80%     6.40%    0.03       0.69      0.28
2KB         8-way   0.20%       4.40%     0.80%     5.40%    0.04       0.81      0.15
4KB         1-way   0.20%       3.10%     3.90%     7.20%    0.03       0.43      0.54
4KB         2-way   0.20%       3.10%     2.40%     5.70%    0.04       0.54      0.42
4KB         4-way   0.20%       3.10%     1.60%     4.90%    0.04       0.63      0.33
4KB         8-way   0.20%       3.10%     0.60%     3.90%    0.05       0.79      0.15
8KB         1-way   0.20%       2.30%     2.10%     4.60%    0.04       0.50      0.46
8KB         2-way   0.20%       2.30%     1.30%     3.80%    0.05       0.61      0.34
8KB         4-way   0.20%       2.30%     1.00%     3.50%    0.06       0.66      0.29
8KB         8-way   0.20%       2.30%     0.40%     2.90%    0.07       0.79      0.14
16KB        1-way   0.20%       1.50%     1.20%     2.90%    0.07       0.52      0.41
16KB        2-way   0.20%       1.50%     0.50%     2.20%    0.09       0.68      0.23
16KB        4-way   0.20%       1.50%     0.30%     2.00%    0.10       0.75      0.15
16KB        8-way   0.20%       1.50%     0.20%     1.90%    0.11       0.79      0.11
32KB        1-way   0.20%       1.00%     0.80%     2.00%    0.10       0.50      0.40
32KB        2-way   0.20%       1.00%     0.20%     1.40%    0.14       0.71      0.14
32KB        4-way   0.20%       1.00%     0.10%     1.30%    0.15       0.77      0.08
32KB        8-way   0.20%       1.00%     0.10%     1.30%    0.15       0.77      0.08
64KB        1-way   0.20%       0.70%     0.50%     1.40%    0.14       0.50      0.36
64KB        2-way   0.20%       0.70%     0.10%     1.00%    0.20       0.70      0.10

Reducing Cache Misses

  •  Capacity: increase cache size

- Thrashing can happen otherwise

•  Conflict: increase associativity

- But, greater complexity, slower hit time

•  Compulsory: increase block size

- But, greater miss penalty!

Technique-1: Larger Blocks

•  Reduces compulsory misses

- By improving spatial locality

•  Increases miss penalty
•  Also, may increase conflict/capacity misses

Miss rates:

Block size   1KB      4KB     16KB    64KB    256KB
16B          15.05%   8.57%   3.94%   2.04%   1.09%
32B          13.34%   7.24%   2.87%   1.35%   0.70%
64B          13.76%   7.00%   2.64%   1.06%   0.51%
128B         16.64%   7.78%   2.77%   1.02%   0.49%


•  Miss penalty depends on:

- Memory latency, memory bandwidth

•  Assuming a latency of 40 cycles, and a bandwidth of 16 bytes per 2 cycles, the AMAT values are:

Block size   Miss penalty   1KB     4KB    16KB   64KB   256KB
16B          42             7.32    4.60   2.66   1.86   1.46
32B          44             6.87    4.19   2.26   1.59   1.31
64B          48             7.61    4.36   2.27   1.51   1.25
128B         56             10.31   5.35   2.55   1.57   1.27

Technique-2: Higher Associativity

•  Reduces conflict misses
•  But, increases hit time
•  8-way is as good as fully associative
•  Rule of thumb:

- A direct-mapped cache of size N has the same miss rate as a 2-way cache of size N/2

Technique-3: Victim Cache  

•  Small cache of "victim" blocks, which were thrown out recently

- Fully associative

•  Reduces conflict misses
•  Does not affect cycle time, or miss penalty
•  Study: a 4-entry victim cache removed 20-95% of the conflict misses in a 4KB direct-mapped cache

Technique-4: Pseudo- Associative Cache

•  Also called column associative
•  A hit proceeds just as in a direct-mapped cache
•  Miss ==> check the other block of the "set" (by flipping the MSB of the index)
•  May need to swap contents within the set

Miss rate(pseudo) = Miss rate(2-way)

Miss penalty(pseudo) = Miss penalty(1-way)

Hit time(pseudo) = Hit time(1-way) + Alt. hit rate x k, where k is the extra time for the alternate probe

Alt. hit rate = Miss rate(1-way) - Miss rate(2-way)

Technique-5: Hardware Prefetching  

•  Fetch more than required, on a miss

- Prefetch into the cache, or into another small buffer (faster than memory)

Avg. mem. access time = Hit time + Miss rate x Prefetch hit rate x k + Miss rate x Prefetch miss rate x Miss penalty, where k is the time to access the prefetch buffer

  Technique-6: Compiler Controlled Prefetch  

•  Special instructions for prefetching data

- Non-faulting instructions are most useful
- The CPU should be able to proceed in parallel with the cache

•  Non-blocking cache

•  Example:

for (i = 0; i < 3; i++) {
    for (j = 0; j < 100; j++) {
        a[i][j] = b[j][0] * b[j+1][0];
    }
}
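A sketch of what compiler-inserted prefetches might look like for this loop, using GCC/Clang's __builtin_prefetch intrinsic (the prefetch distance of 8 iterations is an illustrative choice, not from the slides):

    int a[3][100], b[101][3];

    void compute(void) {
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 100; j++) {
                /* Fetch data needed ~8 iterations ahead; the bound checks
                 * keep the prefetch addresses inside the arrays. */
                if (j + 8 < 101) __builtin_prefetch(&b[j + 8][0]);
                if (j + 8 < 100) __builtin_prefetch(&a[i][j + 8]);
                a[i][j] = b[j][0] * b[j + 1][0];
            }
        }
    }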

Technique-7: Compiler Optimizations

•  Merging arrays - improves spatial locality

int val[1000];                struct merge { int val; int key; };
int key[1000];                struct merge M[1000];

•  Loop interchange - improves spatial locality

for (j = 0; j < 100; j++)          for (i = 0; i < 100; i++)
    for (i = 0; i < 100; i++)          for (j = 0; j < 100; j++)
        x[i][j] = 0;                       x[i][j] = 0;

•  Loop fusion - improves temporal locality

for (i = 0; i < 100; i++)          for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)          for (j = 0; j < 100; j++) {
        a[i][j] = b[i][j] + c[i][j];       a[i][j] = b[i][j] + c[i][j];
                                           d[i][j] = 2*a[i][j];
for (i = 0; i < 100; i++)              }
    for (j = 0; j < 100; j++)
        d[i][j] = 2*a[i][j];

•  Blocking: operate on small blocks of matrices (a sketch follows below)

- Improves temporal locality
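A minimal C sketch of blocking, applied to the classic matrix multiply (N and the tile size B are illustrative choices; each BxB tile of z is reused for many rows of x while it stays cache-resident):

    #define N 512
    #define B 32   /* tile edge, sized so a tile fits in the cache */

    double x[N][N], y[N][N], z[N][N];   /* zero-initialized globals */

    /* Accumulates x += y*z, one BxB tile of z at a time. */
    void matmul_blocked(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = x[i][j];
                        for (int k = kk; k < kk + B; k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] = r;
                    }
    }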

Miss-Rate Reduction: Summary  

•  Larger blocks
•  Higher associativity
•  Victim cache
•  Pseudo-associativity
•  Hardware prefetching
•  Software-controlled prefetching
•  Code optimization by the compiler


 Technique-1: Prioritize Read Misses over Writes  

•  Write-through cache ==> write buffer

- Beware of consistency
- Example: store x, load y, load x - with x and y in the same block

•  Possible solution: wait for the write buffer to clear before processing any read miss
•  Better (but more complex) solution: check the write buffer, and process the read miss first
•  Write-back cache: write back the dirty block after processing the read miss

Technique-2: Sub-Block Placement  

•  Sub-block: units smaller than the full block

- Valid bits added to sub-blocks
- Only a sub-block is read on a cache miss

•  How is this different from just using a smaller block size?

- The tag length is reduced (good for an on-chip cache)

Technique-3: Restart CPU ASAP

•  Early restart: the CPU can proceed as soon as the requested word is loaded into the cache
•  Critical word first: the requested word is fetched first

- A.k.a. wrapped fetch, or requested word first

•  These are good for caches with large blocks
•  What if another access to the same block arrives before it is fully loaded?

- Stall if that portion of the block is not yet loaded

Technique-4: Non-blocking Cache  

•  For OOO CPUs (e.g. Tomasulo)

- No point in stalling the CPU on a miss
- Hit-under-miss allows hits while the cache is processing a miss
- Hit-under-multiple-miss can benefit more
- Miss-under-miss makes sense if main memory can handle more than one request in parallel

•  This significantly increases the complexity of the cache controller

Non-Blocking Cache Performance

  

  Technique-5: Second-Level Caches  

•  L1 cache can be small and fast
•  L2 cache can be larger, but still faster than main memory

Avg. mem. access time = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)

Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)

•  Local miss rate: misses w.r.t. memory accesses to this cache
•  Global miss rate: misses w.r.t. memory accesses by the CPU

Local and Global Miss Rates

  


Second Level Cache Design

•  L2 can be larger

- Big enough to virtually eliminate capacity misses

•  Higher associativity does not hurt

- CPU clock cycle time is not affected

•  Larger block size to further reduce misses
•  Multi-level inclusion property: L2 contains all data that L1 contains

- More work on a second-level miss

Small and Simple Caches

  •  Keep the cache small

- Faster
- Can fit inside the processor
- Trade-off: tags within the processor, data outside

•  Keep the cache simple

- Direct-mapped ==> tag comparison can be in parallel with data transmission

Other Techniques

•  Faster writes: pipeline writes

- Split the tag and data storage in the cache
- Pipeline stage-1: tag access and comparison
- Pipeline stage-2: write data

•  Dealing with virtual address --> physical address translation

- Avoid it (virtually addressed caches)
- In parallel with cache access (virtually indexed, physically tagged cache)

  Virtual Memory

•  Another level in the hierarchy
•  Uses of virtual memory:

- Level of indirection

•  No program overlays required
•  Easy relocation

- Sharing and protection

Just like Memory-->Cache in Functionality...

Memory --> Cache             VM --> Physical memory
Cache line                   Page or segment
Cache miss                   Page fault
Memory --> cache mapping     VA --> PA mapping (address translation)

  But Quite Different Quantitatively...

Parameter          Memory --> Cache   VM --> Ph. Memory
Hit time           1-2 cycles         40-100 cycles
Miss penalty       8-100 cycles       O(10ms-100ms)
Miss rate          0.5-10%            0.00001-0.001%
Block/page size    16-128 bytes       4-64KB
Upper level size   16KB-1MB           O(1GB)

•  Page faults handled in software

- Be very careful about what you discard
- There is lots of time anyway

•  VM size determined by the ISA
•  VM is not quite the hard-disk...


Paging versus Segmentation

Criterion                  Paging                                       Segmentation
Block size                 Uniform (4 to 64KB)                          Variable (max: 2^16-2^32 bytes, min: 1 byte)
Words per address          One                                          Two
Programmer visible?        No                                           Perhaps
Block replacement          Easy                                         Need to find contiguous memory
Memory use inefficiency    Internal fragmentation                       External fragmentation
Efficient disk traffic?    Usually yes (for an appropriate page size)   Not for small segments

  •  Other possibilities:

- Paged segments
- Choices for page size

The Four Memory Hierarchy Questions

  •  Where to place a block?

- Fully associative

•  How to find a block in main memory?

- Page table, or inverted table; cached in TLB


•  Which block to replace?

- LRU, with the help of a use/reference bit

•  What happens on write?

- Write-back, write-allocate

  Trade-Offs in Page-Size

•  Large page size good for:

- Smaller page tables
- Lesser TLB miss rate
- Efficient disk or network transfer
- Faster cache hits (how?)

•  Smaller page size good for:

- Efficient use of memory (lesser fragmentation)
- Faster process startup time

Fast Translation

  •  Translation Look-aside Buffer (TLB)

- Small table in hardware
- Fully associative
- Fields:

•  The translation, valid bit, use bit, dirty bit, protection bits

•  TLB access can be in critical path

- Pipeline the TLB access
- Overlap cache tag access with translation!

Overlapping Tag Access with Translation

Virtual address:   Page number | Page offset
Cache lookup:      Cache index | Block offset

•  Tag access through the index is independent of translation, as long as the index and block-offset bits fall within the page offset
•  The cache index is virtual, but the tags are physical
•  This potentially limits the cache size
•  Solutions possible:

- Higher associativity
- Page colouring (set associativity)
- Small guessing hardware
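- E.g. (illustrative numbers): with 8KB pages the page offset is 13 bits; a direct-mapped cache with 32-byte blocks and 256 sets uses 5 + 8 = 13 offset + index bits, so the whole cache index lies within the page offset and the tag lookup can start before translation finishes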

Alternate Strategy: Avoid Translation!  

•  Virtually addressed caches:

- Cache is accessed using the virtual address

•  Advantage: faster hit time

•  Disadvantages:

- Cache has to be flushed on a process switch
- What if two different VAs map to the same PA?

•  Synonyms/aliases

- I/O usually uses PA to access memory/cache

Dealing with Virtually Addressed Caches  

•  Avoiding the cache flush:

- Include a PID field in the cache tag

•  Anti-aliasing:

- Page colouring (set associativity)
- Create "enough" colours (sets) to ensure that cache size <= block size x number of sets
- Cache has to be direct-mapped

Main Memory  

•  DRAM versus SRAM

- DRAM is cheaper, but slower


•  Reducing the number of pins

- At the cost of some performance
- Address = RAS + CAS

•  Performance metrics: latency and bandwidth

- #cycles to send the address
- #cycles to access a word
- #cycles to send the data word

Main Memory Performance: One-Word Wide Memory

Suppose:

- #cycles to send the address = 4
- #cycles to access 1 word = 24
- #cycles to send a data word = 4
- Cache line = 4 words

What is the miss penalty?

4 x (4 + 24 + 4) = 128 cycles

Technique-1: Wider Memory

With a memory (and bus) two words wide, what is the miss penalty now?

2 x (4 + 24 + 4) = 64 cycles

Disadvantages?

•  Larger bus width (cost)
•  The unit of memory addition is larger
•  Read-modify-write for a single-byte write, if error correction is present

Technique-2: Interleaved-Memory
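Assuming the timing values from the one-word-wide example and 4 banks of word-interleaved memory, the four word accesses overlap and only the transfers are serialized, so the miss penalty becomes 4 + 24 + 4 x 4 = 44 cycles (a sketch of the standard calculation for this organization).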


Technique-3: Independent Memory Banks

•  Multiple independent accesses

- Separate address and data lines

•  Needed for the miss-under-miss scheme
•  Also, parallel I/O with the CPU
•  Each independent bank may itself be interleaved

- Super-bank number and bank number

  Memory-Bank Conflicts

•  Code can often be such that memory-bank conflicts occur

- No use of the independent memory-bank organization under such conflicts

•  Example:

int x[2][512];

for (j = 0; j < 512; j++) {
    for (i = 0; i < 2; i++) {
        x[i][j]++;
    }
}
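Why this code conflicts: x[0][j] and x[1][j] are 512 words apart, and assuming word-interleaved memory with a power-of-two number of banks (say 4 or 8), 512 mod #banks = 0, so the two accesses made by each inner-loop iteration land in the same bank every time.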

Technique-4: Avoiding Memory-Bank Conflicts

•  Software solutions:

- Loop interchange (works for this example)
- Expand the array size so that it is not a power of two

•  Hardware solution:

- Use a prime number of banks

Bank number = Addr mod #banks
Addr within bank = Addr / #banks

•  The division can be avoided:

Addr within bank = Addr mod #words-within-bank, if #words-within-bank and #banks are co-prime

Technique-5: DRAM-Specific Interleaving  

•  DRAM has RAS and CAS

- Usually RAS and CAS are given one after another
- The same RAS can be used to read multiple columns
- DRAMs come with separate signals to allow such access

Now, various remarks before finishing up with memory-hierarchy design

Virtual Memory and Protection

  •  OS requires support in terms of:

- Two modes (at least) of execution: user, supervisor/kernel
- Some CPU state which is readable but not writable in user mode

•  TLB
•  User/supervisor mode bit

- Mechanisms to switch between the modes

•  System calls

ILP and Caching

•  Superscalar execution:

- Cache must have enough ports to match the peak bandwidth
- Hit-under-miss, miss-under-miss required

•  Speculative execution:

- Suppress exceptions on speculative instructions
- Don't stall the cache on a speculative instruction's cache miss

  ILP vs. Caching: Compiler Choices

int x[32][512];                      int x[32][512];
for (j = 1; j < 512; j++) {          for (i = 0; i < 32; i++) {
    for (i = 0; i < 32; i++) {           for (j = 1; j < 512; j++) {
        x[i][j] = 2*x[i][j-1];               x[i][j] = 2*x[i][j-1];
    }                                    }
}                                    }

The left version has independent iterations in its inner loop (good for ILP) but poor spatial locality; the right version has good locality, but the x[i][j-1] dependence serializes its inner loop.

Caches and Consistency

•  I/O using caches?

- Interferes with the CPU, may throw out useful blocks

•  I/O using main memory:

- Write-through ==> no problem for CPU output
- What about input?

•  Approach-1: OS marks the memory block as non-cacheable
•  Approach-2: OS flushes the cache block after input
•  Approach-3: h/w checks if the block is present in the cache, invalidates it if cached (parallel set of tags for performance)

•  Multi-processors - want the same data in many caches: the cache-coherence problem

 Why Multiprocessors?  

•  Motivation: go beyond the performance offered by a single processor

- Without requiring specialized processors
- Without the complexity of too much multiple issue

•  Opportunity: software is available

- Parallel programs
- Multi-programmed machines

Multiprocessors: The SIMD Model

  •  SISD: Single Instruction stream, Single Data stream

- Uniprocessor
- This is the view at the ISA level
- Tomasulo uncovers data-stream parallelism

•  SIMD: Single Instruction stream, Multiple Data streams

- ISA makes data parallelism explicit
- Special SIMD instructions
- The same instruction goes to multiple functional units, but acts on different data

SIMD Drawbacks


•  SIMD is useful for loop-level parallelism
•  The model is too inflexible to accommodate parallel programs as well as multiprogrammed environments
•  Cannot take advantage of uniprocessor performance growth
•  SIMD architecture is usually used in special-purpose designs

- Signal or image processing

Multiprocessors: The MIMD Model

  •  MIMD: Multiple Instruction streams, Multiple Data streams

- Each processor fetches its own instructions and data

•  Advantages:

- Flexibility: parallel programs, or a multiprogrammed OS, or both
- Built using off-the-shelf uniprocessors

MIMD: The Centralized Shared-Memory Model

  

•  A single bus connects a shared memory to all processors
•  Also called a Uniform Memory Access (UMA) machine
•  Disadvantage: cannot scale very well, especially with fast processors (more memory bandwidth required)

MIMD: Physically Distributed Memory


•  Independent memory for each processor, with a high-bandwidth interconnection
•  Adv: cost-effective memory bandwidth scaling
•  Adv: lesser latency for local access
•  Disadv: communication of data between nodes

Communication Models with Physically Distributed Memory

  •  Distributed Shared Memory (DSM)

- Memory address space is the same across nodes
- Also called scalable shared memory
- Also called NUMA: non-uniform memory access
- Communication is implicit, via load/store

•  Multicomputer, or Message Passing Machine

- Separate private address spaces for each node
- Communication is explicit, through messages
- Synchronous, or asynchronous
- Std. Message Passing Interface (MPI) possible

  Multiprocessing: Classification


DSM vs. Message Passing  

Shared Memory:

- Well-understood mechanisms for programming
- Program is independent of the communication pattern
- Low overhead for communicating small items
- Hardware-controlled caching

Message Passing:

- Hardware simplicity
- Communication is explicit - forces the programmer to pay attention to what is expensive

Achieving the Desired Communication Model

•  Message Passing on top of Shared Memory:

- Considerably easier
- Difficulty arises in dealing with arbitrary message lengths

•  Shared Memory on top of Message Passing:

- Harder, since every load/store has to be faked
- Every memory reference may involve the OS
- One promising direction: use of VM to share objects at page level: shared VM

Challenges in Parallel Processing  

•  Limited parallelism available in programs

- 90% parallelizable ==> max speedup possible?
- Exception: super-linear speedup

•  Increased memory/cache available
•  Usually not very great, however

•  Large latency of communication

- 50-10000 clock cycles
- 0.5% of instructions access remote memory ==> what is the increase in CPI?
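- E.g., by Amdahl's law, 90% parallelizable ==> speedup <= 1 / (0.1 + 0.9/p), which stays below 10 for any number of processors p
- And assuming an illustrative remote-access latency of 200 cycles (from the 50-10000 range above), 0.5% remote accesses add 0.005 x 200 = 1.0 to the CPI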

Addressing the Challenges

•  Limited parallelism

- Tackled mainly by redesigning the algorithm or software

•  Avoiding large latency


- Hardware mechanism: caching
- Software mechanism: restructure to make more accesses local

Some Example Applications

  •  Two classes

- Parallel programs or program kernels
- Multi-programmed OS

•  Spatial and temporal data access patterns are important
•  Computation-to-communication ratio is important

Parallel Application Kernels

•  The FFT kernel

- Used in spectral methods
- Data represented as an array
- Computation involves:

•  1D FFT on each row
•  Transpose
•  1D FFT on each row again

- Each processor gets a few rows of data
- The main communication step is the transpose (all-to-all communication)

•  The LU kernel

- LU factorization of a matrix
- Blocking is used
- Computation (dense matrix multiply) is performed by the processor which owns the destination block
- Communication happens at regular intervals

  Parallel Applications

  •  Barnes application

- N-body problem
- Octree representation
- Each processor is allocated a subtree
- Tree expansion as required (communication in this process)

•  Ocean application

- Influence of eddy and boundary currents on ocean flows
- Involves solving PDEs
- Ocean divided into a hierarchy of grids (finer grid for more accuracy)
- Each processor gets a set of grids
- Communication to exchange boundary conditions, at each step of the process

Computation to Communication Ratios  

Application   Computation scaling   Communication scaling   Computation-to-communication
FFT           (n log n)/p           n/p                     log n
LU            n/p                   sqrt(n/p)               sqrt(n/p)
Barnes        (n log n)/p           (log n) x sqrt(n/p)     sqrt(n/p)
Ocean         n/p                   sqrt(n/p)               sqrt(n/p)

  Multiprogrammed OS workload

  •  Workload used here is:

- Two independent copies of the compilation of the Andrew benchmark
- Three steps:

•  Compilation: compute-intensive
•  Installing object files in a library: I/O-intensive
•  Removing the object files: I/O-intensive

Cache Coherence  

•  In what kind of multi-processors do we need cache coherence?

•  What are the kinds of data which are cached?

- Shared (read) data - replication
- Private data - migration

  Notions of Coherence and Consistency

•  Coherence:


- Program order preservation within a processor
- A write by P1 and a read by P2, if "sufficiently separated", should return the value of the write
- Write serialization: the same order of writes is seen by all processors

•  Specifying when a read should get the value of a write: memory consistency model

  Styles of Coherence Protocols  

•  Directory-based: central directory maintains the "status" of each block

•  Snooping-based:

- In a centralized shared-memory machine
- Each processor snoops on the common bus
- Also maintains the "status" of a block locally (no central directory)
- Snooping helps maintain coherence

Styles of Snooping Protocols

•  Write-invalidate: processor makes sure that it has the only copy of a block before writing

- Invalidates other copies by sending an invalidate command on the bus

•  Write-update or write-broadcast:

processor updates all copies of a block when it writes

- Send the written data on the common bus

Write-invalidate vs. Write-update

Write-invalidate:

- Consecutive writes to a location do not cause repeated traffic on the bus
- Writes to consecutive locations (in the same block) do not cause extra traffic on the bus

Write-update:

- Writes become visible to readers with lesser latency

Snooping-Based Protocols

•  Applicable for write-through as well as write-back caches
•  Optimizations:

- Shared/Exclusive bit
- Write-miss and invalidate messages
- Shared-but-single versus truly shared blocks
- Maintaining separate tags


•  Dirty bit required?

Towards Directory-Based Protocols  

•  What properties of the multi-processor does a snooping-based protocol use?

- A broadcast-based bus, all-to-all communication

•  What if this is not possible?

- Communication goes through a directory
- The directory is logically shared across processors

•  Physically centralized or distributed

- Directory entry per memory block
- Possible states: uncached, shared, exclusive

•  Use bit-vectors for storing these

Synchronization  

•  Required since communication is through shared memory
•  Synchronization primitives:

- Involve an atomic read-and-write of a memory location
- Atomic exchange (with a register)
- Test-and-set
- Fetch-and-increment

Synchronization and Coherence

•  Atomic read-and-write causes problems with coherence

- Additional complexity

•  Solution: push the complexity to software!

- A pair of instructions
- Hardware support to tell if the two were executed atomically
- Load-linked and store-conditional
- The store fails if there is any intervening process switch, or coherence control operation

Load-Linked/Store-Conditional  


•  Can implement atomic exchange, fetch-and-increment
•  Implementation issues:

- Use a link register to store the address in the previous load-linked instruction
- A process switch, or a coherence control operation, will clear the link register
- The store-conditional succeeds iff the address in it matches that in the link register

•  Beware of what (and how many) instructions go between the pair

Using Atomic Exchange for Spin-Locks

•  Processor spins until it gets access to the lock
•  Useful to test if the lock is already held, before trying to lock (see the sketch below)
•  Even then, performance problems arise when multiple processors are trying to grab the lock

- Read/write misses generated by all processors
- Misses satisfied sequentially
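A test-and-test-and-set spin lock sketched with C11 atomics (the names are illustrative assumptions; spinning on a plain read first keeps waiting processors hitting in their own caches, addressing the miss traffic noted above):

    #include <stdatomic.h>

    atomic_int lock = 0;    /* 0 = free, 1 = held */

    void acquire(void) {
        for (;;) {
            while (atomic_load(&lock) != 0)
                ;                                /* spin locally, read-only */
            if (atomic_exchange(&lock, 1) == 0)  /* atomic exchange grabs it */
                return;
        }
    }

    void release(void) {
        atomic_store(&lock, 0);
    }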

Barrier Locks  

•  Barrier is a synchronization primitive

- Can be used in programs
- Forces all processors to wait until the last one reaches the barrier

•  Can be implemented with two spin-locks

- One to increment a counter
- One to hold the processors until the barrier

•  Can cause deadlock!

- Use a count-down, or sense-reversing, barrier (a sketch follows below)
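A minimal C11 sketch of a sense-reversing barrier (P and all variable names are assumptions; the flipped sense is what lets the barrier be reused without the deadlock of a naive counter barrier):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P 8                          /* number of processors (assumed) */

    atomic_int count = P;
    atomic_bool sense = false;
    _Thread_local bool local_sense = false;

    void barrier(void) {
        local_sense = !local_sense;             /* flip my expected sense */
        if (atomic_fetch_sub(&count, 1) == 1) { /* last one to arrive */
            atomic_store(&count, P);            /* reset for the next episode */
            atomic_store(&sense, local_sense);  /* release everyone */
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                               /* spin until released */
        }
    }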

Performance Optimizations  

•  Exponential back-off
•  Queuing locks

Sequential Consistency  

•  Sequential consistency: result of execution same as if:

- Accesses executed by a processor are in order
- Accesses among different processors are interleaved


•  That is, there exists some interleaving which will lead to the same result on a uniprocessor

- Or, a multi-processor with no caches, and no write-buffers, and only a single centralized memory

Implementing Sequential Consistency  

•  Need to guarantee that a write/read completes before any other access (by the same processor)
•  Write completes == all invalidations have reached
•  This implies that write buffers cannot be used (writes cannot be delayed in general)

Synchronized Programs  

•  Programs which protect access to shared locations through synchronization operations
•  More formally:

- In every possible execution, for every shared datum,
- a write by one processor, and an access (read/write) by another processor,
- are separated by a synchronization operation

•  That is, the program is data-race-free
•  Observation: most programs are synchronized

Sequential Consistency and Synchronized Programs  

•  Sequential consistency guarantees uniprocessor-like behaviour for any program

- True for synchronized programs too

•  But sequential consistency is not necessary for uniprocessor-like behaviour of synchronized programs

•  Define looser consistency models

- Can be implemented more efficiently than sequential consistency

Memory Access Orderings

  •  Four possibilities:

- R --> R, R --> W, W --> W, W --> R

•  Sequential consistency guarantees all four orderings are preserved (in each processor)
•  Define synchronization operation S


- Synchronization acquire: Sa, release: Sr

•  We only need to preserve:

- W --> Sr, R --> Sr
- Sa --> W, Sa --> R
- S --> S

Relaxed Consistency Models  

•  Total Store Order (TSO), or Processor Consistency: relax W-->R
•  Partial Store Order: relax W-->W also
•  Weak Ordering: relax R-->R, R-->W also
•  Release consistency: relax Sr-->W, Sr-->R, W-->Sa, R-->Sa also

Interconnection Networks  

•  Networks at three levels:

- Massively Parallel Processor (MPP) network: within about 25m max
- LAN: within about a few km max
- WAN: larger

•  Latency is higher in a WAN
•  The cost of redundancy is higher in a WAN

Switching versus Routing  

•  Switching: set up switches between source and destination
•  Routing: treat each packet individually
•  Switching:

- Switches separate from processors
- Switches associated with processors

•  Wormhole routing and cut-through routing are other possibilities

MPP Network Topology Design  

•  Design criteria:

- Minimum cost, bisection bandwidth, link/node fault tolerance

•  Topologies with switches separate from nodes: cross-bar, omega network
•  Topologies with switches as part of nodes: ring, 2-D torus, n-D hypercube


IBM's Blue Gene...

Input/Output  

•  We will mostly talk about storage systems
•  I/O performance is important!
•  Magnetic disks

- Platter, head, track, sector, cylinder; seek time, rotational delay, transfer time

•  Disk controller, controller delay
•  Queuing delay

Storage Technologies  

•  Magnetic disks, and the access-time gap
•  Solid-state disks, expanded storage using DRAMs
•  Optical storage (read-only)
•  Magnetic tapes

- Same technology as disks
- Difference in geometry ==> cheaper but slower

Buses for Communication  

•  Between CPU/Memory, and with I/O devices
•  Advantages:

- Low cost
- Flexible/versatile

•  Disadvantage:

- Communication bottleneck

•  Bandwidth is limited due to bus length and the number of devices

  Bus Design Choices  

•  CPU/Memory buses vs. I/O buses
•  Design choices in general:

- Bus width
- Data width
- Transfer size
- Number of masters
- Split transaction
- Synchronous vs. asynchronous

Other Design Choices

•  Connecting I/O to memory or cache
•  Memory-mapped vs. dedicated I/O instructions
•  Polling vs. interrupt-driven
•  Direct Memory Access (DMA)

- I/O processors for more intelligence

I/O Performance

•  Producer-Server Model
•  Throughput vs. Response Time
•  Response time and think time
•  Queuing theory

- Arrival rate, service time, utilization
- Little's law
- Squared coefficient of variance
- Average residual time
- Response time and utilization
- M/G/1 and M/M/1 models
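Two of these results, stated for reference (standard queuing-theory formulas, not derived in the slides):

    Little's law:  \bar{N} = \lambda \cdot \bar{T}   (mean number in system = arrival rate x mean time in system)

    M/M/1:  \text{Response time} = \frac{\text{Service time}}{1 - \text{Utilization}}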


UNIX's Old File System  

•  Superblock
•  Free-list
•  Directory: a special file - has a pointer to the file's inode
•  Inodes have:

- Direct pointers, singly indirect pointers, doubly indirect, and triply indirect pointers

•  Problem: a file's blocks get distributed all over the disk, deteriorating performance
•  Also, block size: 512 bytes (poor performance)

UNIX Fast File System (FFS)  

•  Cylinder groups are defined
•  Inodes are close to data blocks
•  Block size: 4096 bytes

- But, poor disk usage (close to 50% wasted)

•  Idea: fragment blocks

- But only the last block of a file is allowed to be fragmented

•  All files of a directory are preferably in the same cylinder group
•  Other enhancements: long file names, file locking, symbolic links, rename, quota

Log-Structured File System  

•  Technological under-pinnings:

- Disk I/O is becoming the bottleneck since CPUs are getting faster
- Disk I/O is dominated by writes, since reads are mostly served by main-memory caching

•  Characteristics of application workloads:

- Lots of accesses to small files
- Random disk I/Os
- Synchronous meta-data updates in FFS ==> slow
- FFS could use only about 5% of the disk bandwidth

The Log as the Structure  

•  Large asynchronous writes (0.5-1MB) to the end of the log
•  How to retrieve information from the log?

- Sequential search would be too slow

•  The i-node structure is the same as in FFS
•  Getting to an i-node given the i-node number uses the i-node map (a level of indirection)
•  The i-node map is small enough to be kept in memory

Free Space Management  


•  What if log fills up disk?

- Threading vs. copying

•  Intermediate solution: segments

- Thread across segments
- Copy within segments

•  Segment cleaning: copy live data out of segments, to create free segments

- Segments with long-lived data ==> can be ignored while cleaning

Segment Cleaning  

•  Read a set of segments
•  Copy live data to new segments, creating free segments
•  Need to identify:

- Which blocks are live
- Which block belongs to which file
- Segment summary information
- Notion of file/inode version

Segment Cleaning Policies  

•  When should the cleaning be done?

- Periodically; after threshold disk utilization

•  How many segments to clean at a time?

- Fixed; until achieving some number of clean segments

•  Which segments to clean?

- Most fragmented; having the least utilization

•  How should the blocks be grouped when writing out?

- All files in a dir in one place; age sort  

Crash Recovery

•  Checkpoint


- Checkpoint region is fixed!

•  What to checkpoint?

- I-node map blocks, segment usage table, pointer to the last segment written

•  Roll-forward

- Read from the last segment onwards
- Update the i-node map, segment usage table
- Directory operation log, for consistency between directory entries and i-nodes

  RAID  

•  RAID-1: Mirroring
•  RAID-2: Hamming-code ECC
•  RAID-3: Bit-level parity
•  RAID-4: Block-level parity
•  RAID-5: Block-level distributed parity

Why Vector Processing  

•  Deep pipeline ==> more parallelism

- But more dependences
- Need to fetch and issue many instructions (Flynn bottleneck)

•  Same issues with multiple-issue processors
•  Operations on vectors:

- No data dependences
- No control hazards
- Single instn. ==> instn. bandwidth reduced
- Well-defined memory access pattern

Basic Architecture  

•  Vector-register processors vs. memory-memory vector processors
•  DLXV: vector extn. of DLX (vector-register)
•  Components:

- Vector registers (V0..V7), 64-element
- Vector functional units:

•  ADD/SUB, MUL, DIV, Integer, Logical
•  Each is pipelined, and can start a new opn. every cycle

- Vector load/store unit: also pipelined
- Scalar registers and scalar unit (like in DLX)

Some Vector Instructions  

•  ADDV V1, V2, V3
•  ADDSV V1, F0, V2
•  SUBV V1, V2, V3
•  SUBVS V1, V2, F0
•  SUBSV V1, F0, V2
•  Similar for MUL and DIV
•  LV V1, R1
•  SV R1, V1

SAXPY/DAXPY Loop  

•  Y = aX + Y (caps ==> vector)

Scalar DLX code:                 Vector DLXV code:

      LD    F0, a                      LD     F0, a
      ADDI  R4, Rx, 512                LV     V1, Rx
Loop: LD    F2, 0(Rx)                  MULTSV V2, F0, V1
      MULTD F2, F0, F2                 LV     V3, Ry
      LD    F4, 0(Ry)                  ADDV   V4, V2, V3
      ADDD  F4, F2, F4                 SV     Ry, V4
      SD    0(Ry), F4
      ADDI  Rx, Rx, 8                  Reduction in instn. bandwidth
      ADDI  Ry, Ry, 8                  Lesser pipeline interlocks
      SUB   R20, R4, Rx
      BNEZ  R20, Loop

Estimating Execution Time  

•  Convoy: a set of vector instructions which can begin execution in the same cycle

- Check for structural, data hazards

•  For simplicity: a convoy must complete before initiating the next convoy
•  Chime: the time taken to execute one vector opn.
•  Approximations:

- Only one instn. can be initiated per cycle
- Pipeline setup latency

Adding Flexibility  


•  Vector-length register (VLR), maximum vector length (MVL)

- MOVI2S VLR, R1
- MOVS2I R1, VLR

•  Vector longer than MVL ==> use strip-mining (a sketch follows below)
•  Vector stride:

- LVWS V1, (R1, R2)
- SVWS (R1, R2), V1

•  Memory-bank conflicts?
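A C sketch of strip-mining for the SAXPY loop (MVL = 64 as above; the first strip handles the n mod MVL remainder, and each inner loop stands for one vector operation executed at the current VLR setting):

    #define MVL 64   /* maximum vector length of the machine */

    void saxpy(long n, double a, double x[], double y[]) {
        long low = 0;
        long vl = n % MVL;                        /* odd-sized first strip */
        for (long j = 0; j <= n / MVL; j++) {
            for (long i = low; i < low + vl; i++) /* one vector instruction's worth */
                y[i] = a * x[i] + y[i];
            low += vl;
            vl = MVL;                             /* all later strips are full length */
        }
    }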

Enhancing Vector Performance

•  Chaining: data forwarding
•  Conditional execution:

- Vector Mask Register
- Some related instructions:

•  SNEV V1, V2
•  SGTSV F0, V1
•  CVM

•  Sparse matrices: scatter-gather

- LVI V1, (R1+V2)
- SVI (R1+V2), V1

 Key Take-Away Ideas

•  Quantitative approach to design
•  Amdahl's law
•  Design to match the technology trend
•  Interface design
•  Pipelining; non-uniformity is bad
•  Golden rule: preserve the programmer's view
•  Complexity in hardware vs. software
•  Caching: common across computer systems
•  Caching + VM: the notion of infinite resources
•  Faster reads, postpone writes
•  Where to place what mapping?
•  Multiprocessing: affecting the programmer's view
•  Consistency models
•  CAP principle


Lessons I Have Learnt  

•  Lessons on teaching:

- Teaching is quite different from learning
- Good to teach outside one's topic of research

•  On tools for teaching:

- OpenOffice rocks!
- Teaching on the board is still better for some topics

•  On student evaluations:

- Travel time is good for setting papers
- Group assignments are better

 

 

 

 

 
