
Computer Architecture

•  "Architecture"

- The art and science of designing and constructing buildings
- A style and method of design and construction
- Design, the way components fit together

•  Computer Architecture

- The overall design or structure of a computer system, including the hardware and the software required to run it, especially the internal structure of the microprocessor

Prerequisites

•  Computer organization

- Digital logic
- Memory chips, number representation
- Computer arithmetic, adders, ripple-carry...
- I/O organization
- Peripherals
- Pipelining, RISC

Course Contents

•  Performance and CPI, benchmarks, Amdahl's law
•  Pipelining, hazards
•  Instruction Level Parallelism: Scoreboarding, Tomasulo's algorithm
•  Dynamic branch prediction, VLIW, software pipelining
•  Cache and memory systems
•  I/O systems, RAID, benchmarks
•  Multiprocessors, cache consistency protocols
•  Processor networks
•  Vector processors

Course References 

•  "Computer Architecture: A Quantitative Approach" ,    2 nd edition, David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers•  CS252, Graduate Computer Architecture, U.C.Berkeley

Computer Architecture

•  Design aspects:


- Instruction set
- Cache and memory hierarchy
- I/O, storage, disk
- Multi-processors, networked systems

•  Criteria: performance, cost, end-applications, complexity

Technology Trends

•  Since the 1970s: microprocessor-based
•  Several PCs/workstations put together can buy more cycles for the same cost

- The Berkeley NOW project

•  Transistor density: 50% per year
•  DRAM density: 60% per year
•  Magnetic disk density: 50% per year
•  Software:

- More memory usage
- High-level language

•  Growth rate in CPU speed: 50% per year

- Architectural ideas: pipelining, caching, out of order execution, sophisticated compilers

•  Trends are important:

- Product cycle is 4 years!
- Also beware of technology thresholds

Cost Trends

•  Cost depends on various factors:

- Time, volume, competition

•  Cost of IC:

- Cost of die + Testing + Packaging

•  Cost of die: Wafer-cost / Dies-per-wafer
•  Yield is an important factor
•  Cost proportional to Die-area^4
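A rough C sketch of these cost relationships (the wafer cost, wafer area, and yield figures below are assumptions for illustration, not numbers from the course):

#include <stdio.h>

/* Hypothetical wafer parameters, for illustration only. */
#define WAFER_COST 2000.0   /* dollars per wafer (assumed) */
#define WAFER_AREA 70000.0  /* mm^2 (assumed)              */

/* Cost of die = wafer cost / (dies per wafer x yield). Since
   dies-per-wafer ~ 1/area and yield falls steeply with area,
   die cost grows roughly as die-area^4, as stated above. */
double die_cost(double die_area_mm2, double yield)
{
    double dies_per_wafer = WAFER_AREA / die_area_mm2; /* ignores edge loss */
    return WAFER_COST / (dies_per_wafer * yield);
}

int main(void)
{
    /* Doubling the die area with a correspondingly lower yield makes
       each die far more than twice as expensive. */
    printf("100 mm^2, 50%% yield: $%.2f\n", die_cost(100.0, 0.50));
    printf("200 mm^2, 20%% yield: $%.2f\n", die_cost(200.0, 0.20));
    return 0;
}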

Upcoming Topics


•  Performance metrics, CPI
•  Amdahl's law

Performance Comparison 

•  What performance metric to use?

- User cares about response time
- Performance is inversely proportional to execution time

•  What is execution time?

- Response time
- CPU time: User time + System time

•  System performance vs. CPU performance

- Throughput vs. response-time

•  We will focus on CPU performance

Which Program's Execution Time?

•  Real "workload" is ideal•  Practical options:

- Real programs: compilers, office-suite, scientific...- Kernels: key pieces of programs

•  Example: Livermore loops

- Toy benchmarks: small programs

•  Examples: Quick-sort, tower of Hanoi...

- Synthetic benchmarks: try to capture "average" frequency of instructions in real programs

•  Example: Whetstone, Dhrystone

More on Performance Comparisons... 

•  Caveat of benchmarks

- They are needed
- But manufacturers tend to optimize for benchmarks
- Need to be updated periodically


•  Benchmark suite: collection of programs

- E.g. SPEC92

•  Reporting performance

- Reproducibility: program version, compiler, flags
- SPEC specifies compiler flags for baseline comparison

Some Numerics...

                    Computer A   Computer B   Computer C
Program P1 (secs)        1           10           20
Program P2 (secs)     1000          100           20
Total (secs)          1001          110           40

- Total (or average) execution time is a possible metric
- Weighted execution time is better

Normalizing the Performance

             Normalized to A        Normalized to B        Normalized to C
              A     B      C         A     B     C          A      B     C
P1            1    10     20        0.1    1     2         0.05   0.5    1
P2            1    0.1    0.02      10     1     0.2       50     5      1
Arith. mean   1    5.05   10.01     5.05   1     1.1       25.03  2.75   1

•  Normalize such that all programs take the same time, on some machine
•  Arithmetic mean predicts performance
•  Geometric mean?
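To see why the "Geometric mean?" question matters, the table can be recomputed in a few lines of C, using the execution times from the earlier table:

#include <math.h>
#include <stdio.h>

/* Execution times (secs) of P1 and P2 on computers A, B, C. */
static const double t[3][2] = {
    {1.0, 1000.0},  /* A */
    {10.0, 100.0},  /* B */
    {20.0, 20.0},   /* C */
};

int main(void)
{
    const char *name = "ABC";
    for (int ref = 0; ref < 3; ref++) {       /* machine normalized to */
        for (int m = 0; m < 3; m++) {         /* machine being rated   */
            double r1 = t[m][0] / t[ref][0];
            double r2 = t[m][1] / t[ref][1];
            printf("norm to %c, machine %c: AM = %6.2f, GM = %6.2f\n",
                   name[ref], name[m], (r1 + r2) / 2.0, sqrt(r1 * r2));
        }
    }
    /* The arithmetic-mean ranking changes with the reference machine;
       the geometric-mean ranking does not. */
    return 0;
}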

Summary

•  Performance inversely proportional to execution-time

- We are concerned with CPU time of unloaded machine

•  Weighted execution time with weights from a real workload is ideal
•  Else, normalize w.r.t. one machine

Amdahl's Law 


•  Amdahl's law:

- Diminishing returns
- Limit on overall speedup

•  Corollary: make the common case fast

Next Lecture 

•  CPI as a measure of performance
•  Illustration of Amdahl's law

Amdahl's Law

•  Amdahl's law: if a fraction F of the execution time is sped up by a factor S,

Overall speedup = 1 / ((1 - F) + F/S)

- Diminishing returns
- Limit on overall speedup (at most 1/(1 - F))

•  Corollary: make the common case fast

Illustrating Amdahl's Law 

•  Example: implement cache, or faster ALU?

- Cache improves performance by 10x
- ALU improves performance by 3x

•  Depends on fraction of instructions

- Suppose F_mem = 0.2, F_alu = 0.5, F_other = 0.3

Speedup with cache = 1 / ((1 - 0.2) + 0.2/10) ≈ 1.22


Speedup with faster ALU = 1 / ((1 - 0.5) + 0.5/3) = 1.5

Example continued... 

•  Fixing F_alu = 0.5, for what value of F_mem is adding a cache better? (See the sketch below.)
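A quick way to answer this is to evaluate Amdahl's law directly; this minimal C sketch just codes the formula from the earlier slide and sweeps F_mem:

#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f is sped up by s. */
static double speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    printf("cache (F_mem = 0.2, 10x): %.3f\n", speedup(0.2, 10.0)); /* ~1.22 */
    printf("ALU   (F_alu = 0.5,  3x): %.3f\n", speedup(0.5, 3.0));  /*  1.50 */

    /* Sweep F_mem to find where the cache starts to win over the ALU. */
    for (double f = 0.0; f <= 1.0; f += 0.01)
        if (speedup(f, 10.0) > speedup(0.5, 3.0)) {
            printf("cache wins once F_mem > ~%.2f\n", f);  /* just over 0.37 */
            break;
        }
    return 0;
}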

The CPU Performance Equation 

CPU time = Num. clock cycles x Clock cycle time

or

CPU time = Num. clock cycles / Clock rate

For a program:

Num. clock cycles = Instruction Count x Cycles Per Instruction = IC x CPI

Putting these together:

CPU time = IC x CPI x Cycle time
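The equation drops straight into code. The numbers below are hypothetical, purely to show the units working out:

#include <stdio.h>

/* CPU performance equation: CPU time = IC x CPI x cycle time. */
double cpu_time_secs(double ic, double cpi, double cycle_ns)
{
    return ic * cpi * cycle_ns * 1e-9;
}

int main(void)
{
    /* Assumed: a 10^9-instruction program, CPI 2.0, 2 ns clock (500 MHz). */
    printf("CPU time = %.2f s\n", cpu_time_secs(1e9, 2.0, 2.0)); /* 4.00 s */
    return 0;
}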

More on the Equation 

•  This form is convenient

- Involves many relevant parameters

•  Remembering is easy


•  With CPI as the independent variable

Other Convenient Forms of the Equation

•  Number of clock cycles can be counted per instruction class: Num. clock cycles = sum over classes i of (IC_i x CPI_i)

•  Calculating in terms of instruction frequencies: CPI = sum over classes i of (F_i x CPI_i), where F_i = IC_i / IC

Usefulness of the Equation 

•  The per-class counts IC_i are easier to measure than total clock cycles

- Equivalently, overall CPI is measured through the class frequencies F_i and per-class CPI_i

•  Equation includes relevant parameters such as the cycle time 

Announcements 

•  Course web-page is up

   http://web.cse.iitk.ac.in/~cs422/index.html

•  Lecture scribe notes:

- HTML please
- Lec-notesXY-1.html or lec-notesXY-2.html
- Images in directory "images/"


•  lecXY-1-anything.ext or lecXY-2-anything.ext

- Please email to one of the TAs

Instruction Set

   • Interface design

           - Central part of any system design

           - Allows abstraction/independence

           - Challenges:

•  Should be easy to use by the layer above
•  Should allow efficient implementation by the layer below

Instruction Set Architecture (ISA) 

•  Main focus of early designs (1970s, 1980s)
•  Mutual dependence between ISA design and:

- Machine organization (example: caches)
- Higher-level languages and compilers (what instructions do they want?)
- Operating systems

•  Example: atomic instructions, paging...

The Design Space


      Other design choices: determining branch conditions, instruction encoding 

Classes of ISAs

 

GPR Advantages 

•  Registers faster than memory
•  Code density improves
•  Easier for compiler to use

- Hold variables
- Expression evaluation
- Passing arguments


Spectrum of GPR Choices

•  Choices based on

- How many memory operands allowed
- How many total operands

Number of memory addresses   Max. operands allowed   Examples
            0                          3             SPARC, MIPS, PowerPC
            1                          2             80x86, Motorola
            2                          2             VAX

Memory Addressing

•  Little-Endian versus Big-Endian

•  Aligned versus nonaligned access of memory units > 1 byte

- Misaligned ==> more memory cycles for the access

Addressing Modes

Addressing mode                Example              Meaning
Immediate                      Add R4, #3           R4 <-- R4 + 3
Register                       Add R4, R3           R4 <-- R4 + R3
Direct or absolute             Add R1, (1001)       R1 <-- R1 + M[1001]
Register deferred or indirect  Add R4, (R1)         R4 <-- R4 + M[R1]
Displacement                   Add R4, 100(R1)      R4 <-- R4 + M[100 + R1]
Indexed                        Add R3, (R1 + R2)    R3 <-- R3 + M[R1 + R2]
Auto-increment                 Add R1, (R2)+        R1 <-- R1 + M[R2]; R2 <-- R2 + d
Auto-decrement                 Add R1, -(R2)        R2 <-- R2 - d; R1 <-- R1 + M[R2]
Scaled                         Add R1, 100(R2)[R3]  R1 <-- R1 + M[100 + R2 + R3*d]


Usage of Addressing Modes

 

How many Bits for Displacement?

How many Bits for Immediate?


Type and Size of Operands

Summary so far 

•  GPR is better than stack/accumulator
•  Immediate and displacement are the most used memory addressing modes
•  Number of bits for displacement: 12-16 bits
•  Number of bits for immediate: 8-16 bits
•  Next: what operations in the instruction set?


Deciding the Set of Operations 

80x86 instruction    Integer average
Load                 22%
Conditional branch   20%
Compare              16%
Store                12%
Add                   8%
And                   6%
Sub                   5%
Move reg-reg          4%
Call                  1%

Simple instructions are used most!

Instructions for Control Flow

Design Issues for Control Flow Instructions

•  PC-relative addressing

- Useful since most jumps/branches are nearby
- Gives position independence (dynamic linking)

•  Register indirect jumps

- Useful for many programming language features
- Case statements, virtual functions, dynamic libraries


•  How many bits for PC displacement?

- 8-10 bits are enough

What is the Nature of Compares?

Compare and Branch: Single Instruction or Two? 

•  Condition Code: set by ALU

- Advantage: simple, may be free
- Disadvantage: extra state across instructions

•  Condition register: test any register with result of comparison

- Advantage: simple
- Disadvantage: uses up a register

•  Compare and branch:

- Advantage: fewer instructions
- Disadvantage: too much work in an instruction

Managing Register State during Call/Return 

•  Caller save, or callee save?

- Combination of the two is possible


•  Beware of global variables in registers!

Instruction Encoding Issues 

•  Need to encode: operation, and addressing mode of each operand

- Opcode is used for encoding the operation
- Simple set of addressing modes ==> can encode the addressing mode also in the opcode
- Else, need an address specifier per operand!

•  Challenges in encoding:

- Many registers and addressing modes
- But, also minimize average instruction size
- Encoding should be easy to handle in implementation (e.g. multiple of bytes)

Styles of Encoding

Fixed (e.g. DLX, MIPS, PowerPC):

Opcode | Address-1 | Address-2 | Address-3

Variable (e.g. VAX):

Opcode, #operands | Addr. Spec-1 | Address-1 | Addr. Spec-2 | Address-2 | ...

Fixed: (+) ease of decoding, (-) more instructions
Variable: (+) fewer instructions, (-) variance in amount of work per instruction (example: Intel 80x86)
Hybrid approach: variable in size, but provide multiple fixed encoding lengths, to reduce the variance

The Role of the Compiler

•  Compilers are central to ISA design

DLX  

•  DLX pronounced "Deluxe"
•  Has the features of many recent experimental and commercial machines


•  (AMD 29K + DECstation 3100 + HP 850 + IBM 801 + Intel i860 + MIPS M/120A + MIPS M/1000 + Motorola 88K + RISC I + SGI 4D/60 + SPARCstation-1 + Sun-4/110 + Sun-4/260) / 13 = 560 = DLX (in Roman numerals)
•  Good architectural features (e.g. simplicity), easy to understand

DLX Architecture: Registers and Data Types

•  Has 32 32-bit GPRs: R0...R31
•  Also, FP registers:

- 32 single precision: F0...F31
- Or, 16 double precision: F0, F2, ..., F30

•  Value of R0 is always ZERO!
•  Data types:

- Integer: bytes, half-words, words
- FP: single/double precision

DLX Memory Addressing 

•  Uses 32-bit, big-endian mode
•  Addressing modes:

- Only immediate and displacement, with 16-bit fields

•  Register deferred?

- Place zero in the displacement field

•  Absolute?

- Use R0 for the register

DLX Instruction Format 

I-type instruction: loads, stores, all immediates, conditional branch, jump register, jump and link register

Opcode (6) | RS1 (5) | RD (5) | Immediate (16)

R-type instruction: register-register ALU operations

Opcode (6) | RS1 (5) | RS2 (5) | RD (5) | Func (11)

J-type instruction: jump, jump and link, trap and return

Opcode (6) | Offset relative to PC (26)
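Since the fields sit at fixed positions, decoding is just shifts and masks. A minimal C sketch of I-type field extraction; the exact bit ordering (opcode in the top 6 bits) is an assumption of this sketch, only the field widths come from the slide:

#include <stdint.h>
#include <stdio.h>

/* Extract bits hi..lo (inclusive) of a 32-bit instruction word. */
static uint32_t bits(uint32_t w, int hi, int lo)
{
    return (w >> lo) & ((1u << (hi - lo + 1)) - 1);
}

int main(void)
{
    uint32_t instr = 0x8C220004u;  /* made-up example word */

    uint32_t opcode = bits(instr, 31, 26);          /* 6 bits           */
    uint32_t rs1    = bits(instr, 25, 21);          /* 5 bits           */
    uint32_t rd     = bits(instr, 20, 16);          /* 5 bits (I-type)  */
    int32_t  imm    = (int16_t)bits(instr, 15, 0);  /* sign-extended 16 */

    printf("opcode=%u rs1=%u rd=%u imm=%d\n", opcode, rs1, rd, imm);
    return 0;
}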

DLX Operations 

•  Four classes: load/store, ALU, branch, FP
•  ALU instructions are register-register
•  R0 used to synthesize some operations:

- Examples: loading a constant, reg-reg move

•  Compares "set" a register
•  Jump and link places the next PC in R31
•  FP operations in single/double precision
•  FP compares set a bit in a special status register
•  FP unit also used for integer multiply/divide!

DLX Performance: MIPS vs VAX

Pipelining 

•  It's natural!
•  Laundry example... (Randy Katz's slides)
•  DLX has a simple architecture

- Easy to pipeline

•  Pipelining speedup:

- Can be viewed as reduction in CPI
- Or, reduction in clock cycle


•  Defining clock cycle as the amount of time between two successive instruction     completions 

A Simple DLX Implementation 

•  Instruction Fetch (IF) cycle:

- IR <-- M[PC]
- NPC <-- PC + 4

•  Instruction Decode (ID) cycle:

- Done in parallel with register read (fixed-field decode)
- Register/Immediate read:

•  A <-- R[IR6..10]
•  B <-- R[IR11..15]
•  Imm <-- sign-extend(IR16..31)

•  Execution/effective address (EX) cycle:

- Memory reference:

•  ALUOutput <-- A + Imm

- Register-register ALU instruction:

•  ALUOutput <-- A func B

- Register-immediate ALU instruction:

•  ALUOutput <-- A op Imm

- Branch:

•  ALUOutput <-- NPC + Imm
•  Cond <-- A op 0 [op is one of == or !=]

•  Memory access/branch completion (MEM) cycle:

- Memory access:

•  LMD <-- M[ALUOutput]
•  Or, M[ALUOutput] <-- B

- Branch: PC = (cond)? ALUOutput : NPC


•  Write-back (WB) cycle:

- Reg-reg ALU opn: R[IR16..20] <-- ALUOutput
- Reg-imm ALU opn: R[IR11..15] <-- ALUOutput
- Load instruction: R[IR11..15] <-- LMD

The DLX Data-path

Further lectures... 

•  Pipelining this data-path
•  Pipelining issues


 

ISA Design to Help the Compiler 

•  Regularity: operations, data-types, and addressing modes should be orthogonal; no   special registers/operands for some instructions

•  Provide simple primitives: do not optimize for a particular compiler of a particular   language

•  Clear trade-offs among alternatives: how to allocate registers, when to unroll a loop...

What lies ahead...

•  The DLX architecture
•  DLX: simple data-path
•  DLX: pipelined data-path
•  Pipelining hazards, and how to handle them

 DLX Unpipelined Implementation 

•  Five cycles: IF, ID, EX, MEM, WB

- Branch and store instructions: 4 cycles only
- What is the CPI?

F_branch = 0.12, F_store = 0.05

CPI = 0.17 x 4 + 0.83 x 5 = 5 - 0.17 = 4.83

•  Further reduction in CPI (without pipelining)
•  ALU instructions can finish in 4 cycles too


F_ALU = 0.47; CPI = 4.83 - 0.47 = 4.36

Speedup = 4.83 / 4.36 = 1.1
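The same weighted-CPI arithmetic, in C, for checking:

#include <stdio.h>

int main(void)
{
    /* Branches (0.12) and stores (0.05) take 4 cycles; the rest take 5. */
    double f4 = 0.12 + 0.05;
    double cpi = f4 * 4.0 + (1.0 - f4) * 5.0;
    printf("CPI (branch/store in 4 cycles): %.2f\n", cpi);        /* 4.83 */

    /* Letting ALU instructions (0.47) finish in 4 cycles saves one
       cycle on each of them. */
    double cpi2 = cpi - 0.47;
    printf("CPI (ALU also in 4 cycles):     %.2f\n", cpi2);       /* 4.36 */
    printf("Speedup:                        %.2f\n", cpi / cpi2); /* ~1.1 */
    return 0;
}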

Some Remarks 

•  Any further reduction in CPI will likely increase cycle time
•  Some hardware redundancies can be eliminated:

- Use the ALU for the (PC+4) addition also
- Same I-cache and D-cache

•  These are minor improvements...

- An alternative single-cycle implementation:

•  Variation in amount of work ==> higher cycle time
•  Hardware unit reuse is not possible

The Basic Pipeline for DLX

       CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9
I      IF  ID  EX  MEM WB
I+1        IF  ID  EX  MEM WB
I+2            IF  ID  EX  MEM WB
I+3                IF  ID  EX  MEM WB
I+4                    IF  ID  EX  MEM WB

•  That is it?
•  Complications:

- Resource conflicts, register conflicts, branch instructions
- Exceptions, instruction set issues

The Pipelined Data-path


Some Performance Numerics...  

Unpipelined (multi-cycle) clock cycle = 10 ns

CPI_ALU = CPI_Branch = 4, CPI_Other = 5

F_ALU = 0.4, F_Branch = 0.2, F_Other = 0.4

Pipelined clock cycle = 11 ns

Speedup = (4.4 x 10 ns) / (1 x 11 ns) = 4

For a single-cycle impl.: T_IF = 10 ns, T_ID = 8 ns, T_EX = 10 ns, T_MEM = 10 ns, T_WB = 7 ns (so cycle time = 45 ns)

Speedup from the multi-cycle implementation = 45 / 44 ≈ 1.02
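In C, with the single-cycle comparison interpreted as "the clock must fit the slowest complete instruction" (my reading of the slide):

#include <stdio.h>

int main(void)
{
    /* Multi-cycle (unpipelined) implementation, 10 ns clock. */
    double cpi = 0.4 * 4 + 0.2 * 4 + 0.4 * 5;         /* 4.4            */
    double t_multi = cpi * 10.0;                      /* 44 ns / instr  */

    /* Pipelined implementation: CPI 1, 11 ns clock.  */
    double t_pipe = 1.0 * 11.0;

    /* Single-cycle implementation: cycle = sum of stage times. */
    double t_single = 10.0 + 8.0 + 10.0 + 10.0 + 7.0; /* 45 ns / instr  */

    printf("pipelined vs multi-cycle:    %.2f\n", t_multi / t_pipe);   /* 4.00 */
    printf("multi-cycle vs single-cycle: %.2f\n", t_single / t_multi); /* 1.02 */
    return 0;
}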

Pipeline Hazards 

•  Structural Hazards: resource conflict

- Example: same cache/memory for instruction and data


•  Data Hazards: same data item being accessed/written in nearby instructions

- Example:

•  ADD R1, R2, R3
•  SUB R4, R1, R5

•  Control Hazards: branch instructions 

Structural Hazards 

•  Usually happen when a unit is not fully pipelined

- That unit cannot churn out one instruction per cycle

•  Or, when a resource has not been duplicated enough

- Example: same I-cache and D-cache

- Example: single write-port for register-file

•  Usual solution: stall

- Also called pipeline bubble, or simply bubble

Stalling the Pipeline

       CC1 CC2 CC3 CC4   CC5 CC6 CC7 CC8 CC9 CC10
Load   IF  ID  EX  MEM   WB
I+1        IF  ID  EX    MEM WB
I+2            IF  ID    EX  MEM WB
I+3            stall     IF  ID  EX  MEM WB
I+4                          IF  ID  EX  MEM WB

•  What is the slowdown due to stalls caused by such load instructions?

CPI without stalls = 1

CPI with stalls = 1 + F_load

Slowdown = 1 + F_load

Why Allow Structural Hazards? 

•  Lower Cost:


- Less hardware ==> less cost

•  Shorter latency of unpipelined unit

- May have other performance benefits
- Data hazards may introduce stalls anyway!

•  Suppose the FP unit is unpipelined, and the other instructions have a 5-stage pipeline. What percentage of instructions can be FP, so that the CPI does not increase?

- 20% can be FP, assuming no clustering of FP instructions
- Even if clustered, data hazards may introduce stalls anyway

Data Hazards 

•  Example:

• ADD R1, R2, R3
• SUB R4, R1, R5
• AND R6, R1, R7
• OR  R8, R1, R9
• XOR R10, R1, R11

•  All instructions after ADD depend on R1
•  Stalling is a possibility

- Can we do better?

Register File: Reads after Writes


Minimizing Stalls via Forwarding

Data Forwarding for Stores

Data Hazard Classification 

•  Read after Write (RAW): use data forwarding to overcome

•  Write after Write (WAW): arises only when writes can happen in different pipeline stages

                CC1 CC2 CC3 CC4  CC5  CC6
LW R1, 0(R2)    IF  ID  EX  MEM1 MEM2 WB
ADD R1, R2, R3      IF  ID  EX   WB

- Has other problems as well: structural hazards

•  Write after Read (WAR): rare

                CC1 CC2 CC3 CC4  CC5  CC6
SW 0(R1), R2    IF  ID  EX  MEM1 MEM2 WB
ADD R2, R3, R4      IF  ID  EX   WB

Stalls due to Data Hazard


Avoiding such Stalls 

•  Compiler scheduling:

- Example: a = b + c ; d = e + f ;

LW  R1, b
LW  R2, c
LW  R10, e
ADD R4, R1, R2
LW  R11, f
SW  a, R4
ADD R12, R10, R11
SW  d, R12

•  Without such scheduling, what is the slowdown?

- 1 + F_(loads causing stalls)

Topics for Next Lecture 

•  Control hazards
•  Exceptions during the pipeline

- More difficult to deal with
- Cause more damage

Recall: Data Hazards 


•  Have to be detected dynamically, and the pipeline stalled if necessary
•  Instruction issue: the process of moving an instruction from the ID stage to EX
•  For DLX, all data hazards can be checked before instruction issue

- Also, control for data forwarding can be determined

- This is good since instruction is suspended before any machine state is updated

Opcode of ID/EX (ID/EX.IR0..5)   Opcode of IF/ID (IF/ID.IR0..5)          Check for interlock
Load                             Reg-reg ALU                             ID/EX.IR11..15 == IF/ID.IR6..10
Load                             Reg-reg ALU                             ID/EX.IR11..15 == IF/ID.IR11..15
Load                             Load, store, ALU immediate, or branch   ID/EX.IR11..15 == IF/ID.IR6..10
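The table translates into a small piece of comparison logic. A minimal C sketch; the struct layout and enum names are assumptions, only the register-field comparisons come from the table:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum op_class { OP_LOAD, OP_ALU_RR, OP_ALU_IMM, OP_STORE, OP_BRANCH };

/* Just the IR fields the table compares. */
struct latch { enum op_class op; uint8_t rs1, rs2, rd; };

/* True if the instruction in IF/ID must stall: the load in ID/EX is
   producing a register that it reads (the table's three cases). */
bool load_interlock(struct latch idex, struct latch ifid)
{
    if (idex.op != OP_LOAD)
        return false;
    if (ifid.op == OP_ALU_RR)                   /* reads rs1 and rs2    */
        return idex.rd == ifid.rs1 || idex.rd == ifid.rs2;
    return idex.rd == ifid.rs1;                 /* others read rs1 only */
}

int main(void)
{
    struct latch lw  = { OP_LOAD,   2, 0, 1 };  /* LW  R1, 0(R2)  */
    struct latch add = { OP_ALU_RR, 1, 5, 4 };  /* ADD R4, R1, R5 */
    printf("stall needed: %s\n", load_interlock(lw, add) ? "yes" : "no");
    return 0;
}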

Control Logic for Data-Forwarding 

•  Data forwarding always happens

- From ALU or data-memory output
- To ALU input, data-memory input, or zero-detection unit

•  Which registers to compare?

•  Compare the destination register field in EX/MEM and MEM/WB latches with the source register fields of IR in ID/EX and EX/MEM stages 

Control Hazard 

•  Result of branch instruction not known until end of MEM stage
•  Naïve solution: stall until the result of the branch instruction is known

- That an instruction is a branch is known at the end of its ID cycle
- Note: "IF" may have to be repeated

               CC1 CC2 CC3   CC4   CC5 CC6 CC7 CC8 CC9
Branch         IF  ID  EX    MEM   WB
Branch succ        IF  stall stall IF  ID  EX  MEM WB
Branch succ+1                          IF  ID  EX  MEM

Reducing the Branch Delay 


•  Three clock cycles wasted for every branch ==> significantly bad performance
•  Two things to speed up:

- Determine earlier whether the branch is taken
- Compute the target PC earlier

•  Both can be done one cycle earlier
•  But, beware of data hazards

Branch Behaviour of Programs 

•  Integer programs: 13% forward conditional, 3% backward conditional, 4% unconditional
•  FP programs: 7%, 2%, and 1% respectively
•  67% of branches are taken

- 60% of forward branches are taken
- 85% of backward branches are taken

Handling Control Hazards 

•  Stall: naïve solution
•  Predict untaken, or predict not-taken:

- Treat every branch as not taken
- Only slightly more complex
- Do not update machine state until the branch outcome is known
- Done by clearing the IF/ID register of the fetched instruction

Predict Untaken Scheme

                    CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8
I (untaken branch)  IF  ID  EX  MEM WB
I+1                     IF  ID  EX  MEM WB
I+2                         IF  ID  EX  MEM WB
I+3                             IF  ID  EX  MEM WB

                    CC1 CC2  CC3  CC4  CC5  CC6 CC7 CC8
I (taken branch)    IF  ID   EX   MEM  WB
I+1                     IF   noop noop noop noop
Target                       IF   ID   EX   MEM WB
Target+1                          IF   ID   EX  MEM WB
Target+2                               IF   ID  EX  MEM

More Ways to Reduce Control Hazard Delays 


•  Predict taken:

- Treat every branch as taken
- Not of any use in DLX, since the branch target is not known before the branch condition anyway

•  May be of use in other architectures

•  Delayed branch:

- Instruction(s) after the branch are executed anyway!
- Sequential successors are called branch-delay slots

Delayed Branch

EITHER               OR                 CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8
I (untaken branch)   I (taken branch)   IF  ID  EX  MEM WB
I+1 (branch delay)   I+1 (branch delay)     IF  ID  EX  MEM WB
I+2                  Target                     IF  ID  EX  MEM WB
I+3                  Target+1                       IF  ID  EX  MEM WB
I+4                  Target+2                           IF  ID  EX  MEM

•  DLX has one delay slot
•  Note: another branch instruction cannot be put in the delay slot
•  Compiler has to fill the delay slots

Filling the Delay-Slot: Option 1 of 3

•  Fill the slot from before the branch instruction
•  Restriction: the branch must not depend on the result of the filled instruction
•  Improves performance: always

Filling the Delay-Slot: Option 2 of 3


•  Fill the slot from the target of the branch instruction
•  Restriction: should be OK to execute the instruction even if not taken
•  Improves performance: when the branch is taken

Filling the Delay-Slot: Option 3 of 3

•  Fill the slot from the fall-through of the branch
•  Restriction: should be OK to execute the instruction even if taken
•  Improves performance: when the branch is not taken

Helping the Compiler

•  Encode the compiler prediction in the branch instruction

- CPU knows whether the branch was predicted taken or not taken by the compiler
- Cancel or nullify if the prediction is incorrect
- Known as a canceling or nullifying branch

•  Options 2 and 3 can now be used without restrictions

Static Branch Prediction 

•  Predict-taken
•  Predict-untaken


•  Prediction based on direction (forward/backward)
•  Profile-based prediction

Static Misprediction Rates

Some Remarks 

•  Delayed branches are architecturally visible

- Strength as well as weakness
- Advantage: better performance
- Disadvantage: what if the implementation changes?

•  Deeper pipeline ==> more branch delays ==> delay-slots may no longer be useful

- More powerful dynamic branch prediction

•  Note: need to remember an extra PC while taking exceptions/interrupts
•  Slowdown due to mispredictions: 1 + Branch frequency x Misprediction rate x Penalty

Further Issues in Pipelining 

•  Exceptions
•  Instruction set issues
•  Multi-cycle operations

Exceptions and Pipelining 


•  What are exceptions?

•  I/O interrupt
•  System call
•  Tracing instruction execution, breakpoint
•  Integer/FP anomaly
•  Page fault
•  Misaligned memory access
•  Memory protection violation
•  Undefined instruction
•  Hardware malfunction/power failure

•  Also called interrupts or faults

Exceptions: The Nemesis of Pipelining 

•  While taking exceptions, ensure that the machine is in a "consistent" state
•  Exceptions can occur:

- In many pipeline stages
- Out of order

      CC1 CC2 CC3 CC4 CC5 CC6
LW    IF  ID  EX  MEM WB
ADD       IF  ID  EX  MEM WB

 Classification of Exceptions 

•  Synchronous vs. Asynchronous

- Asynchronous usually caused by devices external to the processor
- Asynchronous ==> can be handled after the current instruction (easier)

•  User requested vs. Coerced

- User requested ==> can be handled after the current instruction
- Coerced ==> unpredictable

•  User maskable vs. non-maskable
•  Within vs. between instructions

- Within ==> instruction cannot be completed, usually synchronous (harder)

•  Resume vs. Terminate

- Terminate process ==> easier


Exception Classification

Exception type            Synchronous?  Coerced?  Maskable?  Within instn.?  Resume?
I/O request               No            Yes       No         No              Yes
Sys. call                 Yes           No        No         No              Yes
Tracing/brk. pt.          Yes           No        Yes        No              Yes
ALU excpn.                Yes           Yes       Yes        Yes             Yes
Page fault                Yes           Yes       No         Yes             Yes
Misaligned mem. access    Yes           Yes       Yes        Yes             Yes
Protecn. violn.           Yes           Yes       No         Yes             Yes
Undefined instns.         Yes           Yes       No         Yes             No
H/W malfn./power failure  No            Yes       No         Yes             No

Restarting Execution 

•  Restartable: take the exception, save state, restart without affecting execution
•  Restarting:

- Force a trap instruction into the pipeline
- Until the trap, disable all writes for the faulting instruction and all subsequent ones
- Trap into the exception handling routine (OS)
- Need to save more than one PC for delayed branches

•  Precise Exceptions: all instructions prior to faulting one completed, but not any other

Exceptions in DLX

      CC1 CC2 CC3 CC4 CC5 CC6
LW    IF  ID  EX  MEM WB
ADD       IF  ID  EX  MEM WB

•  Exceptions can occur:

- In same cycle, or even out-of-order

•  Cannot always handle an exception at the moment it occurs

- Carry the instruction status in the pipeline latches
- In the WB stage, the exception corresponding to the earliest instruction is handled

More Complications in Pipelining 


•  Multiple write stages
•  Or, changing processor state in the middle of an instruction

- E.g., Auto-increment addressing mode in VAX

•  Updating memory state during instruction

- E.g., String copy instruction in VAX

 •  Implicitly set condition codes

- Problems in scheduling the delay slot, and during exceptions

•  Self-modifying code in 80x86!
•  Multi-cycle operations

MOVL  R1, R2
ADDL3 42(R1), 56(R1)+, @(R1)
SUBL2 R2, R3
MOVC3 @(R1)(R2), 74(R2), R3

Data hazards very complicated to determine!
VAX pipelines micro-instructions

Pipelining Multi-cycle Opns. 

•  Some operations take > 1 cycle (e.g. FP)
•  Handling multi-cycle operations in the pipeline:

- Multiple EX stages
- Multiple functional units


•  Two things to consider:

- Different units may take different # cycles
- Some units may not be pipelined

•  Corresponding definitions:

- Latency: # cycles between an instruction and another which can use its result
- Initiation/repeat interval: # cycles between issue of two operations of the same type

The Multi-cycle Pipeline

Pipeline Timing: An Example

        CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11
MULTD   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM  WB
ADDD        IF  ID  A1  A2  A3  A4  MEM WB
LD              IF  ID  EX  MEM WB

•  Additional details:

- We require more latches
- The ID/EX register must be expanded

More Hazards! 

•  Structural hazards:


- Divide unit is not pipelined
- Multiple writes possible in the same cycle

•  Data hazards:

- RAW is more frequent
- WAW is possible

•  Control hazards:

- Out-of-order completion ==> difficulty in handling exceptions

Multiple Writes/Cycle: An Example

                 CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11
MULTD F0,F4,F6   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM  WB
...                  IF  ID  EX  MEM WB
...                      IF  ID  EX  MEM WB
ADDD F2,F4,F6                IF  ID  A1  A2  A3  A4  MEM  WB
...                              IF  ID  EX  MEM WB

Multiple Writes/Cycle: Solution 

•  Provide multiple write ports
•  Or, detect and stall; two possibilities:

- Detect in the ID stage:

•  Instruction reserves the write port using a reservation register
•  Reservation register is shifted one bit each clock

- Detect at entry to the MEM or WB stage:

•  Easier to check
•  Can also give priority to the longer-latency operation
•  But, the stall can now be in two places
•  Stall may trickle back

 Handling WAW Hazards 

•  Occurs only when the result of ADDD is overwritten without any instruction using it!

- Otherwise, RAW hazard stall would have occurred


•  Hazard can be detected in the ID stage of the latter instruction
•  Two ways to handle:

- Delay issue of the load until ADDD enters MEM
- Stamp out the result of ADDD

Control Hazard Complications

•  An example:

- DIVF F0, F2, F4   // Finishes last; causes exception
- ADDF F10, F10, F8 // Finishes first
- SUBF F12, F12, F14 // Finishes second

•  Out-of-order completion causes problems!

- Precise exceptions are difficult to implement

Achieving Precise Exceptions  

•  Approach 1: Ostrich algorithm

- Don't care
- Maybe provide a slower precise mode

•  Example: special instructions to check for FP exceptions

•  Approach 2: allow instruction issue to continue only if previous instructions will complete without exception

-  Stall to maintain precise exceptions

•  Approach 3: save state to undo

- Two possibilities

•  History file: keep track of the original values of registers
•  Future file: keep track of the current value; main register file updated after all previous instructions are done

- More buffer space required
- Hazard checks and control become very complex

•  Approach 4: imprecise, but keep enough state for OS to recover


- Keep track of incomplete instructions
- OS then runs those instructions before returning control
- Complicated to execute these instructions properly!

Next Topic...

  •  Instruction Level Parallelism (ILP)

 Instruction Level Parallelism 

•  Pipelining achieves Instruction Level Parallelism (ILP)

- Multiple instructions in parallel

•  But, problems with pipeline hazards

- CPI = Ideal CPI + stalls/instruction
- Stalls = Structural + Data (RAW/WAW/WAR) + Control

•  How to reduce stalls?

- That is, how to increase ILP?

Techniques for Improving ILP 

•  Loop unrolling
•  Basic pipeline scheduling
•  Dynamic scheduling, scoreboarding, register renaming
•  Dynamic memory disambiguation
•  Dynamic branch prediction
•  Multiple instruction issue per cycle

-  Software and hardware techniques

Loop-Level Parallelism 

•  Basic block: straight-line code w/o branches
•  Fraction of branches: 0.15
•  ILP is limited!

- Average basic-block size is 6-7 instructions
- And, these may be dependent

•  Hence, look for parallelism beyond a basic block
•  Loop-level parallelism is a simple example of this


Loop-Level Parallelism: An Example 

•  Consider the loop:

for(int i = 1000; i >= 1; i = i-1) {

x[i] = x[i] + C; // FP

}

- Each iteration of the loop is independent of other iterations
- Loop-level parallelism

•  To convert it into ILP:

- Loop unrolling (static, dynamic)
- Vector instructions

The Loop, in DLX 

•  In DLX, the loop looks like:

Loop: LD   F0, 0(R1)   // F0 is array element
      ADDD F4, F0, F2  // F2 has the scalar 'C'
      SD   0(R1), F4   // Store result
      SUBI R1, R1, 8   // For next iteration
      BNEZ R1, Loop    // More iterations?

•  Assume:

- R1 is the initial address
- F2 has the scalar value 'C'
- Lowest address in the array is '8'

How Many Cycles per Loop?

CC1 Loop: LD F0, 0(R1)

CC2 stall

CC3 ADDD F4, F0, F2

CC4 stall
CC5 stall

CC6 SD 0(R1), F4


CC7 SUBI R1, R1, 8

CC8 stall

CC9 BNEZ R1, Loop

CC10 stall

 

Reducing Stalls by Scheduling

CC1 Loop: LD   F0, 0(R1)
CC2       SUBI R1, R1, 8
CC3       ADDD F4, F0, F2
CC4       stall
CC5       BNEZ R1, Loop
CC6       SD   8(R1), F4

•  Realizing that SUBI and SD can be swapped is non-trivial!
•  Overhead versus actual work:

- 3 cycles of work, 3 cycles of overhead

Unrolling the Loop 

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4    // No SUBI, BNEZ
      LD   F6, -8(R1)   // Note diff FP reg, new offset
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1) // Note diff FP reg, new offset
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1) // Note diff FP reg, new offset
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop

How Many Cycles per Loop?

Loop: LD   F0, 0(R1)    // 1 stall
      ADDD F4, F0, F2   // 2 stalls
      SD   0(R1), F4
      LD   F6, -8(R1)   // 1 stall
      ADDD F8, F6, F2   // 2 stalls
      SD   -8(R1), F8
      LD   F10, -16(R1) // 1 stall
      ADDD F12, F10, F2 // 2 stalls
      SD   -16(R1), F12
      LD   F14, -24(R1) // 1 stall
      ADDD F16, F14, F2 // 2 stalls
      SD   -24(R1), F16
      SUBI R1, R1, 32   // 1 stall

28 cycles per unrolled loop == 7 cycles per original loop iteration

 Scheduling the Unrolled Loop

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12  // was -16(R1), before SUBI moved up
      BNEZ R1, Loop
      SD   8(R1), F16   // in the branch delay slot

14 cycles per unrolled loop == 3.5 cycles per original loop iteration

Observations and Requirements  

•  Gain from scheduling is even higher for the unrolled loop!

- More parallelism is exposed on unrolling

•  Need to know that 1000 is a multiple of 4
•  Requirements:

- Determine that the loop can be unrolled
- Use different registers to avoid conflicts
- Determine that SD can be moved after SUBI, and find the offset adjustment

•  Understand dependences

Dependences 


•  Dependent instructions ==> cannot be in parallel
•  Three kinds of dependences:

- Data dependence (RAW)
- Name dependence (WAW and WAR)
- Control dependence

•  Dependences are properties of programs
•  Stalls are properties of the pipeline
•  Two possibilities:

- Maintain dependence, but avoid stalls
- Eliminate dependence by code transformation

Data Dependence

Name Dependence 

•  Two instructions use the same register/memory (name), but there is no flow of data

- Anti-dependence: WAR hazard

- Output dependence: WAW hazard

•  Can do register renaming - statically, or dynamically


Name Dependence in our Example

ILP: Recall 

•  Improving ILP == reducing stalls
•  Loop unrolling enlarges the basic block

- More parallelism
- More opportunity for better scheduling

•  Dependences:

- Data dependence
- Name dependence
- Control dependence

Handling Control Dependence 

•  Control dependence need not be maintained
•  We need to maintain:

- Exception behaviour: do not cause new exceptions
- Data flow: ensure the right data item is used

•  Speculation and conditional instructions are techniques to get around control dependence

Loop Unrolling: a Relook 

•  Our example:

for(int i = 1000; i >= 1; i = i-1) {

x[i] = x[i] + C; // FP

}

•  Consider:

for(int i = 1000; i >= 1; i = i-1) {

A[i-1] = A[i] + C[i]; // S1

B[i-1] = B[i] + A[i-1]; // S2

}


- S2 is dependent on S1
- S1 is dependent on its previous iteration; same with S2

•  Loop-carried dependence ==> loop iterations have to be executed in order

Removing Loop-Carried Dependence 

•  Another example:

for (int i = 1000; i >= 1; i = i-1) {

A[i] = A[i] + B[i]; // S1

B[i-1] = C[i] + D[i]; // S2

}

•  S1 depends on the prior iteration of S2

- Can be removed (no cyclic dependence)

A[1000] = A[1000] + B[1000];

for(int i = 1000; i >= 2; i = i-1) {

B[i-1] = C[i] + D[i]; // S2

A[i-1] = A[i-1] + B[i-1]; // S1

}

B[0] = C[1] + D[1];

Static vs. Dynamic Scheduling 

•  Static scheduling: limitations

- Dependences may not be known at compile time
- Even if known, the compiler becomes complex
- Compiler has to have knowledge of the pipeline

•  Dynamic scheduling:

- Handles dynamic dependences
- Simpler compiler
- Efficient even if the code was compiled for a different pipeline


Dynamic Scheduling 

•  For now, we will focus on overcoming data hazards
•  The idea:

- DIVD F0, F2, F4
- ADDD F10, F0, F8
- SUBD F12, F8, F14

•  SUBD can proceed without waiting for DIVD 

CDC 6600: A Case Study 

•  IF stage: fetch instructions onto a queue
•  ID stage is split into two stages:

- Issue: decode and check for structural hazards
- Read operands: check for data hazards

•  Execution may begin, and may complete, out of order

- Complications in exception handling
- Ignore for now

•  What is the logic for data hazard checks? 

The CDC Scoreboard 

•  Out-of-order completion ==> WAR and WAW hazards possible
•  Scoreboard: a data structure for all hazard detection in the presence of out-of-order execution/completion
•  All instructions "consult" the scoreboard to detect hazards

The Scoreboard Solution 

•  Three components:

- Stages of the pipeline:

•  Issue (ID1), Read-operands (ID2), EX, WB

- Data structure (in hardware)
- Logic for hazard detection, stalling

Scoreboard Control & the Pipeline Stages 


•  Issue (ID1): decode, check if the functional unit is free, and if a previous instruction has the same destination register

- No such hazard ==> scoreboard issues to the appropriate functional unit

•  Note: structural/WAW hazards prevented by stalling here
•  Note: stall here ==> IF queue will grow

•  Read operands (ID2):

- An operand is available if no earlier instruction is going to write it, or if the register is being written currently
- RAW hazards are resolved here

•  Execute (EX):

- Functional units perform execution
- Scoreboard is notified on completion

•  Write-Back (WB):

- Check for WAR hazards

•  Stall on detection
•  Write back otherwise

Some Remarks 

•  WAW causes a stall in ID1; WAR causes a stall in WB
•  No forwarding logic

- Output written as soon as it is available (and no WAR hazard)

•  Structural hazard possible in register read/write

-  CDC has 16 functional units, and 4 buses

The Scoreboard Data-Structures 

•  Instruction status
•  Functional unit status
•  Register result status
•  Randy Katz's CS252 slides... (Lecture 10, Spring 1996)

- Scoreboard pipeline control
- A detailed example


Limitations of the Scoreboard 

•  Speedup of 1.7 for (compiled) FORTRAN, speedup of 2.5 for hand-coded assembly
•  Scoreboard works only within a basic block!
•  Some hazards still cause stalls:

- Structural
- WAR, WAW

 

Control Dependence 

•  An example:

T1;

if p1 {

S1;

}

•  Statement S1 is control-dependent on p1, but T1 is not
•  What this means for execution:


- S1 cannot be moved before p1
- T1 cannot be moved after p1

Control Dependence in our Example

Dynamic Scheduling 

•  Better than static scheduling
•  Scoreboarding:

- Used by the CDC 6600
- Useful only within a basic block
- WAW and WAR stalls

•  Tomasulo's algorithm:

- Used in the IBM 360/91 for the FP unit
- Main additional feature: register renaming, to avoid WAR and WAW stalls

Register Renaming: Basic Idea 

•  Compiler maps memory --> registers, statically
•  Register renaming maps registers --> virtual registers, in hardware, dynamically
•  Should keep track of this mapping

- Make sure to read the current value

•  Num. virtual registers > num. ISA registers, usually
•  Virtual registers are known as reservation stations in the IBM 360/91

Tomasulo: Main Architectural Features 

•  Reservation stations: fetch and buffer operands as soon as they are available
•  Load/store buffers: hold the address (and data, for stores) to be loaded/stored
•  Distributed hazard detection and execution control
•  Common Data Bus (CDB): results passed from where generated to where needed
•  Note: the IBM 360/91 also had reg-mem instructions

The Tomasulo Architecture


Pipeline Stages 

•  Issue:

- Wait for a free Reservation Station (RS) or load/store buffer, and place the instruction there
- Rename registers in the process (WAR and WAW handled here)

•  Execute (EX):

- Monitor the CDB for the required operand
- Checks for RAW hazards in this process

•  Write Result (WB):

- Write to the CDB
- Picked up by any RS, store buffer, or register

 Register Renaming 

•  In an RS, operands are referred to by a tag (if the operand is not already in a register)
•  The tag refers to the RS (which contains the instruction) which will produce the required operand
•  Thus each RS acts as a virtual register

The Data Structure  

•  Three parts, like in the scoreboard:


- Instruction status
- Reservation stations, load/store buffers, register file
- Register status: which unit is going to produce the register value

•  This is the register --> virtual register mapping

Components of RS, Reg. File, Load/Store Buffers 

•  Each RS has:

- Op: the operation (+, -, x, /)
- Vj, Vk: the operands (if available)
- Qj, Qk: the tags of the RSs producing Vj/Vk (0 if Vj/Vk known)
- Busy: is the RS busy?

•  Each reg. in the reg. file and store buffer has:

- Qi: tag of the RS whose result should go to the reg. or the mem. locn. (blank ==> no such active RS)

•  Load and store buffers have:

- Busy field; a store buffer has the value V to be stored
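One way to picture this bookkeeping in C. The field names (Op, Vj/Vk, Qj/Qk, Busy, Qi) follow the slide; the types, sizes, and the broadcast helper are assumptions of this sketch:

#include <stdint.h>

enum fu_op { FU_ADD, FU_SUB, FU_MUL, FU_DIV };

struct reservation_station {
    enum fu_op op;   /* the operation to perform                           */
    double vj, vk;   /* operand values, once known                         */
    uint8_t qj, qk;  /* tags of RSs producing vj/vk; 0 = value present     */
    int busy;        /* is this RS in use?                                 */
};

struct register_status {
    uint8_t qi;      /* tag of the RS that will write this register; 0 = none */
};

struct store_buffer {
    uint32_t addr;   /* effective address                  */
    double v;        /* value to store, once known         */
    uint8_t qi;      /* tag of the RS producing the value  */
    int busy;
};

/* Write Result stage: when a result with tag t appears on the CDB,
   every waiting RS picks it up and clears the corresponding tag. */
void cdb_broadcast(struct reservation_station *rs, int n,
                   uint8_t t, double value)
{
    for (int i = 0; i < n; i++) {
        if (rs[i].qj == t) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == t) { rs[i].vk = value; rs[i].qk = 0; }
    }
}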

Maintaining the Data Structure 

•  Issue:

- Wait until: RS or buffer empty
- Updates: Qj, Qk, Vj, Vk, Busy of the RS/buffer; maintain the register mapping (register status)

•  Execute:

- Wait until: Qj=0 and Qk=0 (operands available)

•  Write result:

- CDB result picked up by RSs (update Qj, Qk, Vj, Vk), store buffers (update Qi, V), register file (update register status)
- Update Busy of the RS which finished

Some Examples 

•  Randy Katz's CS252 slides... (Lecture 11, Spring 1996)
•  Dynamic loop unrolling example from the text


Dynamic Loop Unrolling

Loop: LD   F0, 0(R1)   // F0 is array element
      ADDD F4, F0, F2  // F2 has the scalar 'C'
      SD   0(R1), F4   // Store result
      SUBI R1, R1, 8   // For next iteration
      BNEZ R1, Loop    // More iterations?

•  Assume the branch is predicted to be taken
•  Denote: load buffers as L1, L2...; ADDD RSs as A1, A2...
•  First loop: F0 --> L1, F4 --> A1
•  Second loop: F0 --> L2, F4 --> A2

Summary Remarks 

•  Memory disambiguation required
•  Drawbacks of Tomasulo:

- Large amount of hardware
- Complex control logic
- CDB is a performance bottleneck

•  But:

- Required if designing for an old ISA
- Multiple issue ==> register renaming and dynamic scheduling required

•  Next class: branch prediction

 


 Dealing with Control Hazards 

•  Software techniques:

- Branch delay slots
- Software branch prediction

•  Canceling or nullifying branches

- Misprediction rates can be high
- Worse if multiple issue per cycle

•  Hence, hardware/dynamic branch prediction

Branch Prediction Buffer 

•  PC --> Taken/Not-Taken (T/NT) mapping
•  Can use just the last few bits of the PC

- Prediction may be that of some other branch
- OK, since correctness is not affected

•  Shortcoming of this prediction scheme:

- Branch mispredicted twice for each execution of a loop
- Bad if the loop is small


for(int i = 0; i < 10; i++) {

x[i] = x[i] + C;

}

Two-Bit Predictor 

•  Have to mispredict twice before changing the prediction

- Built-in hysteresis

•  General case is an n-bit predictor

- 0 to (2^n)-1 saturating counter
- 0 to 2^(n-1) - 1 ==> predict not-taken
- 2^(n-1) to (2^n)-1 ==> predict taken

•  Experimental studies: 2-bit as good as n-bit 
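A minimal C sketch of such a predictor; the table size and PC indexing are assumptions, only the 2-bit saturating behaviour comes from the slides:

#include <stdio.h>

#define INDEX_BITS 12
#define TABLE_SIZE (1 << INDEX_BITS)

/* 2-bit saturating counters: 0,1 predict not-taken; 2,3 predict taken. */
static unsigned char counter[TABLE_SIZE];

static int predict(unsigned pc)            /* 1 = taken */
{
    return counter[(pc >> 2) & (TABLE_SIZE - 1)] >= 2;
}

static void update(unsigned pc, int taken)
{
    unsigned char *c = &counter[(pc >> 2) & (TABLE_SIZE - 1)];
    if (taken  && *c < 3) (*c)++;          /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;          /* saturate at 0 */
}

int main(void)
{
    /* A loop branch taken 9 times then not taken, repeated: after
       warm-up, the 2-bit counter mispredicts only the loop exit,
       not twice per loop as a 1-bit scheme would. */
    int miss = 0;
    for (int rep = 0; rep < 10; rep++)
        for (int i = 0; i < 10; i++) {
            int taken = (i != 9);
            miss += (predict(0x400) != taken);
            update(0x400, taken);
        }
    printf("mispredictions: %d / 100\n", miss);   /* 12 */
    return 0;
}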

Implementing Branch Prediction Buffers 

•  Implementing branch prediction buffers:

- Small cache accessed along with the instruction in IF
- Or, additional 2 bits in the instruction cache

•  Note: a branch prediction buffer is not useful for the DLX pipeline

- Branch target not known earlier than branch condition

Prediction Performance


 

•  4096 entries in the prediction buffer
•  SPEC89, IBM Power architecture

Improving Branch Prediction 

•  Two ways: increase buffer size, improve accuracy

 Improving Prediction Accuracy 


•  Predict branches based on outcomes of recent other branches

  if(aa == 2) {

aa = 0;

}

if(bb == 2) {

bb = 0;

}

if(aa == bb) {

// Do something

}  

•  Correlating, or two-level predictor

Two-Level Predictor 

•  There are effectively two predictors for each branch:

- Depending on whether the previous branch was T/NT

Prediction bits   Prediction if last branch NT   Prediction if last branch T
NT/NT             NT                             NT
NT/T              NT                             T
T/NT              T                              NT
T/T               T                              T

•  The last predictor was a (1,1) predictor

- One bit each of history, and prediction

•  General case is an (m,n) predictor

- m bits of history, n bits of prediction

•  How to implement?

- Have an m-bit shift register

Cost of Two-Level Predictor 

•  Number of bits required:

- Num. branch entries x 2^m x n

•  How many bits in a 4096-entry (0,2) predictor?

- 8K

•  How many branch entries in an 8K-bit (2,2) predictor?

- 1K
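The storage-cost arithmetic, coded for checking:

#include <stdio.h>

/* Storage for an (m,n) predictor: entries x 2^m x n bits. */
static long predictor_bits(long entries, int m, int n)
{
    return entries * (1L << m) * n;
}

int main(void)
{
    printf("4096-entry (0,2): %ld bits\n", predictor_bits(4096, 0, 2)); /* 8192 = 8K */
    /* Inverted: entries that fit in an 8K-bit (2,2) predictor. */
    printf("8K-bit (2,2):     %ld entries\n", 8192L / ((1L << 2) * 2)); /* 1024 = 1K */
    return 0;
}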

Performance of (2, 2) Predictor

 

Branch Target Buffer 

•  Branch prediction buffer is not useful for DLX

- Need to know the target address by the end of IF

•  Store the branch target address also

- Branch target buffer, or cache


•  Access branch target buffer in IF cycle

     - Hit ==> predicted branch target known at the end of IF

     - We also need to know if the branch is predicted T/NT

 

Lookup based on PC ==> predicted target

•  No entry found ==> (Target = PC + 4)
•  Exact match of PC is important

- Since we are predicting even before knowing that it is a branch instruction
- Hardware is similar to a cache

•  Need to store predicted PC only for taken predictions

Steps in Using a Target Buffer

 

Penalties in Branch Prediction

Buffer hit?   Branch taken?   Penalty
Yes           Yes             0
Yes           No              2
No            -               2

•  Given a prediction accuracy of p, a buffer hit rate of h, and a taken-branch frequency of f, what is the branch penalty?

- h x (1-p) x 2 + (1-h) x f x 2
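The penalty expression in C; the sample values of p, h, and f are hypothetical, just for illustration:

#include <stdio.h>

/* Average branch penalty (cycles/branch) per the table above:
   2 cycles for a buffer hit that is mispredicted,
   2 cycles for a buffer miss on a taken branch. */
static double branch_penalty(double p, double h, double f)
{
    return h * (1.0 - p) * 2.0 + (1.0 - h) * f * 2.0;
}

int main(void)
{
    /* Assumed: 90% accuracy, 90% hit rate, 60% of branches taken. */
    printf("penalty = %.2f cycles/branch\n", branch_penalty(0.9, 0.9, 0.6));
    return 0;
}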

Storing Target Instructions 

•  Directly store instructions instead of target address

- Target buffer access is now allowed to take longer
- Or, branch folding can be achieved

•  Replace the fetched instruction with that found in the target buffer entry
•  Zero-cycle unconditional branch; may be conditional as well

Increasing ILP through Multiple Issue 

•  With at most one issue per cycle, min CPI possible is 1

- But there are multiple functional units
- Hence, use multiple issue

•  Two ways to do multiple issue

- Superscalar processor

•  Issue a varying number of instructions per cycle
•  Static or dynamic scheduling

- Very Long Instruction Word (VLIW)

•  Issue a fixed number of instructions

Superscalar DLX 

•  Simple version: two instructions issued per cycle

- One integer (load, store, branch, integer ALU) and one FP
- Instructions paired and aligned on 64-bit boundaries - int first, FP next

            CC1   CC2   CC3   CC4   CC5   CC6
Integer     IF    ID    EX    MEM   WB
FP          IF    ID    EX    MEM   WB
Integer           IF    ID    EX    MEM   WB
FP                IF    ID    EX    MEM   WB

•  No conflicts, almost...

- Assuming separate register sets, only FP load, store, move cause problems

•  Structural hazard on a register port
•  New RAW hazard between a pair of instructions

- Structural hazard:

•  Detect, and do not issue the FP operation
•  Or, provide additional register ports

- RAW hazard:

•  Detect, and do not issue the FP operation
•  Also, result of LD cannot be used for 3 instns.

Static Scheduling in the Superscalar DLX: An Example 

Loop: LD   F0, 0(R1)     // F0 is array element
      ADDD F4, F0, F2    // F2 has the scalar 'C'
      SD   0(R1), F4     // Stored result
      SUBI R1, R1, 8     // For next iteration
      BNEZ R1, Loop      // More iterations?

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)    ADDD F4, F0, F2
      LD   F14, -24(R1)    ADDD F8, F6, F2
      LD   F18, -32(R1)    ADDD F12, F10, F2
      SD   0(R1), F4       ADDD F16, F14, F2
      SD   -8(R1), F8      ADDD F20, F18, F2
      SD   -16(R1), F12
      SUBI R1, R1, #40
      SD   16(R1), F16     // 16 - 40 = -24
      BNEZ R1, Loop
      SD   8(R1), F20      // 8 - 40 = -32; fills the branch delay slot

Dynamic Scheduling in the Superscalar DLX 

•  Scoreboard or Tomasulo can be applied
•  Should preserve in-order issue!

- Use separate data structures for Int and FP

•  When the instruction pair has a dependence

- We wish to issue both in the same cycle
- Two approaches:

•  Pipeline the issue stage, so that it runs twice as fast
•  Exclude load/store buffers from the set of RSs

Multiple Issue using VLIW 

•  Superscalar ==> too much hardware

- For hazard detection, scheduling

•  Alternative: let compiler do all the scheduling

- VLIW (Very Long Instruction Word)
- E.g., a VLIW may include 2 Int, 2 FP, 2 mem, and a branch

Limitations to Multiple Issue 

•  Why not 10 issues per cycle? Why not 20?
•  Three limitations:

- Inherent ILP limitations in programs
- Hardware costs (even for VLIW)

•  Memory/register bandwidth

- Implementation issues:


•  Superscalar: complexity of hardware logic
•  VLIW: increased code size, binary compatibility problems

Support for ILP 

•  Software (compiler) support
•  Hardware support
•  Combination of both

Compiler Support for ILP 

•  Loop unrolling:

- Dependence analysis is a major component
- Analysis is simple when array indices are linear in the loop variable (called affine indices)

•  Limitations to dependence analysis:

- Pointers
- Indirect indexing
- Analysis has to consider corner cases too

•  Two important techniques:

- Software pipelining
- Trace scheduling

•  Software pipelining: reorganize a loop such that each iteration is made from instructions chosen from different iterations of the original loop

Software Pipelining


 

Software Pipelining in Our Example

Loop: LD   F0, 0(R1)     // F0 is array element
      ADDD F4, F0, F2    // F2 has the scalar 'C'
      SD   0(R1), F4     // Stored result
      SUBI R1, R1, 8     // For next iteration
      BNEZ R1, Loop      // More iterations?

Original iterations:                Software-pipelined loop:

Iter i:   LD   F0, 0(R1)
          ADDD F4, F0, F2           Loop: SD   16(R1), F4
          SD   0(R1), F4                  ADDD F4, F0, F2
Iter i+1: LD   F0, 0(R1)                  LD   F0, 0(R1)
          ADDD F4, F0, F2                 SUBI R1, R1, 8
          SD   0(R1), F4                  BNEZ R1, Loop
Iter i+2: LD   F0, 0(R1)
          ADDD F4, F0, F2

Trace Scheduling 


 Hardware Support for Speculation 

•  Conditional or predicated instructions
•  Execute on condition, annul otherwise
•  Example: conditional move

if (A == 0) { S = T; }

// Without a conditional move (R1 holds A, R2 holds S, R3 holds T):
      BNEZ  R1, L
      MOV   R2, R3
L:

// With a conditional move:
      CMOVZ R2, R3, R1

•  Control dependence has been eliminated

-  Dependence resolution now moves to WB stage

Scheduling Using Conditional Instructions 

Before scheduling:                  After scheduling with a conditional load:

LW   R1, 40(R2)   ADD R3, R4, R5   LW   R1, 40(R2)         ADD R3, R4, R5
                  ADD R6, R3, R7   LWC  R8, 20(R12), R10   ADD R6, R3, R7
BEQZ R10, L                        BEQZ R10, L
LW   R8, 20(R12)                   LW   R9, 0(R8)
LW   R9, 0(R8)

Empty slot filled, stall for the last load eliminated

Limitations of Conditional Instructions

•  Usefulness limitations:

- Condition must not be delayed due to data dependence
- Useful only for simple alternative sequences

•  Performance limitations:

- An annulled conditional instruction is equivalent to a no-op/stall

•  Except when filling an anyway empty slot

- Speed penalty in terms of higher clock cycle time

Speculation 

•  Wish to move instructions across branches

- To eliminate possible stalls
- For better scheduling
- Appropriate conditional instructions may not always exist

•  Example:

if (N == 0) {
    A = *X;
} else {
    A++;
}

Speculation: An Example  

Before speculation:                     After speculation:

      LW   R1, 0(R2)   // Load N              LW   R1, 0(R2)    // Load N
      BNEZ R1, L1      // Test N              LW   R3, 0(R4)    // Load X
      LW   R3, 0(R4)   // Load X              LW   R5, 0(R3)    // Load *X
      LW   R5, 0(R3)   // Load *X             BEQZ R1, L3       // Test N
      JMP  L2          // Skip else           LW   R5, 0(R6)    // Load A
L1:   LW   R5, 0(R6)   // Load A              ADDI R5, R5, #1   // Increment
      ADDI R5, R5, #1  // Increment     L3:   SW   0(R6), R5    // Store A
L2:   SW   0(R6), R5   // Store A

•  Compiler predicts that the "then" clause is most likely
•  Speculatively schedules the "then" clause

- Eliminates 2 stalls, and the JMP instruction 

Exception Behaviour

LW   R1, 0(R2)    // Load N
LW   R3, 0(R4)    // Load X
LW   R5, 0(R3)    // Load *X
BEQZ R1, L3       // Test N
LW   R5, 0(R6)    // Load A
ADDI R5, R5, #1   // Increment
L3:  SW   0(R6), R5    // Store A

•  Terminating vs. non-terminating exceptions
•  While doing such scheduling:

- Correct program ==> no extra terminating exceptions
- Incorrect program ==> should preserve any terminating exceptions

Preserving Exception Behaviour 

•  Approach 1: ignore terminating exceptions for speculated instructions

- Incorrect programs may not be terminated

LW   R1, 0(R2)    // Load N
LW*  R3, 0(R4)    // Load X, speculated
LW*  R5, 0(R3)    // Load *X, speculated
BEQZ R1, L3       // Test N
LW   R5, 0(R6)    // Load A
ADDI R5, R5, #1   // Increment
L3:  SW   0(R6), R5    // Store A

•  Approach 2: poison bits

- Set the poison bit in the result register of a speculated instruction, if an exception occurs
- Raise an exception if any other instruction uses that register

LW   R1, 0(R2)    // Load N
LW*  R3, 0(R4)    // Load X, set poison bit on exception
LW*  R5, 0(R3)    // Load *X, set poison bit on exception
BEQZ R1, L3       // Test N
LW   R5, 0(R6)    // Load A
ADDI R5, R5, #1   // Increment
L3:  SW   0(R6), R5    // Store A

•  Extra register R10 gets used up
•  Extra instruction in "else" clause

•  Approach 3: buffer results

- Instructions boosted past branches, flagged as boosted (in the opcode)
- Results of boosted instructions forwarded and used, like in Tomasulo
- When the branch is reached, the result of speculation is checked

•  Result committed if prediction correct
•  Result discarded otherwise

- Solution close to fully hardware-based speculation

Boosted Instructions: An Example

C code:                     Boosted schedule:

if (N == 0) {                     LW    R1, 0(R2)   // Load N
    A = *X;                       LW+   R3, 0(R4)   // Load X, boosted
    N++;                          ADDI+ R1, R1, #1  // N++, boosted
} else {                          LW+   R5, 0(R3)   // Load *X, boosted
    A++;                          BEQZ  R1, L3      // Test N
}                                 LW    R5, 0(R6)   // Load A
                                  ADDI  R5, R5, #1  // Increment
                            L3:   SW    0(R6), R5   // Store A

•  The "+" denotes a boosted instruction, which is boosted across the next branch, which is predicted taken

Hardware-Based Speculation 

•  Combination of branch prediction, speculation, and dynamic scheduling
•  Data flow execution: an instruction executes as soon as the data it requires is ready
•  Advantages over the software approach:

- Memory disambiguation is better
- Better branch prediction
- Precise exception model
- No book-keeping code
- Works for "old" software too

•  Disadvantage: hardware cost and complexity

Speculation in Tomasulo 

•  Speculate using branch prediction
•  Go ahead and execute based on speculation
•  Use results of speculated instructions for other instructions, just as in Tomasulo
•  But, commit a result only after knowing if the speculation was correct

- In-order commit
- Using a reorder buffer
- Also achieves precise exceptions

The Reorder Buffer 

•  Similar to the store buffer in functionality
•  Replaces the store and load buffers
•  Virtual registers are the reorder buffer entries

- The reservation stations are not virtual registers anymore

•  Reorder Buffer Data Structure (a sketch follows below)

- Instruction type: branch, store, or ALU/load
- Destination: register or memory location
- Value: which has to be committed
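A minimal C sketch of one such entry (all field and type names are illustrative assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    enum rob_type { ROB_BRANCH, ROB_STORE, ROB_ALU_OR_LOAD };

    /* One reorder-buffer entry; entries are allocated and committed
     * in circular (FIFO) order. */
    struct rob_entry {
        enum rob_type type;     /* branch, store, or ALU/load        */
        uint32_t      dest;     /* register number or memory address */
        uint64_t      value;    /* result waiting to be committed    */
        bool          ready;    /* has the value been produced yet?  */
    };

    struct rob {
        struct rob_entry entry[64];
        unsigned head, tail;    /* commit from head, issue at tail   */
    };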


Tomasulo Using the Reorder Buffer 

 Pipeline Stages 

•  Issue, EX, WB, Commit
•  Issue allocates a reorder buffer entry

- Entries allocated in circular fashion

•  Commit writes result back to destination

- Frees up the reorder buffer entry
- For a branch instruction:

•  Prediction correct ==> commit
•  Else, flush the reorder buffer

Summary of ILP Techniques 

•  Software techniques

- Compiler scheduling, Loop unrolling, Software pipelining, Trace scheduling (VLIW), Static branch prediction, Speculation

•  Hardware support for software

- Conditional instructions, poison bits


•  Hardware techniques

- Hardware scheduling, Dynamic branch prediction, Hardware speculation

•  Which hardware technique(s) to use? 

How Much ILP is Available? 

•  Assume infinite hardware resources

- Infinite virtual registers
- Perfect branch prediction, jump prediction
- Perfect memory disambiguation

•  Every instruction is scheduled as early as possible

-  Restricted only by data flow

Available ILP in Programs 

 

 Window Size Limitation 


 

Effect of Imperfect Branch Predictions 

 

Effect of Finite Virtual Register Set


   

A Realizable Processor 

•  Up to 64 instruction issues per cycle
•  Selective predictor with 1K entries, and a 16-entry return predictor
•  Perfect memory disambiguation
•  Register renaming with 64 integer virtual registers, and 64 FP virtual registers

ILP for a Realizable Processor 

 


Memory Hierarchy

•  Two principles:

- Smaller is faster
- Principle of locality

•  Processor speed grows much faster than memory speed
•  Registers - Cache - Memory - Disk

- Upper level vs. lower level

•  Cache design

Cache Design Questions  

•  Cache is arranged in terms of blocks

- To take advantage of spatial locality

•  Design choices:

- Q1: block placement - where to place a block in the upper level?
- Q2: block identification - how to find a block in the upper level?
- Q3: block replacement - which block to replace on a miss?
- Q4: write strategy - what happens on a write?

Block Placement: Fully Associative  


  Block Placement: Direct

  

  Block Placement: Set Associative

  

  Continuum of Choices

•  Memory has n blocks, cache has m blocks
•  Fully associative is the same as set associative with one set (m-way set associative)
•  Direct placement is the same as 1-way set associative (with m sets)
•  Most processors use direct, 2-way/4-way set associative

Block Identification  

•  How many different blocks of memory can be mapped (at different times) to a cache block?

- Fully associative: n
- Direct: n/m
- k-way set associative: k*n/m

•  Each cache block has a tag saying which block of memory is currently present in it

- A valid bit is set to 0 if no memory block is in the cache block currently

•  How many sets in the cache?

     - m/k

•  How many bits to identify the correct set?

     - log2(m/k) bits of index

•  How many blocks in memory?

     - n; log2(n) bits represent the block number in memory

•  How many bits for the tag?

     - log2(n) - log2(m/k) = log2(k*n/m)

•  Given a memory address:

- Select the set using the index, the block from the set using the tag
- Select the location from the block using the block offset
- tag + index = block address (see the address-decomposition sketch below)
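A small C sketch of the address split (the bit widths are illustrative; the Alpha 21064 cache described later happens to use a 5-bit offset and an 8-bit index):

    #include <stdint.h>

    #define OFFSET_BITS 5    /* 32-byte blocks */
    #define INDEX_BITS  8    /* 256 sets */

    uint32_t block_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
    uint32_t set_index(uint32_t addr)    { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    uint32_t tag(uint32_t addr)          { return addr >> (OFFSET_BITS + INDEX_BITS); }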

  Block Replacement Policy

•  Cache miss ==> bring the block into the cache

- What if there is no free block in the set?
- Need to replace a block

•  Possible policies:

- Random
- Least-Recently Used (LRU)

•  Lower miss rate, but harder to implement

Replacement Policy Performance

  

Write Strategy

  •  Reads are dominant

- All instructions are read
- Even for data, loads dominate over stores

•  Reads can be fast

- Can read from multiple blocks while performing tag comparison
- Cannot do the same with writes

•  Should pay attention to write performance too!

  When do Writes go to Memory?  


•  Write through: each write is mirrored to memory also

- Easier to implement

•  Write back: write to memory only when block is replaced

- Faster writes
- Some writes do not go to memory at all!
- But, a read miss may cause more delay

•  The block being replaced has to be written back
•  Optimize using a dirty bit

- Also, bad for multiprocessors and I/O

Write Stalls

•  In write through, may have to stall waiting for write to complete

- Called a write stall
- Can employ a write buffer to enable the processor to proceed during the write-through

What to do on a Write Miss?  

•  Write-allocate (or, fetch-on-write): load the block into the cache on a write miss
•  No-write-allocate (or, write-around): just write directly to main memory
•  Write-allocate usually goes with write-back, and no-write-allocate goes with write-through

The Alpha AXP 21064 Cache  

•  34-bit physical address

- 29 bits for block address
- 5 bits for block offset

•  8 KB cache, direct-mapped

- 8 bits for index
- 29 - 8 = 21 bits for tag

Steps in Memory Read  

•  Four steps:

- Step-1: CPU puts out the address
- Step-2: Index selection
- Step-3: Tag comparison, read from data
- Step-4: Data returned to CPU (assuming a hit)

•  This takes two cycles

Steps in Memory Write

•  Write-through policy is used
•  Write buffer with four entries

- Each entry can have up to 4 words from the same block
- Write merging: successive writes to the same block use the same write-buffer entry

  Some More Details  

•  What happens on a miss?

- Cache sends a signal to the CPU asking it to wait
- No replacement policy required (direct mapped)
- Write miss ==> write-around

•  8KB separate instruction cache

Separate versus Unified Cache

•  Direct-mapped cache, 32-byte blocks, SPEC92, on DECstation 5000
•  Unified cache has twice the size of the I-cache or D-cache
•  75% instruction references

Miss rates:

Size    I-Cache   D-Cache   U-Cache
1KB     3.06%     24.61%    13.34%
2KB     2.26%     20.57%    9.78%
4KB     1.78%     15.94%    7.24%
8KB     1.10%     10.19%    4.57%
16KB    0.64%     6.47%     2.87%
32KB    0.39%     4.82%     1.99%
64KB    0.15%     3.77%     1.35%
128KB   0.02%     2.88%     0.95%

 Cache Performance

•  Miss rate is an important metric

- But not the only one

Avg. mem. access time = Hit time + Miss rate X Miss penalty

•  Hit time, Miss penalty can be expressed

- In absolute terms,
- Or, in terms of number of clock cycles

•  Miss rate decrease may imply reduced performance

- Example: unified vs. split cache

  CPU Performance, with Cache

CPU time = (CPU cycles + Mem. stall cycles) x Cycle time

Mem. stall cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty

CPU time = IC x Cycle time x (CPI + Mem. accesses per instn. x Miss rate x Miss penalty)

 

Effect of Cache on Performance


•  Some typical values:

- CPI = 1
- Mem. accesses per instn. = 1.35
- Miss rate = 2%
- Miss penalty = 50 cycles

•  Mem. stalls per instn. = 1.35 x 0.02 x 50 = 1.35, comparable to the CPI!

- Cache behaviour is an important component of performance
- More important for lower CPI

Improving Cache Performance

  Avg. mem. access time = Hit time + Miss rate X Miss penalty

  •  Three possibilities:

- Reduce miss rate
- Reduce miss penalty
- Reduce hit time

•  Beware of slowing down the CPU!
•  Example:

- Set associative ==> potentially higher cycle time

Cache Misses: The Three C's  

•  Compulsory: first access to a block

- Also called cold start, or first reference misses

•  Capacity: misses due to the cache being small
•  Conflict: two memory blocks mapping onto the same cache block

- Also called collision, or interference misses

The Three C's

Cache size  Assoc.  Compulsory  Capacity  Conflict  Total    Frac.Comp  Frac.Cap  Frac.Conf
1KB         1-way   0.20%       8.00%     5.20%     13.40%   0.01       0.60      0.39
1KB         2-way   0.20%       8.00%     2.30%     10.50%   0.02       0.76      0.22
1KB         4-way   0.20%       8.00%     1.30%     9.50%    0.02       0.84      0.14
1KB         8-way   0.20%       8.00%     0.50%     8.70%    0.02       0.92      0.06
2KB         1-way   0.20%       4.40%     5.20%     9.80%    0.02       0.45      0.53
2KB         2-way   0.20%       4.40%     3.00%     7.60%    0.03       0.58      0.39
2KB         4-way   0.20%       4.40%     1.80%     6.40%    0.03       0.69      0.28
2KB         8-way   0.20%       4.40%     0.80%     5.40%    0.04       0.81      0.15
4KB         1-way   0.20%       3.10%     3.90%     7.20%    0.03       0.43      0.54
4KB         2-way   0.20%       3.10%     2.40%     5.70%    0.04       0.54      0.42
4KB         4-way   0.20%       3.10%     1.60%     4.90%    0.04       0.63      0.33
4KB         8-way   0.20%       3.10%     0.60%     3.90%    0.05       0.79      0.15
8KB         1-way   0.20%       2.30%     2.10%     4.60%    0.04       0.50      0.46
8KB         2-way   0.20%       2.30%     1.30%     3.80%    0.05       0.61      0.34
8KB         4-way   0.20%       2.30%     1.00%     3.50%    0.06       0.66      0.29
8KB         8-way   0.20%       2.30%     0.40%     2.90%    0.07       0.79      0.14
16KB        1-way   0.20%       1.50%     1.20%     2.90%    0.07       0.52      0.41
16KB        2-way   0.20%       1.50%     0.50%     2.20%    0.09       0.68      0.23
16KB        4-way   0.20%       1.50%     0.30%     2.00%    0.10       0.75      0.15
16KB        8-way   0.20%       1.50%     0.20%     1.90%    0.11       0.79      0.11
32KB        1-way   0.20%       1.00%     0.80%     2.00%    0.10       0.50      0.40
32KB        2-way   0.20%       1.00%     0.20%     1.40%    0.14       0.71      0.14
32KB        4-way   0.20%       1.00%     0.10%     1.30%    0.15       0.77      0.08
32KB        8-way   0.20%       1.00%     0.10%     1.30%    0.15       0.77      0.08
64KB        1-way   0.20%       0.70%     0.50%     1.40%    0.14       0.50      0.36
64KB        2-way   0.20%       0.70%     0.10%     1.00%    0.20       0.70      0.10

Reducing Cache Misses

  •  Capacity: increase cache size

- Thrashing can happen otherwise

•  Conflict: increase associativity

- But, greater complexity, slower hit time

•  Compulsory: increase block size

- But, greater miss penalty!

Technique-1: Larger Blocks

•  Reduces compulsory misses

- By improving spatial locality

•  Increases miss penalty
•  Also, may increase conflict/capacity misses

Miss rates:

Block size   1KB      4KB     16KB    64KB    256KB
16B          15.05%   8.57%   3.94%   2.04%   1.09%
32B          13.34%   7.24%   2.87%   1.35%   0.70%
64B          13.76%   7.00%   2.64%   1.06%   0.51%
128B         16.64%   7.78%   2.77%   1.02%   0.49%


•  Miss penalty depends on:

- Memory latency, memory bandwidth

•  Assuming a latency of 40 cycles, and a bandwidth of 16 bytes per 2 cycles, the AMAT values are:

Block size   Miss penalty   1KB     4KB    16KB   64KB   256KB
16B          42             7.32    4.60   2.66   1.86   1.46
32B          44             6.87    4.19   2.26   1.59   1.31
64B          48             7.61    4.36   2.27   1.51   1.25
128B         56             10.31   5.35   2.55   1.57   1.27

Technique-2: Higher Associativity

•  Reduces conflict misses
•  But, increases hit time
•  8-way is as good as fully associative
•  Rule of thumb:

- A direct-mapped cache of size N has the same miss rate as a 2-way cache of size N/2

Technique-3: Victim Cache  

•  Small cache of "victim" blocks, which were thrown out recently

- Fully associative

•  Reduces conflict misses
•  Does not affect cycle time, or miss penalty
•  Study: a 4-entry victim cache removed 20-95% of the conflict misses in a 4KB direct-mapped cache

Technique-4: Pseudo- Associative Cache

•  Also called column associative
•  A hit proceeds just as in a direct-mapped cache
•  Miss ==> check the other block of the "set" (by flipping the MSB of the index)
•  May need to swap contents within the set

Miss rate(pseudo) = Miss rate(2-way)

Miss penalty(pseudo) = Miss penalty(1-way)

Hit time(pseudo) = Hit time(1-way) + Alt. hit rate x k, where k is the extra time for the alternate probe

Alt. hit rate = Miss rate(1-way) - Miss rate(2-way)

Technique-5: Hardware Prefetching  

•  Fetch more than required, on a miss

- Prefetch into the cache, or into another small buffer (faster than memory)

Avg. mem. access time = Hit time + Miss rate x Prefetch hit rate x k + Miss rate x Prefetch miss rate x Miss penalty, where k is the time to access the prefetch buffer

  Technique-6: Compiler Controlled Prefetch  

•  Special instructions for prefetching data

- Non-faulting instructions are most useful
- The CPU should be able to proceed in parallel with the cache

•  Non-blocking cache

•  Example:

for (i = 0; i < 3; i++) {
    for (j = 0; j < 100; j++) {
        a[i][j] = b[j][0] * b[j+1][0];
    }
}
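A sketch of what compiler-inserted prefetches might look like for this loop, using GCC/Clang's __builtin_prefetch intrinsic (the prefetch distance of 8 iterations is an illustrative choice, not from the slides):

    int a[3][100], b[101][3];

    void compute(void) {
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 100; j++) {
                /* Fetch data needed ~8 iterations ahead; the bound checks
                 * keep the prefetch addresses inside the arrays. */
                if (j + 8 < 101) __builtin_prefetch(&b[j + 8][0]);
                if (j + 8 < 100) __builtin_prefetch(&a[i][j + 8]);
                a[i][j] = b[j][0] * b[j + 1][0];
            }
        }
    }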

Technique-7: Compiler Optimizations

•  Merging arrays - improves spatial locality

int val[1000];                struct merge { int val; int key; };
int key[1000];                struct merge M[1000];

•  Loop interchange - improves spatial locality

for (j = 0; j < 100; j++)          for (i = 0; i < 100; i++)
    for (i = 0; i < 100; i++)          for (j = 0; j < 100; j++)
        x[i][j] = 0;                       x[i][j] = 0;

•  Loop fusion - improves temporal locality

for (i = 0; i < 100; i++)          for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)          for (j = 0; j < 100; j++) {
        a[i][j] = b[i][j] + c[i][j];       a[i][j] = b[i][j] + c[i][j];
                                           d[i][j] = 2*a[i][j];
for (i = 0; i < 100; i++)              }
    for (j = 0; j < 100; j++)
        d[i][j] = 2*a[i][j];

•  Blocking: operate on small blocks of matrices (a sketch follows below)

- Improves temporal locality
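A minimal C sketch of blocking, applied to the classic matrix multiply (N and the tile size B are illustrative choices; each BxB tile of z is reused for many rows of x while it stays cache-resident):

    #define N 512
    #define B 32   /* tile edge, sized so a tile fits in the cache */

    double x[N][N], y[N][N], z[N][N];   /* zero-initialized globals */

    /* Accumulates x += y*z, one BxB tile of z at a time. */
    void matmul_blocked(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = x[i][j];
                        for (int k = kk; k < kk + B; k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] = r;
                    }
    }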

Miss-Rate Reduction: Summary  

•  Larger blocks
•  Higher associativity
•  Victim cache
•  Pseudo-associativity
•  Hardware prefetching
•  Software-controlled prefetching
•  Code optimization by the compiler


 Technique-1: Prioritize Read Misses over Writes  

•  Write-through cache ==> write buffer

- Beware of consistency
- Example: store x, load y, load x - with x and y in the same block

•  Possible solution: wait for the write buffer to clear before processing any read miss
•  Better (but more complex) solution: check the write buffer, and process the read miss first
•  Write-back cache: write back the dirty block after processing the read miss

Technique-2: Sub-Block Placement  

•  Sub-block: units smaller than the full block

- Valid bits added to sub-blocks
- Only a sub-block is read on a cache miss

•  How is this different from just using a smaller block size?

- The tag length is reduced (good for an on-chip cache)

Technique-3: Restart CPU ASAP

•  Early restart: the CPU can proceed as soon as the requested word is loaded into the cache
•  Critical word first: the requested word is fetched first

- A.k.a. wrapped fetch, or requested word first

•  These are good for caches with large blocks
•  What if another access to the same block arrives before it is fully loaded?

- Stall if that portion of the block is not yet loaded

Technique-4: Non-blocking Cache  

•  For OOO CPUs (e.g. Tomasulo)

- No point in stalling the CPU on a miss
- Hit-under-miss allows hits while the cache is processing a miss
- Hit-under-multiple-miss can benefit more
- Miss-under-miss makes sense if main memory can handle more than one request in parallel

•  This significantly increases the complexity of the cache controller

Non-Blocking Cache Performance

  

  Technique-5: Second-Level Caches  

•  L1 cache can be small and fast
•  L2 cache can be larger, but still faster than main memory

Avg. mem. access time = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)

Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)

•  Local miss rate: misses w.r.t. memory accesses to this cache
•  Global miss rate: misses w.r.t. memory accesses by the CPU

Local and Global Miss Rates

  


Second Level Cache Design

•  L2 can be larger

- Big enough to virtually eliminate capacity misses

•  Higher associativity does not hurt

- CPU clock cycle time is not affected

•  Larger block size to further reduce misses
•  Multi-level inclusion property: L2 contains all data that L1 contains

- More work on a second-level miss

Small and Simple Caches

  •  Keep the cache small

- Faster
- Can fit inside the processor
- Trade-off: tags within the processor, data outside

•  Keep the cache simple

- Direct-mapped ==> tag comparison can be in parallel with data transmission

Other Techniques

•  Faster writes: pipeline writes

- Split the tag and data storage in the cache
- Pipeline stage-1: tag access and comparison
- Pipeline stage-2: write data

•  Dealing with virtual address --> physical address translation

- Avoid it (virtually addressed caches)
- In parallel with cache access (virtually indexed, physically tagged cache)

  Virtual Memory

•  Another level in the hierarchy
•  Uses of virtual memory:

- Level of indirection

•  No program overlays required
•  Easy relocation

- Sharing and protection

Just like Memory-->Cache in Functionality...

Memory --> Cache             VM --> Physical memory
Cache line                   Page or segment
Cache miss                   Page fault
Memory --> cache mapping     VA --> PA mapping (address translation)

  But Quite Different Quantitatively...

Parameter          Memory --> Cache   VM --> Ph. Memory
Hit time           1-2 cycles         40-100 cycles
Miss penalty       8-100 cycles       O(10ms-100ms)
Miss rate          0.5-10%            0.00001-0.001%
Block/page size    16-128 bytes       4-64KB
Upper level size   16KB-1MB           O(1GB)

•  Page faults handled in software

- Be very careful about what you discard
- There is lots of time anyway

•  VM size determined by the ISA
•  VM is not quite the hard-disk...


Paging versus Segmentation

Criterion                  Paging                                       Segmentation
Block size                 Uniform (4 to 64KB)                          Variable (max: 2^16-2^32 bytes, min: 1 byte)
Words per address          One                                          Two
Programmer visible?        No                                           Perhaps
Block replacement          Easy                                         Need to find contiguous memory
Memory use inefficiency    Internal fragmentation                       External fragmentation
Efficient disk traffic?    Usually yes (for an appropriate page size)   Not for small segments

  •  Other possibilities:

- Paged segments
- Choices for page size

The Four Memory Hierarchy Questions

  •  Where to place a block?

- Fully associative

•  How to find a block in main memory?

- Page table, or inverted table; cached in TLB


•  Which block to replace?

- LRU, with the help of a use/reference bit

•  What happens on write?

- Write-back, write-allocate

  Trade-Offs in Page-Size

•  Large page size good for:

- Smaller page tables
- Lesser TLB miss rate
- Efficient disk or network transfer
- Faster cache hits (how?)

•  Smaller page size good for:

- Efficient use of memory (lesser fragmentation)
- Faster process startup time

Fast Translation

  •  Translation Look-aside Buffer (TLB)

- Small table in hardware
- Fully associative
- Fields:

•  The translation, valid bit, use bit, dirty bit, protection bits

•  TLB access can be in critical path

- Pipeline the TLB access
- Overlap cache tag access with translation!

Overlapping Tag Access with Translation

Virtual address:   Page number | Page offset
Cache lookup:      Cache index | Block offset

•  Tag access through the index is independent of translation, as long as the index and block-offset bits fall within the page offset
•  The cache index is virtual, but the tags are physical
•  This potentially limits the cache size
•  Solutions possible:

- Higher associativity
- Page colouring (set associativity)
- Small guessing hardware
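- E.g. (illustrative numbers): with 8KB pages the page offset is 13 bits; a direct-mapped cache with 32-byte blocks and 256 sets uses 5 + 8 = 13 offset + index bits, so the whole cache index lies within the page offset and the tag lookup can start before translation finishes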

Alternate Strategy: Avoid Translation!  

•  Virtually addressed caches:

- Cache is accessed using the virtual address

•  Advantage: faster hit time

•  Disadvantages:

- Cache has to be flushed on a process switch
- What if two different VAs map to the same PA?

•  Synonyms/aliases

- I/O usually uses PA to access memory/cache

Dealing with Virtually Addressed Caches  

•  Avoiding the cache flush:

- Include a PID field in the cache tag

•  Anti-aliasing:

- Page colouring (set associativity)
- Create "enough" colours (sets) to ensure that cache size <= block size x number of sets
- Cache has to be direct-mapped

Main Memory  

•  DRAM versus SRAM

- DRAM is cheaper, but slower


•  Reducing the number of pins

- At the cost of some performance
- Address = RAS + CAS

•  Performance metrics: latency and bandwidth

- #cycles to send the address
- #cycles to access a word
- #cycles to send the data word

Main Memory Performance: One-Word Wide Memory

Suppose:

- #cycles to send the address = 4
- #cycles to access 1 word = 24
- #cycles to send a data word = 4
- Cache line = 4 words

What is the miss penalty?

4 x (4 + 24 + 4) = 128 cycles

Technique-1: Wider Memory

With a memory (and bus) two words wide, what is the miss penalty now?

2 x (4 + 24 + 4) = 64 cycles

Disadvantages?

•  Larger bus width (cost)
•  The unit of memory addition is larger
•  Read-modify-write for a single-byte write, if error correction is present

Technique-2: Interleaved-Memory
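Assuming the timing values from the one-word-wide example and 4 banks of word-interleaved memory, the four word accesses overlap and only the transfers are serialized, so the miss penalty becomes 4 + 24 + 4 x 4 = 44 cycles (a sketch of the standard calculation for this organization).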


Technique-3: Independent Memory Banks

•  Multiple independent accesses

- Separate address and data lines

•  Needed for the miss-under-miss scheme
•  Also, parallel I/O with the CPU
•  Each independent bank may itself be interleaved

- Super-bank number and bank number

  Memory-Bank Conflicts

•  Code can often be such that memory-bank conflicts occur

- No use of the independent memory-bank organization under such conflicts

•  Example:

int x[2][512];

for (j = 0; j < 512; j++) {
    for (i = 0; i < 2; i++) {
        x[i][j]++;
    }
}
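Why this code conflicts: x[0][j] and x[1][j] are 512 words apart, and assuming word-interleaved memory with a power-of-two number of banks (say 4 or 8), 512 mod #banks = 0, so the two accesses made by each inner-loop iteration land in the same bank every time.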

Technique-4: Avoiding Memory-Bank Conflicts

•  Software solutions:

- Loop interchange (works for this example)
- Expand the array size so that it is not a power of two

•  Hardware solution:

- Use a prime number of banks

Bank number = Addr mod #banks
Addr within bank = Addr / #banks

•  The division can be avoided:

Addr within bank = Addr mod #words-within-bank, if #words-within-bank and #banks are co-prime

Technique-5: DRAM-Specific Interleaving  

•  DRAM has RAS and CAS

- Usually RAS and CAS are given one after another
- The same RAS can be used to read multiple columns
- DRAMs come with separate signals to allow such access

Now, various remarks before finishing up with memory-hierarchy design

Virtual Memory and Protection

  •  OS requires support in terms of:

- Two modes (at least) of execution: user, supervisor/kernel
- Some CPU state which is readable but not writable in user mode

•  TLB
•  User/supervisor mode bit

- Mechanisms to switch between the modes

•  System calls

ILP and Caching

•  Superscalar execution:

- Cache must have enough ports to match the peak bandwidth
- Hit-under-miss, miss-under-miss required

•  Speculative execution:

- Suppress exceptions on speculative instructions
- Don't stall the cache on a speculative instruction's cache miss

  ILP vs. Caching: Compiler Choices

int x[32][512];                      int x[32][512];
for (j = 1; j < 512; j++) {          for (i = 0; i < 32; i++) {
    for (i = 0; i < 32; i++) {           for (j = 1; j < 512; j++) {
        x[i][j] = 2*x[i][j-1];               x[i][j] = 2*x[i][j-1];
    }                                    }
}                                    }

The left version has independent iterations in its inner loop (good for ILP) but poor spatial locality; the right version has good locality, but the x[i][j-1] dependence serializes its inner loop.

Caches and Consistency

•  I/O using caches?

- Interferes with the CPU, may throw out useful blocks

•  I/O using main memory:

- Write-through ==> no problem for CPU output
- What about input?

•  Approach-1: OS marks the memory block as non-cacheable
•  Approach-2: OS flushes the cache block after input
•  Approach-3: h/w checks if the block is present in the cache, invalidates it if cached (parallel set of tags for performance)

•  Multi-processors - want the same data in many caches: the cache-coherence problem

 Why Multiprocessors?  

•  Motivation: go beyond the performance offered by a single processor

- Without requiring specialized processors
- Without the complexity of too much multiple issue

•  Opportunity: software is available

- Parallel programs
- Multi-programmed machines

Multiprocessors: The SIMD Model

  •  SISD: Single Instruction stream, Single Data stream

- Uniprocessor
- This is the view at the ISA level
- Tomasulo uncovers data-stream parallelism

•  SIMD: Single Instruction stream, Multiple Data streams

- ISA makes data parallelism explicit
- Special SIMD instructions
- The same instruction goes to multiple functional units, but acts on different data

SIMD Drawbacks


•  SIMD is useful for loop-level parallelism
•  The model is too inflexible to accommodate parallel programs as well as multiprogrammed environments
•  Cannot take advantage of uniprocessor performance growth
•  SIMD architecture is usually used in special-purpose designs

- Signal or image processing

Multiprocessors: The MIMD Model

  •  MIMD: Multiple Instruction streams, Multiple Data streams

- Each processor fetches its own instructions and data

•  Advantages:

- Flexibility: parallel programs, or a multiprogrammed OS, or both
- Built using off-the-shelf uniprocessors

MIMD: The Centralized Shared-Memory Model

  

•  A single bus connects a shared memory to all processors
•  Also called a Uniform Memory Access (UMA) machine
•  Disadvantage: cannot scale very well, especially with fast processors (more memory bandwidth required)

MIMD: Physically Distributed Memory


•  Independent memory for each processor, with a high-bandwidth interconnection
•  Adv: cost-effective memory bandwidth scaling
•  Adv: lesser latency for local access
•  Disadv: communication of data between nodes

Communication Models with Physically Distributed Memory

  •  Distributed Shared Memory (DSM)

- Memory address space is the same across nodes
- Also called scalable shared memory
- Also called NUMA: non-uniform memory access
- Communication is implicit, via load/store

•  Multicomputer, or Message Passing Machine

- Separate private address spaces for each node
- Communication is explicit, through messages
- Synchronous, or asynchronous
- Std. Message Passing Interface (MPI) possible

  Multiprocessing: Classification


DSM vs. Message Passing  

Shared Memory:

- Well-understood mechanisms for programming
- Program is independent of the communication pattern
- Low overhead for communicating small items
- Hardware-controlled caching

Message Passing:

- Hardware simplicity
- Communication is explicit - forces the programmer to pay attention to what is expensive

Achieving the Desired Communication Model

•  Message Passing on top of Shared Memory:

- Considerably easier
- Difficulty arises in dealing with arbitrary message lengths

•  Shared Memory on top of Message Passing:

- Harder, since every load/store has to be faked
- Every memory reference may involve the OS
- One promising direction: use of VM to share objects at page level: shared VM

Challenges in Parallel Processing  

•  Limited parallelism available in programs

- 90% parallelizable ==> max speedup possible?
- Exception: super-linear speedup

•  Increased memory/cache available
•  Usually not very great, however

•  Large latency of communication

- 50-10000 clock cycles
- 0.5% of instructions access remote memory ==> what is the increase in CPI?
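- E.g., by Amdahl's law, 90% parallelizable ==> speedup <= 1 / (0.1 + 0.9/p), which stays below 10 for any number of processors p
- And assuming an illustrative remote-access latency of 200 cycles (from the 50-10000 range above), 0.5% remote accesses add 0.005 x 200 = 1.0 to the CPI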

Addressing the Challenges

•  Limited parallelism

- Tackled mainly by redesigning the algorithm or software

•  Avoiding large latency


- Hardware mechanism: caching
- Software mechanism: restructure to make more accesses local

Some Example Applications

  •  Two classes

- Parallel programs or program kernels
- Multi-programmed OS

•  Spatial and temporal data access patterns are important
•  Computation-to-communication ratio is important

Parallel Application Kernels

•  The FFT kernel

- Used in spectral methods
- Data represented as an array
- Computation involves:

•  1D FFT on each row
•  Transpose
•  1D FFT on each row again

- Each processor gets a few rows of data
- The main communication step is the transpose (all-to-all communication)

•  The LU kernel

- LU factorization of a matrix
- Blocking is used
- Computation (dense matrix multiply) is performed by the processor which owns the destination block
- Communication happens at regular intervals

  Parallel Applications

  •  Barnes application

- N-body problem
- Octree representation
- Each processor is allocated a subtree
- Tree expansion as required (communication in this process)

•  Ocean application

- Influence of eddy and boundary currents on ocean flows
- Involves solving PDEs
- Ocean divided into a hierarchy of grids (finer grid for more accuracy)
- Each processor gets a set of grids
- Communication to exchange boundary conditions, at each step of the process

Computation to Communication Ratios  

Application   Computation scaling   Communication scaling   Computation-to-communication
FFT           (n log n)/p           n/p                     log n
LU            n/p                   sqrt(n/p)               sqrt(n/p)
Barnes        (n log n)/p           (log n) x sqrt(n/p)     sqrt(n/p)
Ocean         n/p                   sqrt(n/p)               sqrt(n/p)

  Multiprogrammed OS workload

  •  Workload used here is:

- Two independent copies of the compilation of the Andrew benchmark
- Three steps:

•  Compilation: compute-intensive
•  Installing object files in a library: I/O-intensive
•  Removing the object files: I/O-intensive

Cache Coherence  

•  In what kind of multi-processors do we need cache coherence?

•  What are the kinds of data which are cached?

- Shared (read) data - replication
- Private data - migration

  Notions of Coherence and Consistency

•  Coherence:


- Program order preservation within a processor
- A write by P1 and a read by P2, if "sufficiently separated", should return the value of the write
- Write serialization: the same order of writes is seen by all processors

•  Specifying when a read should get the value of a write: memory consistency model

  Styles of Coherence Protocols  

•  Directory-based: central directory maintains the "status" of each block

•  Snooping-based:

- In a centralized shared-memory machine
- Each processor snoops on the common bus
- Also maintains the "status" of a block locally (no central directory)
- Snooping helps maintain coherence

Styles of Snooping Protocols

•  Write-invalidate: processor makes sure that it has the only copy of a block before writing

- Invalidates other copies by sending an invalidate command on the bus

•  Write-update or write-broadcast:

processor updates all copies of a block when it writes

- Send the written data on the common bus

Write-invalidate vs. Write-update

Write-invalidate:

- Consecutive writes to a location do not cause repeated traffic on the bus
- Writes to consecutive locations (in the same block) do not cause extra traffic on the bus

Write-update:

- Writes become visible to readers with lesser latency

Snooping-Based Protocols

•  Applicable for write-through as well as write-back caches
•  Optimizations:

- Shared/Exclusive bit
- Write-miss and invalidate messages
- Shared-but-single versus truly shared blocks
- Maintaining separate tags


•  Dirty bit required?

Towards Directory-Based Protocols  

•  What properties of the multi-processor does a snooping-based protocol use?

- A broadcast-based bus, all-to-all communication

•  What if this is not possible?

- Communication goes through a directory
- The directory is logically shared across processors

•  Physically centralized or distributed

- Directory entry per memory block
- Possible states: uncached, shared, exclusive

•  Use bit-vectors for storing these

Synchronization  

•  Required since communication is through shared memory
•  Synchronization primitives:

- Involve an atomic read-and-write of a memory location
- Atomic exchange (with a register)
- Test-and-set
- Fetch-and-increment

Synchronization and Coherence

•  Atomic read-and-write causes problems with coherence

- Additional complexity

•  Solution: push the complexity to software!

- A pair of instructions
- Hardware support to tell if the two were executed atomically
- Load-linked and store-conditional
- The store fails if there is any intervening process switch, or coherence control operation

Load-Linked/Store-Conditional  


•  Can implement atomic exchange, fetch-and-increment
•  Implementation issues:

- Use a link register to store the address in the previous load-linked instruction
- A process switch, or a coherence control operation, will clear the link register
- The store-conditional succeeds iff the address in it matches that in the link register

•  Beware of what (and how many) instructions go between the pair

Using Atomic Exchange for Spin-Locks

•  Processor spins until it gets access to the lock
•  Useful to test if the lock is already held, before trying to lock (see the sketch below)
•  Even then, performance problems arise when multiple processors are trying to grab the lock

- Read/write misses generated by all processors
- Misses satisfied sequentially
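A test-and-test-and-set spin lock sketched with C11 atomics (the names are illustrative assumptions; spinning on a plain read first keeps waiting processors hitting in their own caches, addressing the miss traffic noted above):

    #include <stdatomic.h>

    atomic_int lock = 0;    /* 0 = free, 1 = held */

    void acquire(void) {
        for (;;) {
            while (atomic_load(&lock) != 0)
                ;                                /* spin locally, read-only */
            if (atomic_exchange(&lock, 1) == 0)  /* atomic exchange grabs it */
                return;
        }
    }

    void release(void) {
        atomic_store(&lock, 0);
    }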

Barrier Locks  

•  Barrier is a synchronization primitive

- Can be used in programs
- Forces all processors to wait until the last one reaches the barrier

•  Can be implemented with two spin-locks

- One to increment a counter
- One to hold the processors until the barrier

•  Can cause deadlock!

- Use a count-down, or sense-reversing, barrier (a sketch follows below)
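A minimal C11 sketch of a sense-reversing barrier (P and all variable names are assumptions; the flipped sense is what lets the barrier be reused without the deadlock of a naive counter barrier):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P 8                          /* number of processors (assumed) */

    atomic_int count = P;
    atomic_bool sense = false;
    _Thread_local bool local_sense = false;

    void barrier(void) {
        local_sense = !local_sense;             /* flip my expected sense */
        if (atomic_fetch_sub(&count, 1) == 1) { /* last one to arrive */
            atomic_store(&count, P);            /* reset for the next episode */
            atomic_store(&sense, local_sense);  /* release everyone */
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                               /* spin until released */
        }
    }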

Performance Optimizations  

•  Exponential back-off
•  Queuing locks

Sequential Consistency  

•  Sequential consistency: result of execution same as if:

- Accesses executed by a processor are in order
- Accesses among different processors are interleaved


•  That is, there exists some interleaving which will lead to the same result on a uniprocessor

- Or, a multi-processor with no caches, and no write-buffers, and only a single centralized memory

Implementing Sequential Consistency  

•  Need to guarantee that a write/read completes before any other access (by the same processor)
•  Write completes == all invalidations have reached
•  This implies that write buffers cannot be used (writes cannot be delayed in general)

Synchronized Programs  

•  Programs which protect access to shared locations through synchronization operations
•  More formally:

- In every possible execution, for every shared datum,
- a write by one processor, and an access (read/write) by another processor,
- are separated by a synchronization operation

•  That is, the program is data-race-free
•  Observation: most programs are synchronized

Sequential Consistency and Synchronized Programs  

•  Sequential consistency guarantees uniprocessor-like behaviour for any program

- True for synchronized programs too

•  But sequential consistency is not necessary for uniprocessor-like behaviour of synchronized programs

•  Define looser consistency models

- Can be implemented more efficiently than sequential consistency

Memory Access Orderings

  •  Four possibilities:

- R --> R, R --> W, W --> W, W --> R

•  Sequential consistency guarantees all four orderings are preserved (in each processor)
•  Define synchronization operation S


- Synchronization acquire: Sa, release: Sr

•  We only need to preserve:

- W --> Sr, R --> Sr
- Sa --> W, Sa --> R
- S --> S

Relaxed Consistency Models  

•  Total Store Order (TSO), or Processor Consistency: relax W-->R
•  Partial Store Order: relax W-->W also
•  Weak Ordering: relax R-->R, R-->W also
•  Release consistency: relax Sr-->W, Sr-->R, W-->Sa, R-->Sa also

Interconnection Networks  

•  Networks at three levels:

- Massively Parallel Processor (MPP) network: within about 25m max
- LAN: within about a few km max
- WAN: larger

•  Latency is higher in a WAN
•  The cost of redundancy is higher in a WAN

Switching versus Routing  

•  Switching: set up switches between source and destination
•  Routing: treat each packet individually
•  Switching:

- Switches separate from processors
- Switches associated with processors

•  Wormhole routing and cut-through routing are other possibilities

MPP Network Topology Design  

•  Design criteria:

- Minimum cost, bisection bandwidth, link/node fault tolerance

•  Topologies with switches separate from nodes: cross-bar, omega network
•  Topologies with switches as part of nodes: ring, 2-D torus, n-D hypercube


IBM's Blue Gene...

Input/Output  

•  We will mostly talk about storage systems
•  I/O performance is important!
•  Magnetic disks

- Platter, head, track, sector, cylinder; seek time, rotational delay, transfer time

•  Disk controller, controller delay
•  Queuing delay

Storage Technologies  

•  Magnetic disks, and the access-time gap
•  Solid-state disks, expanded storage using DRAMs
•  Optical storage (read-only)
•  Magnetic tapes

- Same technology as disks
- Difference in geometry ==> cheaper but slower

Buses for Communication  

•  Between CPU/Memory, and with I/O devices
•  Advantages:

- Low cost
- Flexible/versatile

•  Disadvantage:

- Communication bottleneck

•  Bandwidth is limited due to bus length and the number of devices

  Bus Design Choices  

•  CPU/Memory buses vs. I/O buses
•  Design choices in general:

- Bus width
- Data width
- Transfer size
- Number of masters
- Split transaction
- Synchronous vs. asynchronous

Other Design Choices

•  Connecting I/O to memory or cache
•  Memory-mapped vs. dedicated I/O instructions
•  Polling vs. interrupt-driven
•  Direct Memory Access (DMA)

- I/O processors for more intelligence

I/O Performance

•  Producer-Server Model
•  Throughput vs. Response Time
•  Response time and think time
•  Queuing theory

- Arrival rate, service time, utilization
- Little's law
- Squared coefficient of variance
- Average residual time
- Response time and utilization
- M/G/1 and M/M/1 models
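Two of these results, stated for reference (standard queuing-theory formulas, not derived in the slides):

    Little's law:  \bar{N} = \lambda \cdot \bar{T}   (mean number in system = arrival rate x mean time in system)

    M/M/1:  \text{Response time} = \frac{\text{Service time}}{1 - \text{Utilization}}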


UNIX's Old File System  

•  Superblock
•  Free-list
•  Directory: a special file - has a pointer to the file's inode
•  Inodes have:

- Direct pointers, singly indirect pointers, doubly indirect, and triply indirect pointers

•  Problem: a file's blocks get distributed all over the disk, deteriorating performance
•  Also, block size: 512 bytes (poor performance)

UNIX Fast File System (FFS)  

•  Cylinder groups are defined
•  Inodes are close to data blocks
•  Block size: 4096 bytes

- But, poor disk usage (close to 50% wasted)

•  Idea: fragment blocks

- But only the last block of a file is allowed to be fragmented

•  All files of a directory are preferably in the same cylinder group
•  Other enhancements: long file names, file locking, symbolic links, rename, quota

Log-Structured File System  

•  Technological under-pinnings:

- Disk I/O is becoming the bottleneck since CPUs are getting faster
- Disk I/O is dominated by writes, since reads are mostly served by main-memory caching

•  Characteristics of application workloads:

- Lots of accesses to small files
- Random disk I/Os
- Synchronous meta-data updates in FFS ==> slow
- FFS could use only about 5% of the disk bandwidth

The Log as the Structure  

•  Large asynchronous writes (0.5-1MB) to the end of the log
•  How to retrieve information from the log?

- Sequential search would be too slow

•  The i-node structure is the same as in FFS
•  Getting to an i-node given the i-node number uses the i-node map (a level of indirection)
•  The i-node map is small enough to be kept in memory

Free Space Management  


•  What if log fills up disk?

- Threading vs. copying

•  Intermediate solution: segments

- Thread across segments
- Copy within segments

•  Segment cleaning: copy live data out of segments, to create free segments

- Segments with long-lived data ==> can be ignored while cleaning

Segment Cleaning  

•  Read a set of segments
•  Copy live data to new segments, creating free segments
•  Need to identify:

- Which blocks are live
- Which block belongs to which file
- Segment summary information
- Notion of file/inode version

Segment Cleaning Policies  

•  When should the cleaning be done?

- Periodically; after threshold disk utilization

•  How many segments to clean at a time?

- Fixed; until achieving some number of clean segments

•  Which segments to clean?

- Most fragmented; having the least utilization

•  How should the blocks be grouped when writing out?

- All files in a dir in one place; age sort  

Crash Recovery

•  Checkpoint


- Checkpoint region is fixed!

•  What to checkpoint?

- I-node map blocks, segment usage table, pointer to the last segment written

•  Roll-forward

- Read from the last segment onwards
- Update the i-node map, segment usage table
- Directory operation log, for consistency between directory entries and i-nodes

  RAID  

•  RAID-1: Mirroring
•  RAID-2: Hamming-code ECC
•  RAID-3: Bit-level parity
•  RAID-4: Block-level parity
•  RAID-5: Block-level distributed parity

Why Vector Processing  

•  Deep pipeline ==> more parallelism

- But more dependences
- Need to fetch and issue many instructions (Flynn bottleneck)

•  Same issues with multiple-issue processors
•  Operations on vectors:

- No data dependences
- No control hazards
- Single instn. ==> instn. bandwidth reduced
- Well-defined memory access pattern

Basic Architecture  

•  Vector-register processors vs. memory-memory vector processors
•  DLXV: vector extn. of DLX (vector-register)
•  Components:

- Vector registers (V0..V7), 64-element
- Vector functional units:

•  ADD/SUB, MUL, DIV, Integer, Logical
•  Each is pipelined, and can start a new opn. every cycle

- Vector load/store unit: also pipelined
- Scalar registers and scalar unit (like in DLX)

Some Vector Instructions  

•  ADDV V1, V2, V3
•  ADDSV V1, F0, V2
•  SUBV V1, V2, V3
•  SUBVS V1, V2, F0
•  SUBSV V1, F0, V2
•  Similar for MUL and DIV
•  LV V1, R1
•  SV R1, V1

SAXPY/DAXPY Loop  

•  Y = aX + Y (caps ==> vector)

Scalar DLX code:                 Vector DLXV code:

      LD    F0, a                      LD     F0, a
      ADDI  R4, Rx, 512                LV     V1, Rx
Loop: LD    F2, 0(Rx)                  MULTSV V2, F0, V1
      MULTD F2, F0, F2                 LV     V3, Ry
      LD    F4, 0(Ry)                  ADDV   V4, V2, V3
      ADDD  F4, F2, F4                 SV     Ry, V4
      SD    0(Ry), F4
      ADDI  Rx, Rx, 8                  Reduction in instn. bandwidth
      ADDI  Ry, Ry, 8                  Lesser pipeline interlocks
      SUB   R20, R4, Rx
      BNEZ  R20, Loop

Estimating Execution Time  

•  Convoy: a set of vector instructions which can begin execution in the same cycle

- Check for structural, data hazards

•  For simplicity: a convoy must complete before initiating the next convoy
•  Chime: the time taken to execute one vector opn.
•  Approximations:

- Only one instn. can be initiated per cycle
- Pipeline setup latency

Adding Flexibility  


•  Vector-length register (VLR), maximum vector length (MVL)

- MOVI2S VLR, R1
- MOVS2I R1, VLR

•  Vector longer than MVL ==> use strip-mining (a sketch follows below)
•  Vector stride:

- LVWS V1, (R1, R2)
- SVWS (R1, R2), V1

•  Memory-bank conflicts?
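A C sketch of strip-mining for the SAXPY loop (MVL = 64 as above; the first strip handles the n mod MVL remainder, and each inner loop stands for one vector operation executed at the current VLR setting):

    #define MVL 64   /* maximum vector length of the machine */

    void saxpy(long n, double a, double x[], double y[]) {
        long low = 0;
        long vl = n % MVL;                        /* odd-sized first strip */
        for (long j = 0; j <= n / MVL; j++) {
            for (long i = low; i < low + vl; i++) /* one vector instruction's worth */
                y[i] = a * x[i] + y[i];
            low += vl;
            vl = MVL;                             /* all later strips are full length */
        }
    }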

Enhancing Vector Performance

•  Chaining: data forwarding
•  Conditional execution:

- Vector Mask Register
- Some related instructions:

•  SNEV V1, V2
•  SGTSV F0, V1
•  CVM

•  Sparse matrices: scatter-gather

- LVI V1, (R1+V2)
- SVI (R1+V2), V1

 Key Take-Away Ideas

•  Quantitative approach to design
•  Amdahl's law
•  Design to match the technology trend
•  Interface design
•  Pipelining; non-uniformity is bad
•  Golden rule: preserve the programmer's view
•  Complexity in hardware vs. software
•  Caching: common across computer systems
•  Caching + VM: the notion of infinite resources
•  Faster reads, postpone writes
•  Where to place what mapping?
•  Multiprocessing: affecting the programmer's view
•  Consistency models
•  CAP principle


Lessons I Have Learnt  

•  Lessons on teaching:

- Teaching is quite different from learning
- Good to teach outside one's topic of research

•  On tools for teaching:

- OpenOffice rocks!
- Teaching on the board is still better for some topics

•  On student evaluations:

- Travel time is good for setting papers
- Group assignments are better

 

 

 

 

 
