52
Lecture 1: Introduction Dr. Eng. Amr T. Abdel-Hamid Elect 707 Winter 2014 Computer Architecture Text book slides: Computer Architec ture: A Quantitative Approach 5 th E dition, John L. Hennessy & David A . Patterso with modifications.

Lecture 1: Introduction Microcomputer... · ISA vs. Computer Architecture Old definition of computer architecture = instruction set design Other aspects of computer design called

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

  • Lecture 1:

    Introduction

    Dr. Eng. Amr T. Abdel-Hamid

    Elect 707

    Winter 2014

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re

    Text book slides: Computer Architecture: A Quantitative Approach 5th Edition, John L. Hennessy & David A. Patterso with modifications.

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    CPU History in a Flash

    Intel 4004 (1971): 4-bit processor,

    2312 transistors, 0.4 MHz,

    10 micron PMOS, 11 mm2 chip

    • Processor is the new transistor?

    • RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip

    • 125 mm2 chip, 0.065 micron CMOS = 2312 RISC II+FPU+Icache+Dcache

    – RISC II shrinks to ~ 0.02 mm2 at 65 nm

    – Caches via DRAM or 1 transistor SRAM

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Processor Performance In

    troductio

    n

    RISC

    Move to multi-processor

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Snapdragon 805 processor specs

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Classes of Computers

    Personal Mobile Device (PMD) e.g. start phones, tablet computers

    Emphasis on energy efficiency and real-time

    Desktop Computing Emphasis on price-performance

    Servers Emphasis on availability, scalability, throughput

    Clusters / Warehouse Scale Computers Used for “Software as a Service (SaaS)”

    Emphasis on availability and price-performance

    Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks

    Embedded Computers Emphasis: price

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Instruction Set Architecture: Critical Interface

    instruction set

    software

    hardware

    Properties of a good abstraction Lasts through many generations (portability)

    Used in many different ways (generality)

    Provides convenient functionality to higher levels

    Permits an efficient implementation at lower levels

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    ISA vs. Computer Architecture

    Old definition of computer architecture

    = instruction set design

    Other aspects of computer design called implementation

    Insinuates implementation is uninteresting or less challenging

    Our view is: computer architecture >> ISA

    Architect’s job much more than instruction set design; techni

    cal hurdles today more challenging than those in instruction

    set design

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Computer Architecture is Design and Analysis

    Design

    Analysis

    Architecture is an iterative process: • Searching the space of possible designs • At all levels of computer systems

    Creativity

    Good Ideas

    Mediocre Ideas Bad Ideas

    Cost / Performance Analysis

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Administrivia

    Instructor: Dr. Amr T. Abdel-Hamid

    Office: C3-320

    Email: [email protected]

    Office Hours: Monday, 3rd & 4th Lectures.

    T. A: ???

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Administrivia

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Course Grading

    Exams

    Quizzes 3 Quizzes: best 2 10 %

    Mid Term 30 %

    Final exam 40 %

    Project 20%

    Assignments Not Graded

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Project

    Phase 0: Select your partner (17/9/2014)

    Submit list of your group members (2-4 per group)

    Submit a Comparision between Risc and Cisc Proc.

    Phase 1:

    .

    .

    .

    .

    Phase N: Project Implementation + Report (2 weeks before fin

    als)

    FINAL Non-Negotiable deadline

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    In time & It is too LATE Policy

    In phases 0, & 1:

    5% of project grade penalty per day for being late

    In phase 2, to n:

    No late presentation is possible.

    Honor code

    100% penalty for both copier and copy-giver of

    Any Report/CODE.

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    1. Take Advantage of Parallelism

    2. Principle of Locality

    3. Focus on the Common Case

    4. Amdahl’s Law

    5. The Processor Performance Equation

    Quantitative Principles of Design

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    1) Taking Advantage of Parallelism

    Increasing throughput of server computer via multiple processors or multi

    ple disks

    Detailed HW design (DSD course shortly)

    Carry lookahead adders uses parallelism to speed up computing sum

    s from linear to logarithmic in number of bits per operand

    Multiple memory banks searched in parallel in set-associative caches

    Pipelining: overlap instruction execution to reduce the total time to compl

    ete an instruction sequence.

    Not every instruction depends on immediate predecessor executin

    g instructions completely/partially in parallel possible

    Classic 5-stage pipeline:

    1) Instruction Fetch (Ifetch),

    2) Register Read (Reg),

    3) Execute (ALU),

    4) Data Memory Access (Dmem),

    5) Register Write (Reg)

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    2) The Principle of Locality The Principle of Locality:

    Program access a relatively small portion of the address space at any instant of time.

    Two Different Types of Locality: Temporal Locality (Locality in Time): If an item is referenced, it will tend to be r

    eferenced again soon (e.g., loops, reuse)

    Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

    Last 30 years, HW relied on locality for memory perf.

    P MEM $

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Levels of the Memory Hierarchy

    CPU Registers 100s Bytes 300 – 500 ps (0.3-0.5 ns)

    L1 and L2 Cache 10s-100s K Bytes ~1 ns - ~10 ns $1000s/ GByte

    Main Memory G Bytes 80ns- 200ns ~ $100/ GByte

    Disk 10s T Bytes, 10 ms (10,000,000 ns) ~ $1 / GByte

    Capacity Access Time Cost

    Tape infinite sec-min ~$1 / GByte

    Registers

    L1 Cache

    Memory

    Disk

    Tape

    Instr. Operands

    Blocks

    Pages

    Files

    Staging Xfer Unit

    prog./compiler 1-8 bytes

    cache cntl 32-64 bytes

    OS 4K-8K bytes

    user/operator Mbytes

    Upper Level

    Lower Level

    faster

    Larger

    L2 Cache cache cntl 64-128 bytes Blocks

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    3) Focus on the Common Case

    Common sense guides computer design Since its engineering, common sense is valuable

    In making a design trade-off, favor the frequent case over the infrequent case E.g., Instruction fetch and decode unit used more frequently th

    an multiplier, so optimize it 1st

    E.g., If database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it 1st

    Frequent case is often simpler and can be done faster than the infrequent case E.g., overflow is rare when adding 2 numbers, so improve perf

    ormance by optimizing more common case of no overflow

    May slow down overflow, but overall performance improved by optimizing for the normal case

    What is frequent case and how much performance improved by making case faster => Amdahl’s Law

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    4) Amdahl’s Law

    enhanced

    enhancedenhanced

    new

    oldoverall

    Speedup

    Fraction Fraction

    1

    ExTime

    ExTime Speedup

    1

    Best you could ever hope to do:

    enhancedmaximum Fraction - 1

    1 Speedup

    enhanced

    enhancedenhancedoldnew Speedup

    FractionFraction ExTime ExTime 1

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Amdahl’s Law example

    New CPU 10X faster

    I/O bound server, so 60% time waiting for I/O

    56.1

    64.0

    1

    10

    0.4 0.4 1

    1

    Speedup

    Fraction Fraction 1

    1 Speedup

    enhanced

    enhancedenhanced

    overall

    • Apparently, its human nature to be attracted by 10X faster, vs. keeping in perspective its just 1.6X faster

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    5) Processor performance equation

    CPU time = Seconds = Instructions x Cycles x Seconds

    Program Program Instruction Cycle

    Inst Count CPI Clock Rate

    Program X Compiler X (X) Inst. Set. X X Organization X X Technology X

    inst count CPI Cycle time

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    5 Steps of MIPS Datapath Figure A.2, Page A-8

    Memory Access

    Write Back

    Instruction Fetch

    Instr. Decode Reg. Fetch

    Execute Addr. Calc

    L M D

    ALU

    MU

    X

    Mem

    ory

    Reg F

    ile

    MU

    X MU

    X

    Data

    Mem

    ory

    MU

    X

    Sign Extend

    4

    Adder Zero? Next SEQ PC

    Address

    Next PC

    WB Data

    Inst

    RD

    RS1

    RS2

    Imm

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    5 Steps of MIPS Datapath Figure A.3, Page A-9

    Memory Access

    Write Back

    Instruction Fetch

    Instr. Decode Reg. Fetch

    Execute Addr. Calc

    ALU

    Mem

    ory

    Reg F

    ile

    MU

    X MU

    X

    Data

    Mem

    ory

    MU

    X

    Sign Extend

    Zero?

    IF/ID

    ID/E

    X

    ME

    M/W

    B

    EX

    /ME

    M

    4

    Adder

    Next SEQ PC Next SEQ PC

    RD RD RD WB

    Dat

    a

    Next PC

    Address

    RS1

    RS2

    Imm

    MU

    X

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    5 Steps of MIPS Datapath Figure A.3, Page A-9

    Memory Access

    Write Back

    Instruction Fetch

    Instr. Decode Reg. Fetch

    Execute Addr. Calc

    ALU

    Mem

    ory

    Reg F

    ile

    MU

    X MU

    X

    Data

    Mem

    ory

    MU

    X

    Sign Extend

    Zero?

    IF/ID

    ID/E

    X

    ME

    M/W

    B

    EX

    /ME

    M

    4

    Adder

    Next SEQ PC Next SEQ PC

    RD RD RD WB

    Dat

    a

    • Data stationary control – local decode for each instruction phase / pipeline stage

    Next PC

    Address

    RS1

    RS2

    Imm

    MU

    X

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Visualizing Pipelining Figure A.2, Page A-8

    I n s t r.

    O r d e r

    Time (clock cycles)

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Pipelining is not quite that easy!

    Limits to pipelining: Hazards prevent next instruction from exec

    uting during its designated clock cycle

    Structural hazards: HW cannot support this combination of instruc

    tions (single person to fold and put clothes away)

    Data hazards: Instruction depends on result of prior instruction stil

    l in the pipeline (missing sock)

    Control hazards: Caused by delay between the fetching of instruct

    ions and decisions about changes in control flow (branches and ju

    mps).

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    One Memory Port/Structural Hazards Figure A.4, Page A-14

    I n s t r.

    O r d e r

    Time (clock cycles)

    Load

    Instr 1

    Instr 2

    Instr 3

    Instr 4

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5

    Reg

    ALU

    DMem Ifetch Reg

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    One Memory Port/Structural Hazards (Similar to Figure A.5, Page A-15)

    I n s t r.

    O r d e r

    Time (clock cycles)

    Load

    Instr 1

    Instr 2

    Stall

    Instr 3

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5

    Reg

    ALU

    DMem Ifetch Reg

    Bubble Bubble Bubble Bubble Bubble

    How do you “bubble” the pipe?

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Speed Up Equation for Pipelining

    pipelined

    dunpipeline

    TimeCycle

    TimeCycle

    CPI stall Pipeline CPI Ideal

    depth Pipeline CPI Ideal Speedup

    pipelined

    dunpipeline

    TimeCycle

    TimeCycle

    CPI stall Pipeline 1

    depth Pipeline Speedup

    Instper cycles Stall Average CPI Ideal CPIpipelined

    For simple RISC pipeline, CPI = 1:

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Example: Dual-port vs. Single-port

    Machine A: Dual ported memory (“Harvard Architecture”)

    Machine B: Single ported memory, but its pipelined implementatio

    n has a 1.05 times faster clock rate

    Ideal CPI = 1 for both

    Loads are 40% of instructions executed

    SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)

    = Pipeline Depth

    SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05)

    = (Pipeline Depth/1.4) x 1.05

    = 0.75 x Pipeline Depth

    SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33

    Machine A is 1.33 times faster

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    I n s t r.

    O r d e r

    add r1,r2,r3

    sub r4,r1,r3

    and r6,r1,r7

    or r8,r1,r9

    xor r10,r1,r11

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Data Hazard on R1 Figure A.6, Page A-17

    Time (clock cycles)

    IF ID/RF EX MEM WB

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Read After Write (RAW)

    InstrJ tries to read operand before InstrI writes it

    Caused by a “Dependence” (in compiler nomenclature). This ha

    zard results from an actual need for communication.

    Three Generic Data Hazards

    I: add r1,r2,r3

    J: sub r4,r1,r3

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Write After Read (WAR) InstrJ writes operand before InstrI reads it

    Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

    Can’t happen in MIPS 5 stage pipeline because:

    All instructions take 5 stages, and

    Reads are always in stage 2, and

    Writes are always in stage 5

    I: sub r4,r1,r3

    J: add r1,r2,r3

    K: mul r6,r1,r7

    Three Generic Data Hazards

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Three Generic Data Hazards

    Write After Write (WAW)

    InstrJ writes operand before InstrI writes it.

    Called an “output dependence” by compiler writers

    This also results from the reuse of name “r1”.

    Can’t happen in MIPS 5 stage pipeline because:

    All instructions take 5 stages, and

    Writes are always in stage 5

    Will see WAR and WAW in more complicated pipes

    I: sub r1,r4,r3

    J: add r1,r2,r3

    K: mul r6,r1,r7

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Time (clock cycles)

    Forwarding to Avoid Data Hazard Figure A.7, Page A-19

    I n s t r.

    O r d e r

    add r1,r2,r3

    sub r4,r1,r3

    and r6,r1,r7

    or r8,r1,r9

    xor r10,r1,r11

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    HW Change for Forwarding Figure A.23, Page A-37

    ME

    M/W

    R

    ID/E

    X

    EX

    /ME

    M

    Data Memory

    ALU

    mux

    m

    ux

    Registers

    NextPC

    Immediate

    mux

    What circuit detects and resolves this hazard?

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    38

    Time (clock cycles)

    Forwarding to Avoid LW-SW Data Hazard Figure A.8, Page A-20

    I n s t r.

    O r d e r

    add r1,r2,r3

    lw r4, 0(r1)

    sw r4,12(r1)

    or r8,r6,r9

    xor r10,r9,r11

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Time (clock cycles)

    I n s t r.

    O r d e r

    lw r1, 0(r2)

    sub r4,r1,r6

    and r6,r1,r7

    or r8,r1,r9

    Data Hazard Even with Forwarding Figure A.9, Page A-21

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Data Hazard Even with Forwarding (Similar to Figure A.10, Page A-21)

    Time (clock cycles)

    or r8,r1,r9

    I n s t r.

    O r d e r

    lw r1, 0(r2)

    sub r4,r1,r6

    and r6,r1,r7

    Reg

    ALU

    DMem Ifetch Reg

    Reg Ifetch

    ALU

    DMem Reg Bubble

    Ifetch

    ALU

    DMem Reg Bubble Reg

    Ifetch

    ALU

    DMem Bubble Reg

    How is this detected?

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Control Hazard on Branches

    Three Stage Stall

    10: beq r1,r3,36

    14: and r2,r3,r5

    18: or r6,r1,r7

    22: add r8,r1,r9

    36: xor r10,r1,r11

    Reg ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    Reg

    ALU

    DMem Ifetch Reg

    What do you do with the 3 instructions in between? How do you do it?

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Branch Stall Impact

    If CPI = 1, 30% branch,

    Stall 3 cycles => new CPI = 1.9!

    Two part solution:

    Determine branch taken or not sooner, AND

    Compute taken branch address earlier

    MIPS branch tests if register = 0 or 0

    MIPS Solution:

    Move Zero test to ID/RF stage

    Adder to calculate new PC in ID/RF stage

    1 clock cycle penalty for branch versus 3

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Adder

    IF/ID

    Pipelined MIPS Datapath Figure A.24, page A-38

    Memory Access

    Write Back

    Instruction Fetch

    Instr. Decode Reg. Fetch

    Execute Addr. Calc

    ALU

    Mem

    ory

    Reg F

    ile

    MU

    X

    Data

    Mem

    ory

    MU

    X

    Sign Extend

    Zero?

    ME

    M/W

    B

    EX

    /ME

    M

    4

    Adder

    Next SEQ PC

    RD RD RD WB

    Dat

    a

    • Interplay of instruction set design and cycle time.

    Next PC

    Address

    RS1

    RS2

    Imm M

    UX

    ID/E

    X

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Current Trends in Architecture

    Cannot continue to leverage Instruction-Level parallelism (ILP) Single processor performance improvement ended in

    2003

    New models for performance: Data-level parallelism (DLP)

    Thread-level parallelism (TLP)

    Request-level parallelism (RLP)

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Parallelism

    Classes of parallelism in applications: Data-Level Parallelism (DLP)

    Task-Level Parallelism (TLP)

    Classes of architectural parallelism: Instruction-Level Parallelism (ILP)

    Vector architectures/Graphic Processor Units (GPUs)

    Thread-Level Parallelism

    Request-Level Parallelism

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Trends in Technology

    Integrated circuit technology Transistor density: 35%/year

    Die size: 10-20%/year

    Integration overall: 40-55%/year

    DRAM capacity: 25-40%/year (slowing)

    Flash capacity: 50-60%/year 15-20X cheaper/bit than DRAM

    Magnetic disk technology: 40%/year 15-25X cheaper/bit then Flash

    300-500X cheaper/bit than DRAM

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Bandwidth and Latency

    Bandwidth or throughput Total work done in a given time

    10,000-25,000X improvement for processors

    300-1200X improvement for memory and disks

    Latency or response time Time between start and completion of an event

    30-80X improvement for processors

    6-8X improvement for memory and disks

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Power and Energy

    Problem: Get power in, get power out

    Thermal Design Power (TDP) Characterizes sustained power consumption

    Used as target for power supply and cooling system

    Lower than peak power, higher than average power consumption

    Clock rate can be reduced dynamically to limit power consumption

    Energy per task is often a better measurement

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Dynamic Energy and Power

    Dynamic energy Transistor switch from 0 -> 1 or 1 -> 0

    ½ x Capacitive load x Voltage2

    Dynamic power ½ x Capacitive load x Voltage2 x Frequency switched

    Reducing clock rate reduces power, not energy

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Power

    Intel 80386 consumed ~ 2 W

    3.3 GHz Intel Core i7 consumes 130 W

    Heat must be dissipated from 1.5 x 1.5 cm chip

    This is the limit of what can be cooled by air

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Reducing Power

    Techniques for reducing power:

    Do nothing well

    Dynamic Voltage-Frequency Scaling

    Low power state for DRAM, disks

    turning off cores

  • Dr. A

    mr T

    ala

    at

    Elect 707

    Co

    mp

    ute

    r Arc

    hite

    ctu

    re a

    nd

    Ap

    plic

    atio

    ns

    Reading Assignment:

    Chapter 1, Appendix B