
Page 1: Final Review

Final Review

Dr. Bernard Chen, Ph.D., University of Central Arkansas

Fall 2010

Page 2: Final Review

Overcome Data Hazards with Dynamic Scheduling

Key idea: allow instructions behind a stall to proceed:

DIV F0  <- F2/F4
ADD F10 <- F0+F8
SUB F12 <- F8-F14

Page 3: Final Review

Overcome Data Hazards with Dynamic Scheduling

Key idea: allow instructions behind a stall to proceed:

DIV F0  <- F2/F4
SUB F12 <- F8-F14
ADD F10 <- F0+F8

Page 4: Final Review

Overcome Data Hazards with Dynamic Scheduling

Key idea: allow instructions behind a stall to proceed:

DIV F0  <- F2/F4
SUB F12 <- F8-F14
ADD F10 <- F0+F8

Enables out-of-order execution and allows out-of-order completion (e.g., SUB completes before the stalled ADD).

In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue); see the sketch below.
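As a concrete illustration of in-order issue with out-of-order execution, here is a minimal sketch (the Instr tuple and the helper are illustrative, not from the slides) of the independence test that lets SUB proceed while ADD waits on DIV:

# Minimal sketch: decide whether an instruction may proceed past a stalled
# one. An instruction may bypass only if it neither reads a pending result
# (RAW) nor conflicts on a destination register (WAW/WAR).
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dst", "srcs"])

def can_bypass(instr, pending):
    """True if `instr` is independent of every instruction in `pending`."""
    for p in pending:
        if p.dst in instr.srcs:          # RAW: reads a pending result
            return False
        if instr.dst == p.dst:           # WAW: writes the same destination
            return False
        if instr.dst in p.srcs:          # WAR: overwrites a pending source
            return False
    return True

div = Instr("DIV", "F0", ("F2", "F4"))   # long-latency, still executing
add = Instr("ADD", "F10", ("F0", "F8"))  # RAW on F0 -> must wait
sub = Instr("SUB", "F12", ("F8", "F14")) # independent -> may proceed

print(can_bypass(add, [div]))  # False: ADD needs DIV's result F0
print(can_bypass(sub, [div]))  # True:  SUB can execute out of order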

Page 5: Final Review

Overcome Data Hazards with Dynamic Scheduling

However, dynamic execution creates WAR and WAW hazards and makes exceptions harder.

Name dependence: two instructions use the same register or memory location (called a name), but there is no flow of data between the instructions associated with that name.

There are two versions of name dependence.

Page 6: Final Review

WAR

InstrJ writes an operand before InstrI reads it. If it causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Page 7: Final Review

WAW

InstrJ writes an operand before InstrI writes it. If this output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
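To tie the two definitions together, here is a minimal sketch (illustrative, using the same Instr tuple as the earlier sketch) that classifies the dependence between an earlier and a later instruction:

# Illustrative hazard classifier for a pair of instructions, where `first`
# is earlier in program order than `second`.
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dst", "srcs"])

def classify(first, second):
    hazards = []
    if first.dst in second.srcs:
        hazards.append("RAW")  # second reads what first writes
    if second.dst in first.srcs:
        hazards.append("WAR")  # second writes what first still reads
    if second.dst == first.dst:
        hazards.append("WAW")  # both write the same destination
    return hazards or ["none"]

# Page 6 example: I reads r1, J writes r1 -> WAR between I and J
print(classify(Instr("SUB", "r4", ("r1", "r3")),
               Instr("ADD", "r1", ("r2", "r3"))))  # ['WAR']

# Page 7 example: I and J both write r1 -> WAW between I and J
print(classify(Instr("SUB", "r1", ("r4", "r3")),
               Instr("ADD", "r1", ("r2", "r3"))))  # ['WAW']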

Page 8: Final Review

Thread-Level Parallelism (TLP)

Thread: a process with its own instructions and data. A thread may be one process of a parallel program of multiple processes, or it may be an independent program.

Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute.

(Ch4: Data-Level Parallelism: perform identical operations on data, and lots of data.)

Page 9: Final Review

New Approach: Multithreaded Execution

Multithreading: multiple threads share the functional units of one processor via overlapping.

The processor must duplicate the independent state of each thread: e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table. A sketch of such a per-thread context appears below.
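A minimal sketch of that duplicated per-thread state (field names are illustrative; real hardware keeps these in dedicated register files and control registers, not objects):

# Minimal sketch of the state a multithreaded processor keeps per thread.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                          # separate program counter
    regs: list = field(default_factory=lambda: [0] * 32)  # register file copy
    page_table_base: int = 0             # separate page table (independent programs)
    stalled: bool = False                # e.g., waiting on a cache miss

# Four hardware thread contexts with distinct (made-up) page tables
contexts = [ThreadContext(page_table_base=0x1000 * i) for i in range(4)]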

Page 10: Final Review

New Approach: Multithreaded Execution

When to switch?

Alternate instructions per thread (fine grain).

When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain).

Page 11: Final Review

Fine-Grained Multithreading

Switches between threads on each instruction, causing the execution of multiple threads to be interleaved.

Usually done in a round-robin fashion, skipping any stalled threads (see the sketch below).

The CPU must be able to switch threads every clock.
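A minimal sketch of this policy (the thread streams and stall set are illustrative):

# Fine-grained multithreading: each clock, pick the next ready thread in
# round-robin order, skipping any that are stalled or finished.
def fine_grained(threads, stalled, cycles):
    """threads: list of instruction lists; stalled: set of thread ids."""
    n = len(threads)
    pcs = [0] * n                        # per-thread program counter
    t = 0                                # thread to try this clock
    for clock in range(cycles):
        for _ in range(n):               # skip stalled/finished threads
            if t not in stalled and pcs[t] < len(threads[t]):
                print(f"clock {clock}: thread {t} issues {threads[t][pcs[t]]}")
                pcs[t] += 1
                break
            t = (t + 1) % n
        t = (t + 1) % n                  # round-robin to the next thread

streams = [["i0", "i1"], ["j0", "j1"], ["k0", "k1"]]
fine_grained(streams, stalled={1}, cycles=4)
# clock 0: thread 0 issues i0
# clock 1: thread 2 issues k0   (thread 1 skipped)
# clock 2: thread 0 issues i1
# clock 3: thread 2 issues k1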

Page 12: Final Review

Coarse-Grained Multithreading

Switches threads only on costly stalls, such as L2 cache misses.

Advantages:

Relieves the need for very fast thread switching.

Doesn't slow down a thread, since instructions from other threads are issued only when the thread encounters a costly stall.

Page 13: Final Review

Coarse-Grained Multithreading

Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs. Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen.

A new thread must fill the pipeline before its instructions can complete.

Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time (a quick numeric check follows).
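A quick numeric check of the refill << stall condition, with made-up cycle counts:

# Illustrative arithmetic for the "pipeline refill << stall time" condition.
refill = 14                         # cycles to refill the pipeline after a switch

for stall in (200, 20):             # long L2 miss vs. a short stall
    saved = stall - refill          # cycles recovered by switching threads
    print(f"stall={stall}: switching saves {saved} cycles "
          f"({saved / stall:.0%} of the stall)")
# stall=200: switching saves 186 cycles (93% of the stall)
# stall=20:  switching saves 6 cycles (30% of the stall)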

Page 14: Final Review

Multithreaded Categories

[Figure: issue-slot diagram; the legend marks Threads 1-5.]

Page 15: Final Review

Multithreaded Categories

[Figure: issue slots versus time (processor cycles) for Superscalar, Fine-Grained, and Coarse-Grained (2-clock-cycle) execution; shading marks Threads 1-5 and idle slots.]

Page 16: Final Review

Flynn’s Taxonomy

M. J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

Single Instruction, Single Data (SISD) (uniprocessor)

Single Instruction, Multiple Data (SIMD) (single PC/server)

Multiple Instruction, Single Data (MISD) (????)

Multiple Instruction, Multiple Data (MIMD) (clusters, SMP servers)

Page 17: Final Review

Back to Basics

“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”

Parallel Architecture = Computer Architecture + Communication Architecture

Two classes of multiprocessors with respect to memory:

1. Centralized-Memory Multiprocessor
• At most a few dozen processor chips: small enough to share a single, centralized memory

2. Physically Distributed-Memory Multiprocessor
• Larger number of chips and cores
• Bandwidth demands force memory to be distributed among the processors

Page 20: Final Review

2 Models for Communication and Memory Architecture

The first kind: communication occurs through a shared address space.

Centralized-memory multiprocessors use this type of communication and are called symmetric shared-memory multiprocessors.

Page 21: Final Review

2 Models for Communication and Memory Architecture

The first kind: communication occurs through a shared address space.

Even physically separate memories can be addressed as one logically shared address space, meaning a memory reference can be made by any processor to any memory location (assuming it has the access rights).

These multiprocessors are called distributed shared-memory (DSM) multiprocessors.

Page 22: Final Review

2 Models for Communication and Memory Architecture

1. Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either

• symmetric shared memory (centralized-memory MP)

• distributed shared memory (distributed-memory MP)

2. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (distributed-memory MP)

Page 23: Final Review

Multiprocessor Performance: Amdahl’s Law
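For reference, the standard statement: if a fraction f of a program's work can be parallelized across N processors, the overall speedup is

Speedup = 1 / ((1 - f) + f/N)

For example, with f = 0.9 and N = 10, Speedup = 1 / (0.1 + 0.09) ≈ 5.3; even with unlimited processors the speedup never exceeds 1 / (1 - f) = 10.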

Page 24: Final Review

2 Classes of Cache Coherence Protocols

1. Snooping — every cache with a copy of the data also has a copy of the sharing status of the block, but no centralized state is kept.

2. Directory based — the sharing status of a block of physical memory is kept in just one location, the directory.

Page 25: Final Review

Snooping

Write through: the information is written to both the block in the cache and the block in the lower-level memory.

Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced or needed. (A sketch contrasting the two policies follows.)
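A minimal sketch of the two write policies for a single cached block (the structure is illustrative; `memory` stands in for lower-level memory):

# Write through updates memory on every write; write back defers the
# update until the dirty block is evicted.
class WriteThroughBlock:
    def __init__(self, memory):
        self.memory, self.data = memory, None
    def write(self, addr, value):
        self.data = value
        self.memory[addr] = value        # memory updated on every write

class WriteBackBlock:
    def __init__(self, memory):
        self.memory, self.data, self.dirty, self.addr = memory, None, False, None
    def write(self, addr, value):
        self.data, self.addr, self.dirty = value, addr, True  # cache only
    def evict(self):
        if self.dirty:                   # memory updated only on replacement
            self.memory[self.addr] = self.data
            self.dirty = False

mem = {0x10: 0}
wb = WriteBackBlock(mem)
wb.write(0x10, 1)
print(mem[0x10])   # 0 -> memory is stale until eviction
wb.evict()
print(mem[0x10])   # 1 -> written back on replacement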

Page 26: Final Review

Snooping (write back)

Time | Processor activity  | Bus activity       | CPU A's cache | CPU B's cache | Memory X
0    |                     |                    |               |               | 0
1    | CPU A reads X       | Cache miss for X   | 0             |               | 0
2    | CPU B reads X       | Cache miss for X   | 0             | 0             | 0
3    | CPU A writes 1 to X | Invalidation for X | 1             |               | 0
4    | CPU B reads X       | Cache miss for X   | 1             | 1             | 1

Page 27: Final Review

Snooping (write through)

Time | Processor activity  | Bus activity       | CPU A's cache | CPU B's cache | Memory X
0    |                     |                    |               |               | 0
1    | CPU A reads X       | Cache miss for X   | 0             |               | 0
2    | CPU B reads X       | Cache miss for X   | 0             | 0             | 0
3    | CPU A writes 1 to X | Invalidation for X | 1             |               | 1
4    | CPU B reads X       | Cache miss for X   | 1             | 1             | 1

Page 28: Final Review

Directory-Based Cache Coherence Protocols

To implement the operations, a directory must track the state of each cache block:

Shared (S): one or more processors have the block cached, and the value is up to date.

Uncached (U): no processor has a copy of the cache block.

Modified/Exclusive (E): exactly one processor has a copy of the cache block. That processor is called the owner of the block.
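A minimal sketch of one directory entry and its transitions (structure and method names are illustrative); the call sequence reproduces the walkthrough on the pages that follow:

# One directory entry: a block state (U/S/E) and a bit vector of sharers.
# A real directory also forwards data, invalidations, and write-backs.
class DirectoryEntry:
    def __init__(self, n_cpus):
        self.state = "U"                 # Uncached
        self.sharers = [0] * n_cpus      # bit vector

    def read(self, cpu):
        # On a read to an E block, the owner writes back and stays a sharer.
        self.sharers[cpu] = 1
        self.state = "S"

    def write(self, cpu):
        self.sharers = [0] * len(self.sharers)  # invalidate other copies
        self.sharers[cpu] = 1
        self.state = "E"                 # exactly one owner

d = DirectoryEntry(3)
d.read(0);  print(d.state, d.sharers)   # S [1, 0, 0]  (page 30)
d.read(2);  print(d.state, d.sharers)   # S [1, 0, 1]  (page 31)
d.write(0); print(d.state, d.sharers)   # E [1, 0, 0]  (page 32)
d.read(1);  print(d.state, d.sharers)   # S [1, 1, 0]  (page 33)
d.write(2); print(d.state, d.sharers)   # E [0, 0, 1]  (page 34)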

Page 29: Final Review

Directory-Based Protocol

[Figure: CPUs 0-2 with private caches on an interconnection network; memory holds X = 7; directory entry for X: state U, bit vector 0 0 0.]

Page 30: Final Review

CPU 0 Reads X

[Figure: CPU 0's cache now holds X = 7; directory entry for X: S, 1 0 0.]

Page 31: Final Review

CPU 2 Reads X

[Figure: CPU 0 and CPU 2 caches hold X = 7; directory entry for X: S, 1 0 1.]

Page 32: Final Review

CPU 0 Writes 6 to X

[Figure: CPU 0's cache holds X = 6, other copies invalidated; memory still holds X = 7; directory entry for X: E, 1 0 0.]

Page 33: Final Review

CPU 1 Reads X

[Figure: CPU 0 writes back X = 6, so memory now holds X = 6; CPU 0 and CPU 1 caches hold X = 6; directory entry for X: S, 1 1 0.]

Page 34: Final Review

CPU 2 Writes 5 to X (Write back)

[Figure: CPU 2's cache holds X = 5; memory still holds X = 6; directory entry for X: E, 0 0 1.]

Page 35: Final Review

CPU 0 Writes 4 to X

[Figure: CPU 2's X = 5 is written back to memory; CPU 0's cache holds X = 4; directory entry for X: E, 1 0 0.]

Page 36: Final Review

Evaluating Switch Topologies

Diameter: the distance between the farthest two nodes.

Bisection width: the minimum number of edges in a cut that roughly divides the network into two halves; determines the minimum bandwidth of the network.

Degree: the number of edges per node; a constant degree means boards can be mass-produced.

Constant edge length? (yes/no)

(A worked check of these metrics for the topologies on the following pages appears below.)
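As a worked check of these criteria for two of the topologies that follow, using the standard formulas (degree quoted for interior mesh nodes; p is the number of nodes):

# Standard topology metrics, assuming p is a perfect square (mesh)
# or a power of two (hypercube).
import math

def mesh_metrics(p):                     # sqrt(p) x sqrt(p) mesh, no wraparound
    s = math.isqrt(p)
    return {"diameter": 2 * (s - 1), "bisection width": s, "degree": 4}

def hypercube_metrics(p):                # p = 2**d nodes
    d = round(math.log2(p))
    return {"diameter": d, "bisection width": p // 2, "degree": d}

print(mesh_metrics(16))       # {'diameter': 6, 'bisection width': 4, 'degree': 4}
print(hypercube_metrics(16))  # {'diameter': 4, 'bisection width': 8, 'degree': 4}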

Page 37: Final Review

2-D Mesh Network

Page 38: Final Review

Binary Tree Network

Page 39: Final Review

Hypercube: a 2 × 2 × … × 2 mesh

[Figure: 4-dimensional hypercube with 16 nodes labeled 0000 through 1111; nodes are adjacent exactly when their labels differ in one bit.]
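A minimal sketch of the labeling rule in the figure (neighbors differ in exactly one bit), computing a node's neighbors by flipping each address bit:

# Neighbors of a hypercube node: flip each of the d address bits in turn.
def neighbors(node, d):
    return [node ^ (1 << bit) for bit in range(d)]

print([f"{n:04b}" for n in neighbors(0b0110, 4)])
# ['0111', '0100', '0010', '1110']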

Page 40: Final Review

Hypercubes Illustrated

Page 41: Final Review

Butterfly Network

[Figure: butterfly network with ranks 0-3 and columns 0-7; switch (r, c) sits at rank r, column c.]