
Page 1: Final Review

Final Review

Dr. Bernard Chen, Ph.D., University of Central Arkansas

Fall 2010

Page 2: Final Review

Overcome Data Hazards with Dynamic Scheduling

Key idea: allow instructions behind a stall to proceed:

DIV F0  <- F2/F4
ADD F10 <- F0+F8
SUB F12 <- F8-F14

Page 3: Final Review

Overcome Data Hazards with Dynamic Scheduling

Key idea: allow instructions behind a stall to proceed:

DIV F0  <- F2/F4
SUB F12 <- F8-F14
ADD F10 <- F0+F8

Page 4: Final Review

Overcome Data Hazards with Dynamic Scheduling

Key idea: allow instructions behind a stall to proceed:

DIV F0  <- F2/F4
SUB F12 <- F8-F14
ADD F10 <- F0+F8

Enables out-of-order execution and allows out-of-order completion (e.g., SUB completes before the stalled ADD).

In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue); see the sketch below.
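As a concrete illustration of in-order issue with out-of-order execution, here is a minimal sketch (the Instr tuple and the helper are illustrative, not from the slides) of the independence test that lets SUB proceed while ADD waits on DIV:

# Minimal sketch: decide whether an instruction may proceed past a stalled
# one. An instruction may bypass only if it neither reads a pending result
# (RAW) nor conflicts on a destination register (WAW/WAR).
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dst", "srcs"])

def can_bypass(instr, pending):
    """True if `instr` is independent of every instruction in `pending`."""
    for p in pending:
        if p.dst in instr.srcs:          # RAW: reads a pending result
            return False
        if instr.dst == p.dst:           # WAW: writes the same destination
            return False
        if instr.dst in p.srcs:          # WAR: overwrites a pending source
            return False
    return True

div = Instr("DIV", "F0", ("F2", "F4"))   # long-latency, still executing
add = Instr("ADD", "F10", ("F0", "F8"))  # RAW on F0 -> must wait
sub = Instr("SUB", "F12", ("F8", "F14")) # independent -> may proceed

print(can_bypass(add, [div]))  # False: ADD needs DIV's result F0
print(can_bypass(sub, [div]))  # True:  SUB can execute out of order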

Page 5: Final Review

Overcome Data Hazards with Dynamic Scheduling

However, dynamic execution creates WAR and WAW hazards and makes exceptions harder.

Name dependence: two instructions use the same register or memory location (called a name), but there is no flow of data between the instructions associated with that name.

There are two versions of name dependence.

Page 6: Final Review

WAR

InstrJ writes an operand before InstrI reads it. If it causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Page 7: Final Review

WAW

InstrJ writes an operand before InstrI writes it. If this output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
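To tie the two definitions together, here is a minimal sketch (illustrative, using the same Instr tuple as the earlier sketch) that classifies the dependence between an earlier and a later instruction:

# Illustrative hazard classifier for a pair of instructions, where `first`
# is earlier in program order than `second`.
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dst", "srcs"])

def classify(first, second):
    hazards = []
    if first.dst in second.srcs:
        hazards.append("RAW")  # second reads what first writes
    if second.dst in first.srcs:
        hazards.append("WAR")  # second writes what first still reads
    if second.dst == first.dst:
        hazards.append("WAW")  # both write the same destination
    return hazards or ["none"]

# Page 6 example: I reads r1, J writes r1 -> WAR between I and J
print(classify(Instr("SUB", "r4", ("r1", "r3")),
               Instr("ADD", "r1", ("r2", "r3"))))  # ['WAR']

# Page 7 example: I and J both write r1 -> WAW between I and J
print(classify(Instr("SUB", "r1", ("r4", "r3")),
               Instr("ADD", "r1", ("r2", "r3"))))  # ['WAW']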

Page 8: Final Review

Thread-Level Parallelism (TLP)

Thread: a process with its own instructions and data. A thread may be one process of a parallel program of multiple processes, or it may be an independent program.

Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute.

(Ch4: Data-Level Parallelism: perform identical operations on data, and lots of data.)

Page 9: Final Review

New Approach: Multithreaded Execution

Multithreading: multiple threads share the functional units of one processor via overlapping.

The processor must duplicate the independent state of each thread: e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table. A sketch of such a per-thread context appears below.
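A minimal sketch of that duplicated per-thread state (field names are illustrative; real hardware keeps these in dedicated register files and control registers, not objects):

# Minimal sketch of the state a multithreaded processor keeps per thread.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                          # separate program counter
    regs: list = field(default_factory=lambda: [0] * 32)  # register file copy
    page_table_base: int = 0             # separate page table (independent programs)
    stalled: bool = False                # e.g., waiting on a cache miss

# Four hardware thread contexts with distinct (made-up) page tables
contexts = [ThreadContext(page_table_base=0x1000 * i) for i in range(4)]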

Page 10: Final Review

New Approach: Multithreaded Execution

When to switch?

Alternate instructions per thread (fine grain).

When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain).

Page 11: Final Review

Fine-Grained Multithreading

Switches between threads on each instruction, causing the execution of multiple threads to be interleaved.

Usually done in a round-robin fashion, skipping any stalled threads (see the sketch below).

The CPU must be able to switch threads every clock.
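A minimal sketch of this policy (the thread streams and stall set are illustrative):

# Fine-grained multithreading: each clock, pick the next ready thread in
# round-robin order, skipping any that are stalled or finished.
def fine_grained(threads, stalled, cycles):
    """threads: list of instruction lists; stalled: set of thread ids."""
    n = len(threads)
    pcs = [0] * n                        # per-thread program counter
    t = 0                                # thread to try this clock
    for clock in range(cycles):
        for _ in range(n):               # skip stalled/finished threads
            if t not in stalled and pcs[t] < len(threads[t]):
                print(f"clock {clock}: thread {t} issues {threads[t][pcs[t]]}")
                pcs[t] += 1
                break
            t = (t + 1) % n
        t = (t + 1) % n                  # round-robin to the next thread

streams = [["i0", "i1"], ["j0", "j1"], ["k0", "k1"]]
fine_grained(streams, stalled={1}, cycles=4)
# clock 0: thread 0 issues i0
# clock 1: thread 2 issues k0   (thread 1 skipped)
# clock 2: thread 0 issues i1
# clock 3: thread 2 issues k1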

Page 12: Final Review

Coarse-Grained Multithreading

Switches threads only on costly stalls, such as L2 cache misses.

Advantages:

Relieves the need for very fast thread switching.

Doesn't slow down a thread, since instructions from other threads are issued only when the thread encounters a costly stall.

Page 13: Final Review

Coarse-Grained Multithreading

Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs. Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen.

A new thread must fill the pipeline before its instructions can complete.

Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time (a quick numeric check follows).
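A quick numeric check of the refill << stall condition, with made-up cycle counts:

# Illustrative arithmetic for the "pipeline refill << stall time" condition.
refill = 14                         # cycles to refill the pipeline after a switch

for stall in (200, 20):             # long L2 miss vs. a short stall
    saved = stall - refill          # cycles recovered by switching threads
    print(f"stall={stall}: switching saves {saved} cycles "
          f"({saved / stall:.0%} of the stall)")
# stall=200: switching saves 186 cycles (93% of the stall)
# stall=20:  switching saves 6 cycles (30% of the stall)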

Page 14: Final Review

Multithreaded Categories

[Figure: issue-slot diagram; the legend marks Threads 1-5.]

Page 15: Final Review

Multithreaded Categories

[Figure: issue slots versus time (processor cycles) for Superscalar, Fine-Grained, and Coarse-Grained (2-clock-cycle) execution; shading marks Threads 1-5 and idle slots.]

Page 16: Final Review

Flynn’s Taxonomy

M. J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

Single Instruction, Single Data (SISD) (uniprocessor)

Single Instruction, Multiple Data (SIMD) (single PC/server)

Multiple Instruction, Single Data (MISD) (????)

Multiple Instruction, Multiple Data (MIMD) (clusters, SMP servers)

Page 17: Final Review

Back to Basics

“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”

Parallel Architecture = Computer Architecture + Communication Architecture

Two classes of multiprocessors with respect to memory:

1. Centralized-Memory Multiprocessor
• At most a few dozen processor chips: small enough to share a single, centralized memory

2. Physically Distributed-Memory Multiprocessor
• Larger number of chips and cores
• Bandwidth demands force memory to be distributed among the processors

Page 20: Final Review

2 Models for Communication and Memory Architecture

The first kind: communication occurs through a shared address space.

Centralized-memory multiprocessors use this type of communication and are called symmetric shared-memory multiprocessors.

Page 21: Final Review

2 Models for Communication and Memory Architecture

The first kind: communication occurs through a shared address space.

Even physically separate memories can be addressed as one logically shared address space, meaning a memory reference can be made by any processor to any memory location (assuming it has the access rights).

These multiprocessors are called distributed shared-memory (DSM) multiprocessors.

Page 22: Final Review

2 Models for Communication and Memory Architecture

1. Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either

• symmetric shared memory (centralized-memory MP)

• distributed shared memory (distributed-memory MP)

2. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (distributed-memory MP)

Page 23: Final Review

Multiprocessor Performance: Amdahl’s Law
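For reference, the standard statement: if a fraction f of a program's work can be parallelized across N processors, the overall speedup is

Speedup = 1 / ((1 - f) + f/N)

For example, with f = 0.9 and N = 10, Speedup = 1 / (0.1 + 0.09) ≈ 5.3; even with unlimited processors the speedup never exceeds 1 / (1 - f) = 10.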

Page 24: Final Review

2 Classes of Cache Coherence Protocols

1. Snooping — every cache with a copy of the data also has a copy of the sharing status of the block, but no centralized state is kept.

2. Directory based — the sharing status of a block of physical memory is kept in just one location, the directory.

Page 25: Final Review

Snooping

Write through: the information is written to both the block in the cache and the block in the lower-level memory.

Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced or needed. (A sketch contrasting the two policies follows.)
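A minimal sketch of the two write policies for a single cached block (the structure is illustrative; `memory` stands in for lower-level memory):

# Write through updates memory on every write; write back defers the
# update until the dirty block is evicted.
class WriteThroughBlock:
    def __init__(self, memory):
        self.memory, self.data = memory, None
    def write(self, addr, value):
        self.data = value
        self.memory[addr] = value        # memory updated on every write

class WriteBackBlock:
    def __init__(self, memory):
        self.memory, self.data, self.dirty, self.addr = memory, None, False, None
    def write(self, addr, value):
        self.data, self.addr, self.dirty = value, addr, True  # cache only
    def evict(self):
        if self.dirty:                   # memory updated only on replacement
            self.memory[self.addr] = self.data
            self.dirty = False

mem = {0x10: 0}
wb = WriteBackBlock(mem)
wb.write(0x10, 1)
print(mem[0x10])   # 0 -> memory is stale until eviction
wb.evict()
print(mem[0x10])   # 1 -> written back on replacement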

Page 26: Final Review

Snooping (write back)

Time | Processor activity  | Bus activity       | CPU A's cache | CPU B's cache | Memory X
0    |                     |                    |               |               | 0
1    | CPU A reads X       | Cache miss for X   | 0             |               | 0
2    | CPU B reads X       | Cache miss for X   | 0             | 0             | 0
3    | CPU A writes 1 to X | Invalidation for X | 1             |               | 0
4    | CPU B reads X       | Cache miss for X   | 1             | 1             | 1

Page 27: Final Review

Snooping (write through)

Time | Processor activity  | Bus activity       | CPU A's cache | CPU B's cache | Memory X
0    |                     |                    |               |               | 0
1    | CPU A reads X       | Cache miss for X   | 0             |               | 0
2    | CPU B reads X       | Cache miss for X   | 0             | 0             | 0
3    | CPU A writes 1 to X | Invalidation for X | 1             |               | 1
4    | CPU B reads X       | Cache miss for X   | 1             | 1             | 1

Page 28: Final Review

Directory-Based Cache Coherence Protocols

To implement the operations, a directory must track the state of each cache block:

Shared (S): one or more processors have the block cached, and the value is up to date.

Uncached (U): no processor has a copy of the cache block.

Modified/Exclusive (E): exactly one processor has a copy of the cache block. That processor is called the owner of the block.
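A minimal sketch of one directory entry and its transitions (structure and method names are illustrative); the call sequence reproduces the walkthrough on the pages that follow:

# One directory entry: a block state (U/S/E) and a bit vector of sharers.
# A real directory also forwards data, invalidations, and write-backs.
class DirectoryEntry:
    def __init__(self, n_cpus):
        self.state = "U"                 # Uncached
        self.sharers = [0] * n_cpus      # bit vector

    def read(self, cpu):
        # On a read to an E block, the owner writes back and stays a sharer.
        self.sharers[cpu] = 1
        self.state = "S"

    def write(self, cpu):
        self.sharers = [0] * len(self.sharers)  # invalidate other copies
        self.sharers[cpu] = 1
        self.state = "E"                 # exactly one owner

d = DirectoryEntry(3)
d.read(0);  print(d.state, d.sharers)   # S [1, 0, 0]  (page 30)
d.read(2);  print(d.state, d.sharers)   # S [1, 0, 1]  (page 31)
d.write(0); print(d.state, d.sharers)   # E [1, 0, 0]  (page 32)
d.read(1);  print(d.state, d.sharers)   # S [1, 1, 0]  (page 33)
d.write(2); print(d.state, d.sharers)   # E [0, 0, 1]  (page 34)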

Page 29: Final Review

Directory-Based Protocol

[Figure: CPUs 0-2 with private caches on an interconnection network; memory holds X = 7; directory entry for X: state U, bit vector 0 0 0.]

Page 30: Final Review

CPU 0 Reads X

[Figure: CPU 0's cache now holds X = 7; directory entry for X: S, 1 0 0.]

Page 31: Final Review

CPU 2 Reads X

[Figure: CPU 0 and CPU 2 caches hold X = 7; directory entry for X: S, 1 0 1.]

Page 32: Final Review

CPU 0 Writes 6 to X

[Figure: CPU 0's cache holds X = 6, other copies invalidated; memory still holds X = 7; directory entry for X: E, 1 0 0.]

Page 33: Final Review

CPU 1 Reads X

[Figure: CPU 0 writes back X = 6, so memory now holds X = 6; CPU 0 and CPU 1 caches hold X = 6; directory entry for X: S, 1 1 0.]

Page 34: Final Review

CPU 2 Writes 5 to X (Write back)

[Figure: CPU 2's cache holds X = 5; memory still holds X = 6; directory entry for X: E, 0 0 1.]

Page 35: Final Review

CPU 0 Writes 4 to X

[Figure: CPU 2's X = 5 is written back to memory; CPU 0's cache holds X = 4; directory entry for X: E, 1 0 0.]

Page 36: Final Review

Evaluating Switch Topologies

Diameter: the distance between the farthest two nodes.

Bisection width: the minimum number of edges in a cut that roughly divides the network into two halves; determines the minimum bandwidth of the network.

Degree: the number of edges per node; a constant degree means boards can be mass-produced.

Constant edge length? (yes/no)

(A worked check of these metrics for the topologies on the following pages appears below.)
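As a worked check of these criteria for two of the topologies that follow, using the standard formulas (degree quoted for interior mesh nodes; p is the number of nodes):

# Standard topology metrics, assuming p is a perfect square (mesh)
# or a power of two (hypercube).
import math

def mesh_metrics(p):                     # sqrt(p) x sqrt(p) mesh, no wraparound
    s = math.isqrt(p)
    return {"diameter": 2 * (s - 1), "bisection width": s, "degree": 4}

def hypercube_metrics(p):                # p = 2**d nodes
    d = round(math.log2(p))
    return {"diameter": d, "bisection width": p // 2, "degree": d}

print(mesh_metrics(16))       # {'diameter': 6, 'bisection width': 4, 'degree': 4}
print(hypercube_metrics(16))  # {'diameter': 4, 'bisection width': 8, 'degree': 4}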

Page 37: Final Review

2-D Mesh Network

Page 38: Final Review

Binary Tree Network

Page 39: Final Review

Hypercube: a 2 × 2 × … × 2 mesh

[Figure: 4-dimensional hypercube with 16 nodes labeled 0000 through 1111; nodes are adjacent exactly when their labels differ in one bit.]
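A minimal sketch of the labeling rule in the figure (neighbors differ in exactly one bit), computing a node's neighbors by flipping each address bit:

# Neighbors of a hypercube node: flip each of the d address bits in turn.
def neighbors(node, d):
    return [node ^ (1 << bit) for bit in range(d)]

print([f"{n:04b}" for n in neighbors(0b0110, 4)])
# ['0111', '0100', '0010', '1110']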

Page 40: Final Review

Hypercubes Illustrated

Page 41: Final Review

Butterfly Network

[Figure: butterfly network with ranks 0-3 and columns 0-7; switch (r, c) sits at rank r, column c.]