CSE 8383 - Advanced Computer Architecture Week-4 Week of Feb 2, 2004 engr.smu.edu/~rewini/8383

Contents

Reservation Table
Latency Analysis
State Diagrams
MAL and its bounds
Delay Insertion
Throughput
Group Work
Introduction to Multiprocessors

Reservation Table

A reservation table displays the time-space flow of data through the pipeline for one function evaluation.

A static pipeline is specified by a single reservation table.

A dynamic pipeline may be specified by multiple reservation tables.

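Static Pipeline

[Figure: reservation table for a four-stage static pipeline (stages S1 to S4 against time), one mark per stage]

Dynamic Pipeline

[Figure: reservation tables for two functions on a three-stage dynamic pipeline (stages S1 to S3). Function X has three, two, and three marks in rows S1, S2, and S3; function Y has two, one, and three marks.]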

Reservation Table (Cont.)

The number of columns in a reservation table is called the evaluation time of a given function.

The checkmarks in a row correspond to the time instants (cycles) in which a particular stage is used.

Multiple checkmarks in a row: repeated usage of the same stage in different cycles.

Reservation Table (Cont.)

Contiguous checkmarks in a row: extended usage of a stage over more than one cycle.

Multiple checkmarks in one column: multiple stages are used in parallel.

A dynamic pipeline may allow different initiations to follow a mix of reservation tables.

Reservation Table

1 2 3 4 5 6 7

A X X X

B X X

C X X

D X

Latency Analysis

The number of cycles between two initiations is the latency between them.

A latency of k: two initiations are separated by k cycles.

Collision: a resource conflict between two initiations.

Forbidden latencies: latencies that cause collisions.

Collision with latency 2 & 5 in evaluating X

[Figure: two occupancy charts for stages S1 to S3 showing initiations X1 and X2 colliding, one chart for latency 2 and one for latency 5]
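A collision occurs when two initiations need the same stage in the same cycle, so the forbidden latencies are the distances between any two marks in one row of the reservation table. A minimal Python sketch of that rule follows. The mark positions in `TABLE_X` are an assumption: the column positions are not given in these notes, so the sketch uses a three-stage, eight-column layout that is consistent with the collisions at latencies 2 and 5 named above. The names `TABLE_X`, `evaluation_time`, and `forbidden_latencies` are illustrative.

```python
from itertools import combinations

# Assumed reservation table for function X: stage -> columns (cycles) marked.
# The column positions are an assumption, not taken from the slides.
TABLE_X = {
    "S1": [1, 6, 8],
    "S2": [2, 4],
    "S3": [3, 5, 7],
}


def evaluation_time(table):
    """Number of columns: the last cycle in which any stage is used."""
    return max(max(columns) for columns in table.values())


def forbidden_latencies(table):
    """Distances between any two marks in the same row cause collisions."""
    forbidden = set()
    for columns in table.values():
        for earlier, later in combinations(sorted(columns), 2):
            forbidden.add(later - earlier)
    return sorted(forbidden)


print(evaluation_time(TABLE_X))      # 8
print(forbidden_latencies(TABLE_X))  # [2, 4, 5, 7]
```

Every latency not in the returned list is permissible for this table.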

Latency Analysis (cont.)

Latency sequence: a sequence of permissible latencies between successive initiations.

Latency cycle: a latency sequence that repeats the same subsequence (cycle) indefinitely.

Example: the latency sequence 1, 8, 1, 8, 1, 8 … is the latency cycle (1,8).

Latency Analysis (cont.)

Average latency (of a latency cycle): sum of all latencies / number of latencies along the cycle.

Constant cycle: a latency cycle with only one latency value.

Objective: obtain the shortest average latency between initiations without causing collisions.

Latency Cycle (1,8)

[Figure: stage occupancy over cycles 1 to 21 for initiations X1 through X6 under the latency cycle (1,8)]

Average Latency = (1+8)/2 = 4.5

Latency Cycle (6)

[Figure: stage occupancy over cycles 1 to 21 for initiations X1 through X4 under the constant latency cycle (6)]

Average Latency = 6
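The two latency cycles (1,8) and (6) can be checked by scheduling initiations and looking for a stage that is reserved twice in the same cycle. The sketch below is one way to do that in Python. It assumes a reservation table for X whose mark positions are not given in these notes (a three-stage, eight-column layout is used for illustration), and the helper names `average_latency` and `has_collision` are hypothetical.

```python
from itertools import cycle, islice

# Assumed reservation table for X (column positions are an assumption).
TABLE_X = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}


def average_latency(latency_cycle):
    """Sum of all latencies divided by the number of latencies in the cycle."""
    return sum(latency_cycle) / len(latency_cycle)


def has_collision(table, latency_cycle, initiations=20):
    """Schedule `initiations` tasks and report whether any stage is double-booked."""
    start = 1
    busy = set()  # (stage, cycle) pairs already reserved
    for latency in islice(cycle(latency_cycle), initiations):
        for stage, columns in table.items():
            for column in columns:
                slot = (stage, start + column - 1)
                if slot in busy:
                    return True
                busy.add(slot)
        start += latency  # the next task starts `latency` cycles later
    return False


for candidate in [(1, 8), (6,), (2,), (5,)]:
    print(candidate, average_latency(candidate), has_collision(TABLE_X, candidate))
```

With the assumed table, (1,8) and (6) run without collisions, while the constant latencies 2 and 5 collide.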

Collision Vector

C = (Cm, Cm-1, …, C2, C1)

Ci = 1 if latency i causes a collision (forbidden)

Ci = 0 if latency i is permissible

Cm = 1 (always), where m is the maximum forbidden latency

m <= n-1, where n is the number of columns in the reservation table

Collision Vector (X after X)

Forbidden Latencies: 2, 4, 5, 7

Collision Vector = 1 0 1 1 0 1 0

Collision Vector (Y after Y)

Forbidden Latencies: 2, 4

Collision Vector = 1 0 1 0
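The mapping from forbidden latencies to a collision vector can be written in a few lines. This is a minimal Python sketch; the function name `collision_vector` is illustrative, and the vector is returned as a string with Cm on the left and C1 on the right, matching the notation above.

```python
def collision_vector(forbidden):
    """Return C = (Cm, ..., C1) as a bit string, with Ci = 1 for forbidden latency i."""
    m = max(forbidden)  # the maximum forbidden latency fixes the vector length
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


print(collision_vector({2, 4, 5, 7}))  # 1011010  (X after X)
print(collision_vector({2, 4}))        # 1010     (Y after Y)
```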

State Diagram

It specifies the permissible state transitions among successive initiations.

The collision vector corresponds to the initial state at time t = 1 (initial collision vector).

The next state comes at time t + p, where p is a permissible latency in the range 1 <= p < m.

Right Shift Register

The next state can be obtained with the help of an m-bit right shift register.

Each 1-bit shift corresponds to an increase in the latency by 1.

A 0 shifted out of the right end: safe to allow an initiation.

A 1 shifted out of the right end: collision.

The next state

The next state is obtained by bitwise ORing the initial collision vector with the shifted register.

C.V. = 1 0 1 1 0 1 0 (first state)

  0 1 0 1 1 0 1   C.V. 1-bit right shifted
  1 0 1 1 0 1 0   initial C.V.
  -------------   OR
  1 1 1 1 1 1 1
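The shift-and-OR step can be sketched in Python by holding the collision vector in an integer whose least significant bit is C1. The function name `next_state` and its parameters are illustrative, not from the slides.

```python
def next_state(state, initial, p):
    """Shift `state` right by p bits (zeros enter on the left), then OR with the initial C.V."""
    if (state >> (p - 1)) & 1:
        # Bit Cp is 1, so a 1 would be shifted out on the p-th shift: collision.
        raise ValueError(f"latency {p} is forbidden in state {state:b}")
    return (state >> p) | initial


INITIAL_X = 0b1011010  # collision vector for X after X

print(format(next_state(INITIAL_X, INITIAL_X, 1), "07b"))  # 1111111
print(format(next_state(INITIAL_X, INITIAL_X, 3), "07b"))  # 1011011
```

Asking for a forbidden latency, such as 2 from the initial state, raises an error instead of returning a state.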

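State Diagram for X

States: 1 0 1 1 0 1 0 (initial), 1 1 1 1 1 1 1, 1 0 1 1 0 1 1

From 1 0 1 1 0 1 0: latency 1* to 1 1 1 1 1 1 1; latency 3 or 6 to 1 0 1 1 0 1 1; latency 8+ back to itself

From 1 1 1 1 1 1 1: latency 8+ to 1 0 1 1 0 1 0

From 1 0 1 1 0 1 1: latency 3* or 6 back to itself; latency 8+ to 1 0 1 1 0 1 0

(* marks the minimum latency leaving a state; 8+ means a latency of 8 or more)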
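The state diagram for X can be generated by applying the shift-and-OR rule to every permissible latency of every state reached. The following Python sketch does this under one stated convention: the edge labelled m+1 stands for "m+1 or more" (the 8+ edge when m = 7). The function name `state_diagram` is illustrative.

```python
def state_diagram(initial, m):
    """Map each reachable state to {latency: next_state}.

    States are integers whose least significant bit is C1.
    Latency m + 1 stands for every latency greater than m.
    """
    diagram = {}
    pending = [initial]
    while pending:
        state = pending.pop()
        if state in diagram:
            continue
        edges = {}
        for p in range(1, m + 2):
            if p <= m and (state >> (p - 1)) & 1:
                continue  # Cp = 1: latency p is forbidden in this state
            edges[p] = (state >> p) | initial
        diagram[state] = edges
        pending.extend(edges.values())
    return diagram


for state, edges in sorted(state_diagram(0b1011010, 7).items()):
    for p, target in edges.items():
        print(f"{state:07b} --{p}--> {target:07b}")
```

For the collision vector 1011010 this yields three states and eight edges.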

Cycles

Simple cycles: each state appears only once. (3), (6), (8), (1,8), (3,8), and (6,8)

Greedy cycles: simple cycles whose edges are all made with minimum latencies from their respective starting states. (1,8), (3)

One of the greedy cycles gives the MAL.
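The simple and greedy cycles listed above can be enumerated mechanically. The Python sketch below rebuilds the state diagram from the collision vector, walks it to list simple cycles, and follows minimum-latency edges to find greedy cycles. It treats the "8+" edge as latency 8 (the smallest value it allows), which is an assumption of the sketch; all function names are illustrative.

```python
def state_diagram(initial, m):
    """Map each reachable state to {latency: next_state}; m + 1 means 'more than m'."""
    diagram, pending = {}, [initial]
    while pending:
        state = pending.pop()
        if state in diagram:
            continue
        diagram[state] = {
            p: (state >> p) | initial
            for p in range(1, m + 2)
            if p > m or not (state >> (p - 1)) & 1
        }
        pending.extend(diagram[state].values())
    return diagram


def simple_cycles(diagram):
    """Latency sequences of cycles in which each state appears only once."""
    found = []

    def walk(start, state, latencies, visited):
        for p, target in diagram[state].items():
            if target == start:
                found.append(tuple(latencies + [p]))
            elif target not in visited and target > start:
                # Only visit states above `start` so each cycle is listed once.
                walk(start, target, latencies + [p], visited | {target})

    for start in sorted(diagram):
        walk(start, start, [], {start})
    return found


def greedy_cycles(diagram):
    """Cycles reached by always taking the minimum-latency edge out of a state."""
    cycles = set()
    for start in diagram:
        path, latencies, state = [], [], start
        while state not in path:
            path.append(state)
            p = min(diagram[state])
            latencies.append(p)
            state = diagram[state][p]
        cyc = latencies[path.index(state):]
        # Record one canonical rotation so (1, 8) and (8, 1) count as the same cycle.
        cycles.add(min(tuple(cyc[k:] + cyc[:k]) for k in range(len(cyc))))
    return cycles


diagram_x = state_diagram(0b1011010, 7)
print(sorted(simple_cycles(diagram_x)))  # [(1, 8), (3,), (3, 8), (6,), (6, 8), (8,)]
print(sorted(greedy_cycles(diagram_x)))  # [(1, 8), (3,)]
print(min(sum(c) / len(c) for c in simple_cycles(diagram_x)))  # 3.0
```

Taking the minimum average over all simple cycles gives 3, the average latency of the greedy cycle (3).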

MAL

Minimum Average Latency

At least one of the greedy cycles will lead to the MAL.

Consider the state diagram for Y: the MAL is 3 (see diagram).

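State Diagram for Y

States: 1 0 1 0 (initial), 1 1 1 1, 1 0 1 1

From 1 0 1 0: latency 1* to 1 1 1 1; latency 3 to 1 0 1 1; latency 5+ back to itself

From 1 1 1 1: latency 5+ to 1 0 1 0

From 1 0 1 1: latency 3* back to itself; latency 5+ to 1 0 1 0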

Bounds on the MAL

The MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table. (Shar, 1972)

The MAL is lower than or equal to the average latency of any greedy cycle in the state diagram. (Shar, 1972)

The average latency of any greedy cycle is upper-bounded by the number of 1’s in the initial collision vector plus 1. This is also an upper bound on the MAL. (Shar, 1972)
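The three bounds can be checked numerically for the X and Y collision vectors. The Python sketch below computes the average latency of every greedy cycle and reports it between the lower bound (maximum checkmarks in a row) and the upper bound (number of 1's plus 1). The maximum of three checkmarks per row for both X and Y is read from the dynamic pipeline tables; the function names and the convention that m+1 stands for "more than m" are assumptions of the sketch.

```python
def min_latency_step(state, initial, m):
    """Smallest permissible latency in `state` and the state it leads to."""
    for p in range(1, m + 1):
        if not (state >> (p - 1)) & 1:
            return p, (state >> p) | initial
    return m + 1, initial  # every latency up to m is forbidden


def reachable_states(initial, m):
    """All states reachable from the initial collision vector."""
    seen, pending = set(), [initial]
    while pending:
        state = pending.pop()
        if state in seen:
            continue
        seen.add(state)
        for p in range(1, m + 1):
            if not (state >> (p - 1)) & 1:
                pending.append((state >> p) | initial)
    return seen


def greedy_averages(initial, m):
    """Average latency of each greedy cycle in the state diagram."""
    averages = set()
    for start in reachable_states(initial, m):
        path, latencies, state = [], [], start
        while state not in path:
            path.append(state)
            p, state = min_latency_step(state, initial, m)
            latencies.append(p)
        cyc = latencies[path.index(state):]
        averages.add(sum(cyc) / len(cyc))
    return averages


def bounds_report(initial, m, max_marks_per_row):
    """Return (lower bound, greedy-cycle averages, upper bound)."""
    upper = bin(initial).count("1") + 1
    return max_marks_per_row, sorted(greedy_averages(initial, m)), upper


print(bounds_report(0b1011010, 7, 3))  # (3, [3.0, 4.5], 5)
print(bounds_report(0b1010, 4, 3))     # (3, [3.0], 3)
```

In both cases the smallest greedy average equals the lower bound, so the MAL is 3 for X and for Y.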

Delay Insertion

The purpose is to modify the reservation table, yielding a new collision vector.

This may lead to a modified state diagram, which may produce greedy cycles meeting the lower bound on the MAL.

Example

[Figure: three-stage pipeline with stages S1, S2, and S3 and an output]

Example (Cont.)

1 2 3 4 5

S1 X X

S2 X X

S3 X X

Forbidden Latencies: 1, 2, 4

C.V. = 1 0 1 1

Example (Cont.)

State Diagram

Single state 1 0 1 1: latency 3* back to itself; latency 5+ back to itself

MAL = 3

Example (Cont.)

[Figure: the same pipeline with stages S1, S2, and S3 and an output, after inserting delay elements D1 and D2]

Example (Cont.)

1 2 3 4 5 6 7

S1 X X

S2 X X

S3 X X

D1 X

D2 X

Forbidden Latencies: 2, 6

C.V. = 1 0 0 0 1 0
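The effect of the inserted delays can be seen by comparing greedy cycles for the two collision vectors of this example, 1 0 1 1 before and 1 0 0 0 1 0 after. The Python sketch below follows minimum-latency edges from the initial state only, which is a simplification; the function name `greedy_cycle` is illustrative.

```python
def greedy_cycle(cv):
    """Follow minimum-latency edges from the initial state until a state repeats.

    `cv` is the collision vector as a bit string, Cm first and C1 last.
    Returns the latencies along the cycle that is reached.
    """
    m, initial = len(cv), int(cv, 2)
    path, latencies, state = [], [], initial
    while state not in path:
        path.append(state)
        # Smallest p with Cp = 0; m + 1 if every latency up to m is forbidden.
        p = next((q for q in range(1, m + 1) if not (state >> (q - 1)) & 1), m + 1)
        latencies.append(p)
        state = (state >> p) | initial
    return latencies[path.index(state):]


before = greedy_cycle("1011")
after = greedy_cycle("100010")
print(before, sum(before) / len(before))  # [3] 3.0
print(after, sum(after) / len(after))     # [3, 1] 2.0
```

Each stage row in this example has two marks, so the lower bound on the MAL is 2. The greedy cycle found after delay insertion has average latency 2 and therefore meets that bound, compared with 3 before.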

Group Activity 1

Find the State Diagram

Pipeline Throughput

The average number of task initiations per clock cycle.

The inverse of the MAL.
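As a small worked illustration, throughput can be expressed per clock cycle or, given a clock period, per second. The Python sketch below uses MAL = 3 and a clock period of 20 ns purely as example inputs; the function names are illustrative.

```python
def throughput_per_cycle(mal):
    """Average number of initiations per clock cycle: the inverse of the MAL."""
    return 1.0 / mal


def throughput_per_second(mal, clock_period_ns):
    """Initiations per second for a clock period given in nanoseconds."""
    return 1.0 / (mal * clock_period_ns * 1e-9)


print(throughput_per_cycle(3))        # one initiation every 3 cycles on average
print(throughput_per_second(3, 20))   # about 16.7 million initiations per second
```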

Group Activity 2

1 2 3 4

S1 X X

S2 X

S3 X

Find: C.V., State Diagram, Simple Cycles, Greedy Cycles, MAL, Throughput (t = 20 ns)

Multiprocessors

Introduction

Uniprocessor systems are not capable of delivering solutions to some problems in reasonable time.

Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution.

Speed-up versus Quality-up

Architecture Background

Three major components:

Processors

Memory Modules

Interconnection Network

Parallel and Distributed Computers

MIMD Shared Memory: bus based, switch based, CC-NUMA

MIMD Distributed Memory

SIMD Computers

Clusters

Grid Computing

MIMD Shared Memory Systems

[Figure: processors (P) and shared memory modules (M) connected through interconnection networks]

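Bus Based & switch based SM Systems

[Figure: a bus-based system in which processors (P), each with a cache (C), share a global memory; and a switch-based system in which processor-cache pairs connect to memory modules (M)]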

Cache Coherent NUMA

[Figure: four nodes, each with a memory (M), a cache (C), and a processor (P), connected by an interconnection network]

MIMD Distributed Memory Systems

[Figure: processors (P), each with its own memory module (M), connected through interconnection networks]

SIMD Computers

[Figure: a von Neumann computer (processor and memory) together with an array of 16 processor-memory (PM) elements joined by some interconnection network]

Clusters

[Figure: three nodes, each with a memory (M), a cache (C), a processor (P), I/O, and an OS, together with middleware, a programming environment, and an interconnection network]

Grids

Grids are geographically distributed platforms for computation.

They provide dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.

Interconnection Network Taxonomy

Interconnection Network
  Static: 1-D, 2-D, HC
  Dynamic
    Bus-based: Single, Multiple
    Switch-based: SS, MS, Crossbar
