
Parallel Programming Platforms


2/11/2003 platforms 1

Parallel Programming Platforms

CS 442/EECE 432, Brian T. Smith, UNM, CS Dept.

Spring, 2003

2/11/2003 platforms 2

The Standard Serial Model
• Von Neumann model

• called, in Flynn's taxonomy, a single instruction, single data stream (SISD) machine

– the programming model is:

• one sequential list of instructions of the form:

    loop until finished
        fetch instruction
        decode instruction
        fetch operands
        execute instruction
    end loop


2/11/2003 platforms 3

Typical Von Neumann Machines And Their Evolution

[Figure: four machines. (1) A simple (early) sequential computer: processor connected to memory. (2) A sequential computer with memory interleaving: processor connected to memory banks 1, 2, and 3. (3) A sequential computer with cache and memory interleaving: processor, cache, and memory banks 1, 2, and 3. (4) The CPU replaced by a pipelined processor with d stages s_(d-1), s_(d-2), …, s_0.]

2/11/2003 platforms 4

Avoidance Of Bottlenecks

A Bottleneck
• Many accesses to memory
• Memory latency too long
• CPU too slow

The Solution
• Multiple memory banks
• caches with blocks of size 1
• caches with larger blocks
• pipelined execution units
• multiple execution units
• fused add/multiply
• multiple processors sharing one memory


2/11/2003 platforms 5

The Result
• All of these solutions represent implicit parallelism: that is,
  • Parallelism implemented in/by the hardware
  • The programmer has no way to specify/control its use -- it is available full-time
  • The program, by the way it is written or by the particular algorithm used, either benefits from the parallelism or inhibits the parallel execution
• For the special case of data parallel languages such as C*, Fortran D, and HP Fortran (an extension of Fortran 90), the program constructs encourage the use of the hardware parallelism
  • Still requires good optimizing/cognizant compilers
• Example (where A, B, and C are arrays), such constructs as:

    A = B + C
    S = sum(B*C)

  are explicit parallel constructs (the order of evaluation is unspecified)
• OpenMP, an extension to Fortran and C, also provides specific language support for parallelism, but the way parallelism is specified is explicit in the program (a sketch follows)
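A minimal sketch of that explicit style using OpenMP's C binding (this example is mine, not from the course materials; the array names follow the slide, the sizes are illustrative):

    #include <stdio.h>
    #define N 1000

    int main(void) {
        double A[N], B[N], C[N], S = 0.0;
        for (int i = 0; i < N; i++) { B[i] = 1.0; C[i] = 2.0; }

        /* A = B + C: independent iterations may run on different processors */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = B[i] + C[i];

        /* S = sum(B*C): a reduction; the order of the partial sums is
           unspecified, exactly as the slide notes */
        #pragma omp parallel for reduction(+:S)
        for (int i = 0; i < N; i++)
            S += B[i] * C[i];

        printf("A[0] = %f, S = %f\n", A[0], S);
        return 0;
    }

Compile with an OpenMP-aware compiler (e.g. a flag such as -fopenmp); without it, the pragmas are ignored and the program runs serially.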

2/11/2003 platforms 6

Avoidance Continued

• Processing still too slow (costly)
• Multiple processors, each with their own memory (COTS)
• Parallel distributed machines with an interconnection fabric -- sometimes called multi-computers
  – Programmed typically via a message passing software layer (usually a library of procedures)
• Parallelism is specified explicitly in the program
• Standard examples are MPI, PVM, and lots of vendor-specific support libraries such as MPL (IBM) and NX (Intel) -- a minimal MPI example follows
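A minimal sketch of the message passing paradigm using MPI's standard C binding (the value, tag, and process roles are illustrative, not from the slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* lives in process 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.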


2/11/2003 platforms 7

Parallel Architectures -- Control Mechanism

• Replace the CPU by many processors
  – Centralized control or synchronization mechanism -- the concept sometimes referred to as a vector/array machine
    • Issues the same instruction to all processors
    – Each processor has different data, so that different computations are performed
    – Single instruction, multiple data machine (SIMD)
    – Uses the data parallelism idea (C*, Fortran D, HP Fortran)
  – No centralized control mechanism -- the concept sometimes referred to as a multi-computer
    – Multiple CPUs and memory
      » Completely separate instruction streams and no synchronization
      » Multiple instruction, multiple data (MIMD)
    – SPMD -- a programming paradigm to simplify use of an MIMD architecture
      » Single program, multiple data (SPMD) with NO synchrony

2/11/2003 platforms 8

SIMD vs MIMD Diagrams

Note:

• SIMD: one global control unit

• MIMD: independent control units, one for each processing element (PE)


2/11/2003 platforms 9

SIMD Program With Synchrony

Suppose divide takes longer than assignment. Then consider:

    if (B == 0)
        C = A
    else
        C = A/B
    D = …

[Figure: the values of A, B, and C on each processor, shown initially, after C = A, and after C = A/B -- e.g. processor 0 holds A = 5, B = 0, so C becomes 5, while processor 1 holds A = 4, B = 2, so C becomes 2. Timeline for procs #0-#3: all evaluate the test together; the processors whose B is 0 execute C = A while the others are idle (X -- masked store or execution); then the remaining processors execute C = A/B while the first group idles; finally all execute D = … in lockstep.]

2/11/2003 platforms 10

SPMD -- No Synchrony

Suppose divide takes longer than assignment. Then consider:

    if (B == 0)
        C = A
    else
        C = A/B
    D = …

[Figure: the same per-processor values of A, B, and C, shown initially, after C = A and C = A/B, and after D = …. Timeline for procs #0-#3: each processor evaluates the test and takes its own branch independently; a processor that executes the cheap assignment C = A reaches D = … earlier than one that executes the slower divide C = A/B. There is no masking of processors.]


2/11/2003 platforms 11

Parallel Architectures -- Address-Space Organization

• Message-Passing Architecture
  • Each processor has its own private memory, with its own addresses for memory locations
  • The programming paradigm is to pass messages from task to task to coordinate computational tasks

• Shared-Address-Space Architecture
  • All processors share the same address space
    – Location n is the same place for all processors
    – Access memory via a switch, bus, or memory router (depending on the manufacturer)
  • Software uses a shared memory programming paradigm
    – Sometimes specialized languages such as C* and HPF are used
      » In such cases, the compiler generates code to synchronize access to the shared address space

2/11/2003 platforms 12

Typical Message Passing Machine

Note:

• Memory is local, associated with each processor

• For a processor to read or write another processor's memory, a message consisting of the data is sent and received


2/11/2003 platforms 13

Typical Shared Address Space Machines

2/11/2003 platforms 14

Emulating Message Passing Using A Shared-Address Machine

• Messages are passed notionally as follows (see the sketch below):
  • Write the message to a shared space
  • Tell the receiver where the message is
  • The receiver reads the message and places it in its own space
• Emulating shared memory on a distributed memory machine needs hardware support to be effective (e.g. SGI Origin)
• Thus, message passing is viewed as the more general programming paradigm
  • Learn and implement messaging on either local memory or shared memory
  • Implement your program in the message passing paradigm and use/write a message passing library
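A sketch of this notional emulation, assuming C11 atomics and POSIX threads stand in for two processors sharing an address space (the names shared_buf and msg_ready are illustrative):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>

    static char shared_buf[64];           /* the shared space            */
    static atomic_int msg_ready = 0;      /* "where the message is" flag */

    static void *sender(void *arg) {
        (void)arg;
        strcpy(shared_buf, "hello");      /* write message to shared space */
        atomic_store(&msg_ready, 1);      /* tell the receiver it is there */
        return NULL;
    }

    static void *receiver(void *arg) {
        char local_copy[64];
        (void)arg;
        while (atomic_load(&msg_ready) == 0)   /* wait for notification  */
            ;
        strcpy(local_copy, shared_buf);        /* place in its own space */
        printf("received: %s\n", local_copy);
        return NULL;
    }

    int main(void) {
        pthread_t s, r;
        pthread_create(&r, NULL, receiver, NULL);
        pthread_create(&s, NULL, sender, NULL);
        pthread_join(s, NULL);
        pthread_join(r, NULL);
        return 0;
    }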


2/11/2003 platforms 15

The Other Hardware Component -- The Interconnection Network

• How the components (processor/memory) are connected to each other
• Static
  » Fixed, never changes with time
  » Wires/fiber connecting the processor/memory pairs
  » Limitation -- physical space for the connections
• Dynamic
  » Can be changed with time and execution
  » Wires/fiber connected to a switch
  » The connections in the switch are changed by the program to suit the algorithm being used
  » Limitation -- congestion in the paths of the switch, depending on how carefully the user program is designed

2/11/2003 platforms 16

Flynn's Taxonomy

• General computational activities can be classified by the way the instruction streams and data streams are handled
  – Aside: some modern processor chips support both streams with caches and different data paths -- both kept in RAM
– Single-Instruction, Single-Data (SISD)
– Single-Instruction, Multiple-Data (SIMD)
– Multiple-Instruction, Multiple-Data (MIMD)
– Single-Program, Multiple-Data (SPMD)


2/11/2003 platforms 17

SPMD
• Each processor has its own instruction and data stream (as in a MIMD machine)
• BUT each processor loads and runs the same executable; it does not necessarily execute the same instructions at the same time
• Typical program (PC -- program counter; a C rendering follows):

    if( my_id == 0 ) then   ! I am the master
        ! Set up for myself and workers
        …
        ! Perform some work myself
        …
    else                    ! I am a worker
        ! Start doing useful work from master
        …
    endif

[Figure: each processor's PC points at a different line of this same program -- e.g. proc #0 in the master branch, procs #1 and #2 in the worker branch.]
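A hedged C rendering of the skeleton above, assuming my_id comes from MPI (only the master/worker split is from the slide; the rest is illustrative):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int my_id, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Every process runs this same executable; the branch sends each
           program counter down a different path. */
        if (my_id == 0) {      /* I am the master */
            printf("master: setting up for %d workers\n", nprocs - 1);
            /* ... set up, then perform some work myself ... */
        } else {               /* I am a worker */
            printf("worker %d: doing useful work from master\n", my_id);
            /* ... worker computation ... */
        }

        MPI_Finalize();
        return 0;
    }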

2/11/2003 platforms 18

Summary -- Parallel Issues

• Control: SIMD vs. MIMD
• Coordination: synchronous vs. asynchronous
• Memory Organization: private vs. shared
• Address Space: local vs. global
• Memory Access: uniform vs. non-uniform
• Granularity: power of each processor
• Scalability: efficiency with number of processors
• Interconnection Network: topology, routed, switched


2/11/2003 platforms 19

Categories Of Parallel Processors

• Vector or array processor
• SMP: symmetric multiprocessor
• MPP: massively parallel processor
• DSM: distributed shared memory
• Networked cluster of workstations or PCs
• Hybrid combinations
  • SMP or MPP with vector processors
  • Networked clusters of SMPs

2/11/2003 platforms 20

An Idealized Parallel Computer

• Called the parallel random access machine (PRAM)
• p processors and a global memory of size m
• Memory access is uniform: any processor reaches any memory location in the same time (all processors share the same address space)
• The processors are synchronized by a single global clock
  – Thus, a synchronous shared memory MIMD machine
• But the PRAM model divides into different classes depending on how multiple processors read and/or write the same memory location


2/11/2003 platforms 21

PRAM Subclasses
• Exclusive-read, exclusive-write (EREW)
  • No concurrent reads or writes
  • This is the subclass with the least concurrency
• Concurrent-read, exclusive-write (CREW)
  • Multiple processor reads of the same address are allowed
  • No concurrent writes -- multiple processor writes must be serialized, one at a time
• Exclusive-read, concurrent-write (ERCW)
  • No concurrent reads allowed -- reads must be serialized
  • Concurrent writes allowed
• Concurrent-read, concurrent-write (CRCW)
  • Concurrent reads allowed
  • Concurrent writes allowed

2/11/2003 platforms 22

Concurrency Issue

• Concurrent reads are not a problem
• But concurrent writes must be arbitrated
• Four useful arbitration protocols to be analyzed are (see the sketch below):
  – Common -- write only if the current writes are for the same value -- otherwise the writes fail
  – Arbitrary -- arbitrarily let one processor write; the other processor writes fail
  – Priority -- the processors are ordered in priority; the processor with highest priority succeeds in writing and all the others fail
  – Sum -- the sum of all the processor write values is written into the location and no processor fails
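A small simulation of the four protocols (added for illustration; the arrays wants/val and the choice of processor 0 as highest priority are assumptions, not part of the PRAM definition):

    #include <stdio.h>

    enum protocol { COMMON, ARBITRARY, PRIORITY, SUM };

    /* One location receives concurrent writes from p processors:
       wants[i] marks whether processor i writes, val[i] is its value. */
    int arbitrate(enum protocol pr, int p, const int wants[],
                  const int val[], int old) {
        int first = -1, sum = 0;
        for (int i = 0; i < p; i++) {
            if (!wants[i]) continue;
            if (first < 0) first = i;
            sum += val[i];
            if (pr == COMMON && val[i] != val[first])
                return old;              /* values differ: all writes fail */
        }
        if (first < 0) return old;       /* nobody wrote */
        if (pr == SUM) return sum;       /* sum of all written values */
        return val[first];               /* COMMON: the agreed value;
                                            ARBITRARY: one legal pick;
                                            PRIORITY: lowest index wins */
    }

    int main(void) {
        int wants[4] = {1, 0, 1, 1}, val[4] = {5, 0, 7, 5};
        printf("sum -> %d\n", arbitrate(SUM, 4, wants, val, 0));       /* 17 */
        printf("common -> %d\n", arbitrate(COMMON, 4, wants, val, 0)); /* 0: fails */
        return 0;
    }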


2/11/2003 platforms 23

Dynamic Interconnection Networks

• Consider an EREW PRAM with p processors and m memory locations
• Must connect every processor to every memory location
• Therefore, we need O(mp) switch elements
  – Say, a switch in the form of a crossbar
    » nonblocking, but uses mp switch elements
  – This is very costly to build -- prohibitively so
– Compromise by placing the memory in banks
  – Say b banks with m/b memory locations per bank
  – It is considerably less costly than the above situation -- O(pb) switch elements
  – Its disadvantage is that while one processor is reading/writing a location in a bank, all other processors are blocked from that bank
  – The Cray Y-MP was built with this kind of switch

2/11/2003 platforms 24

Crossbar Switch

• Non-blocking in the sense that a processor's connection to a particular memory bank does NOT block another processor's connection to a different memory bank

• Contention arises when more than one item from a given memory bank is required -- a bottleneck is created


2/11/2003 platforms 25

Crossbar Switch Properties

• The number of switch elements is O(pb)
• It makes no sense to have b less than p
  – Thus, the cost is Ω(p²)
    » This is notation for a lower bound -- that is, there exists an N such that the cost ≥ c·p² for some constant c and for all p ≥ N
• It is not cost effective to have b near m
• A frequently configured system has b as some modest, constant multiple of p, often 1
  – For this case, the cost is O(p²)

2/11/2003 platforms 26

Bus-Based Properties

• A bus is a connection between all processors and all memory banks
• A request is made to the bus to fetch or send data
  – The bus arbitrates all requests and performs the data transfer when it is not in use for other requests
  – The bus is limited in the amount of data that can be sent at once
  – As the number of processors and memory banks increases, processors increasingly wait for data to be transferred
  – Results in what is called saturation of the bus


2/11/2003 platforms 27

Use Of Caches

• The saturation issue is alleviated by processor caches as follows
• Assuming the program needs data in address-local chunks, the architecture transmits a block of items in consecutive addresses instead of just one item at a time
  – The time to send a block is often nearly the same as the time to get the first item (high latency, large bandwidth)
  – By sending chunks at a time, it reduces the number of requests to the bus for data from memory and thus reduces the saturation of the bus
• Drawback: the caches are copies of memory, and the copied data may be in different caches
  – Consider writing into one of them
    » All caches must have the same value for the same memory location (be made coherent) to maintain the integrity of the computation

2/11/2003 platforms 28

Bus Based Interconnection

• Caches introduce the requirement for cache coherency


2/11/2003 platforms 29

The Compromise Connection

• Crossbars are scalable in performance but become very expensive as they scale up

• Buses are inexpensive but are non-scalable because of saturation

• Is there anything that has moderate cost but is scalable?

• Yes: the multistage switch
  – A set of stages whose cost does not grow quickly with increases in the number of processors -- thus, not so costly
    » The set of stages grows with the number of processors in a modest way to avoid saturation
  – The elements in the stages are simple, so that they are inexpensive yet can grow to avoid severe saturation

2/11/2003 platforms 30

Multi-Stage Network

• The stages are switches
• E.g.: an omega network is made up from a collection of 2x2 (sub)switches, each in the pass-through or crossover configuration


2/11/2003 platforms 31

Cost and Performance Vs Number of Processors

• Crossbars cost the most and buses the least as the number of processors increases

• BUT crossbars perform the best and buses the worst as the number of processors increases

• Multistage switches are the compromise and are frequently used

2/11/2003 platforms 32

The Omega Network
• Assume b = p
• The number of stages is log(p)
• The number of switches in a stage is p/2
• Each switch is very simple
• p/2 crossover 2x2 switches -- each switch:
  – has 2 inputs and 2 outputs (hence p/2 switches per stage)
  – is either in the pass-through or crossover configuration
    » The configuration is dynamic and is determined, stage by stage, by one bit of the source and destination addresses of the current message (see slide 34)
  – Each stage has p inputs and p outputs
• Each set of connections from the 0-th position and between all stages is arranged in what is called a perfect shuffle -- see next slide


2/11/2003 platforms 33

Perfect Shuffle

• Position i is connected to position j where:

    j = 2i            for 0 ≤ i ≤ p/2 − 1
    j = 2i + 1 − p    for p/2 ≤ i ≤ p − 1
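A one-line check of the mapping in C (the function name is mine):

    #include <stdio.h>

    /* Perfect shuffle for p positions (p a power of two); equivalently,
       j is a one-bit left rotation of i within log2(p) bits. */
    int shuffle(int i, int p) {
        return (i < p / 2) ? 2 * i : 2 * i + 1 - p;
    }

    int main(void) {
        int p = 8;
        for (int i = 0; i < p; i++)                /* 0->0 1->2 2->4 3->6 */
            printf("%d -> %d\n", i, shuffle(i, p)); /* 4->1 5->3 6->5 7->7 */
        return 0;
    }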

2/11/2003 platforms 34

Configuration For 2x2 Switch

• Suppose a particular 2x2 switch in the k-th stage encounters a message from the processor numbered a going to the processor numbered b. Then, this switch is in:
  – the pass-through configuration when the k-th bits of a and b are the same
  – the crossover configuration when the k-th bits of a and b are different
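A sketch of this rule in C, assuming stage k compares the k-th most significant of the d = log2(p) address bits; it reproduces the 0 -> 5 trace on the next slide:

    #include <stdio.h>

    int main(void) {
        int a = 0, b = 5, d = 3;    /* (000) -> (101) in an 8x8 network */
        for (int k = 0; k < d; k++) {
            int bit_a = (a >> (d - 1 - k)) & 1;
            int bit_b = (b >> (d - 1 - k)) & 1;
            printf("stage %d: %s\n", k,
                   bit_a == bit_b ? "pass-through" : "crossover");
        }
        return 0;      /* prints: crossover, pass-through, crossover */
    }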


2/11/2003 platforms 35

Complete Omega Network For An 8x8 Switch

Each box is a 2x2 crossover switch. The trace of the path for a message from processor 0 to bank 5 (i.e. (000) → (101)) is:
– Stage 0, switch 0 in crossover configuration
  • message goes to stage 1, switch 1
– Stage 1, switch 1 in pass-through configuration
  • message goes to stage 2, switch 2
– Stage 2, switch 2 in crossover configuration
  • message goes to output 101 (5)

[Figure: processors on the left, banks on the right, three switch stages between them.]

2/11/2003 platforms 36

But An Omega Switch May Block

• Consider 2 messages sent at the same time
  • One from processor 2 (010) to bank 7 (111)
  • The other from processor 6 (110) to bank 4 (100)
– They both route to switch 2, stage 0
– One requires the switch to be in the pass-through configuration
– The other requires the switch to be in the crossover configuration
– Both messages end up on the second output port of this switch
– Either the switch blocks, or the link from the second output of switch 2, stage 0 to switch 1, stage 1 overloads (blocks -- two messages simultaneously try to traverse the same wire or link)


2/11/2003 platforms 37

Blocking In An Omega Network

2/11/2003 platforms 38

Static Interconnection Networks

• No switches present (the examples here are connected processors, not connected processors and banks, but the situation is equivalent)

• The connections are fixed, essentially by directly connected wires between processors

• In effect, the processor is the switch, selecting the wire the message is to traverse

• In contrast, in a switch, the message has a header which is read by the switch as the message is received, and the message is routed by analyzing the sender and receiver addresses in the header
  – There is potentially less overhead with static interconnection networks. On the other hand, the processor somehow has to know where to route the message.


2/11/2003 platforms 39

The Completely Connected Network

• Like a crossbar switch (see figure on slide 41)
• Every disjoint pair of processors can communicate in one hop without creating a blocking situation
• In addition, the completely connected network can have one processor communicate with all others simultaneously (as long as the processor supports this)
• But it has physical scaling problems:
  » A chaotic collection of wires for even a modest number of processors (consider, say, 128 processors), where it is physically difficult to find space for the connections

2/11/2003 platforms 40

A Star Connection

• One processor can communicate with all others in one hop
• All other pairs of processors require two hops
  – The center processor thus becomes a bottleneck if the communication is between the "starlets"
  – This network behaves very much like a bus
    » congestion is possible and likely
• Many ethernet clusters in offices without switches behave like this
  – With switches, the backplanes have very large bandwidths to ameliorate the congestion (saturation) conditions


2/11/2003 platforms 41

A Complete Connection and A Star Connection Static Network

• Complete connection
  » Everyone-to-everyone connection -- no preferences
• Star connection
  » The boss/worker model -- the center is the boss

2/11/2003 platforms 42

A Linear And Ring Connected Static Network

• Linear
  – Have to pass the message in the correct direction
    » Message goes through intermediary processors (multiple hops)
• Ring
  – For the shortest path, you have to pass the message in the correct direction
    » Message still goes through intermediary processors (multiple hops)


2/11/2003 platforms 43

2-D Mesh, 2-D Wrapped Mesh, and 3-D Mesh Connected Static Networks

• (b) is also called a 2-d wraparound mesh or 2-d torus
  • If not square, it is called a 2-d rectangular mesh or torus
• These kinds of configurations have been built, including a 3-d torus

2/11/2003 platforms 44

Tree Networks With Message Routing

• The linear array and ring are special cases of trees
• (a) is a complete binary tree with a static network
  » The processors at the connections are also routers or switches
• (b) is a dynamic tree network
  » The processors are only at the leaves -- the interior connections are only switches


2/11/2003 platforms 45

Message Routing In Trees

• Routing in a dynamic tree
  • The message starts up the tree until it finds the root of the smallest subtree that contains both the sender and the receiver
  • Then, the message goes down the tree to the receiver
  • For a static tree, the same algorithm works, except that the message may not need to go down the tree
• Note that there is congestion in trees
  • The upper parts of the tree get more traffic than the lower parts
  • To ameliorate the congestion, fatten the links at the upper parts of the tree
    » Called a fat tree interconnection

2/11/2003 platforms 46

Dynamic Fat Tree Network

• This tree doubles the number of paths as you go higher in the tree
• The CM-5 used a dynamic fat tree with 100s of processors
  • I believe it did not double in width at all levels


2/11/2003 platforms 47

Hypercube Connections
• A cube of dimension d with 2^d processors
  – For d = 1, it is a linear configuration (linear array) with 2 (= 2^1) processors, one at each end
  – For d = 2, it is the configuration of a square, with processors on the corners (4 = 2^2 of them)
  – For d = 3, it is the configuration of a cube, with processors on the corners (8 = 2^3 of them)
  – … (and so on -- can you guess what d = 4 is?)
• The very special properties of such configurations are:
  – For any d-dimensional cube, every processor is connected to d other processors
  – The number of steps (connection segments) between any pair of processors is at most d
  – Actually, the shortest distance is the Hamming distance between the binary representations of the processor numbers, assuming the numbering is from 0 to 2^d − 1
• Note that a d-dimensional hypercube is made up from two (d−1)-dimensional hypercubes, connected at their corresponding positions

2/11/2003 platforms 48

1-D, 2-D, 3-D Hypercube Networks


2/11/2003 platforms 49

Distinct Partitions Of A 3-D Hypercube

• There are always d distinct partitions of a d-dimensional hypercube into two subcubes of dimension d−1

2/11/2003 platforms 50

More Properties Of Hypercubes

• Two processors are neighbors (connected by a direct link) if the binary representations of their processor numbers differ in exactly one bit:
  – Recall the Hamming distance of s and t is the number of 1-bits in s ⊕ t (where ⊕ is exclusive or -- 1 only in positions where the bits differ)
  – Thus, s and t are neighbors if and only if s ⊕ t has exactly one 1-bit
• Consider the set of all processors with the same k bits fixed (any subset of the d bits). Then:
  • This set of processors forms a (d−k)-dimensional hypercube
    – It is a sub-hypercube of the original hypercube


2/11/2003 platforms 51

Sub-cubes

2/11/2003 platforms 52

k-ary d-cube Networks

• Instead of 2 processors in each dimension of the cube (as with the hypercube), consider k processors connected in a ring:
  • A d-dimensional torus with k processors in each dimension is a k-ary d-cube
  • k linearly connected processors form a k-ary 1-cube


2/11/2003 platforms 53

Evaluating Static Interconnection Networks

• Cost and performance measures of networks
– Diameter:
  • Maximum distance between any two processors
    – distance here is the shortest path between the two processors
  • Examples:
    » for a 2-d hypercube, it is 2
    » for a completely connected network, it is 1
    » for a star network, it is 2
    » for a linear array of size p, it is p−1
    » for a ring of size p, it is ⌊p/2⌋
    » for a p × q mesh, it is p + q − 2
    » for a d-dimensional hypercube, it is d
    » for a hypercube of p processors, it is log2 p

2/11/2003 platforms 54

Evaluating Static Networks Continued

– Connectivity:
  • a measure of the number of paths between any two processors
    – the higher the connectivity, the more choice in routing messages and the less chance of contention
  • arc connectivity:
    – the least number of arcs (connections) that must be removed to break the network into disconnected networks
    – Examples:
      » for a star, linear array, and tree network, it is 1
      » for a completely connected network of p processors, it is p−1
      » for a d-dimensional hypercube, it is d
      » for a ring, it is 2
      » for a wraparound mesh, it is 4


2/11/2003 platforms 55

Evaluating Static Networks Continued
– Bisection width, channel width, channel rate, and bisection bandwidth:
  • Bisection width is:
    – the least number of arcs that must be removed to break the network into two partitions with an equal number of processors in each partition
  • Channel width is:
    – the maximum number of bits that can be transmitted simultaneously over a link connecting two processors
      » it is equal to the number of wires (arcs) connecting the two processors
  • Channel rate is:
    – the rate at which bits can be transferred over a single wire (bits/sec)
  • Channel bandwidth is:
    – the peak rate at which a channel can operate; that is, the product channel rate × channel width
  • Bisection bandwidth is:
    – the peak rate along all the arcs that constitute the bisection width; it is the product bisection width × channel bandwidth

2/11/2003 platforms 56

Examples Of Bisection Widths

• Bisection widths:
  – of a d-dimensional hypercube with p processors, it is 2^(d−1), i.e. p/2
  – of a p×p mesh, it is p
  – of a p×p wraparound mesh, it is 2p
  – of a tree, it is 1
  – of a star, it is 1 (by convention -- you cannot make two equal partitions)
  – of a completely connected network of p processors, it is p²/4


2/11/2003 platforms 57

Cost Of Static Networks

• There are two common measures:
• The number of arcs (links)
  – for a d-dimensional hypercube with p = 2^d processors, it is (p log p)/2 -- the solution to a recurrence (c_d = 2c_(d−1) + 2^(d−1), c_0 = 0; see the check below)
  – for a linear array and for trees of p processors, it is p−1
• Or the bisection bandwidth
  – It is related to measures of the minimal size (area or volume) of the packaging needed to build the network
  – For example, if the bisection width is w, then for a:
    » 2-dimensional package, a lower bound on the area is Θ(w²)
    » 3-dimensional package, a lower bound on the volume of the network is Θ(w^(3/2))
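A short check of that recurrence (added here; a d-cube is two (d−1)-cubes plus 2^(d−1) connecting links):

    c_d = 2c_{d-1} + 2^{d-1}, \quad c_0 = 0
    \;\Longrightarrow\; c_d = d \cdot 2^{d-1} = \frac{p \log_2 p}{2} \quad (p = 2^d)

since substituting c_d = d·2^(d−1) gives 2(d−1)·2^(d−2) + 2^(d−1) = d·2^(d−1).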

2/11/2003 platforms 58

Summary: Characteristics Of Static Network Topologies

Network               Diameter         Bisection Width  Arc Connectivity  Cost (No. of Links)
Complete              1                p²/4             p−1               p(p−1)/2
Star                  2                1                1                 p−1
Complete Binary Tree  2 log((p+1)/2)   1                1                 p−1
Linear Array          p−1              1                1                 p−1
Ring                  ⌊p/2⌋            2                2                 p
2-D no-wrap mesh      2(√p − 1)        √p               2                 2(p − √p)
2-D wrap mesh         2⌊√p/2⌋          2√p              4                 2p
Hypercube             log p            p/2              log p             (p log p)/2
Wrap k-ary d-cube     d⌊k/2⌋           2k^(d−1)         2d                dp


2/11/2003 platforms 59

Evaluating Dynamic Interconnection Networks
• What is the diameter for dynamic networks?
  • Nodes are now defined as processors and switches
    – both have a delay connected with processing a message that passes through them
  • Thus, it is defined as the maximum number of nodes between any two processors
• What is connectivity for dynamic networks?
  • Similarly, it is defined as the minimum number of connections that must be removed to partition the network into two unreachable parts
• What is bisection width for dynamic networks?
  • Defined as the minimum number of edges (connections) that must be removed to partition the network into two halves with an equal number of processors
• What is the cost of a dynamic network?
  • Depends on the link cost and the switch cost
  • The switch cost, in practice, dominates, so that the cost becomes the number of switches

2/11/2003 platforms 60

An Example Of Bisection Width

• Lines A, B, and C are 3 cuts that each cross the same number (4) of connections -- each partitions the processors into two equal halves
  – This is the minimum number of connections for any such cut
  – Thus, the bisection width is 4

[Figure: a dynamic network of 4 processors (P) and 4 switches (S), with the three cuts A, B, and C drawn across it.]


2/11/2003 platforms 61

Summary: Characteristics Of Dynamic Network Topologies

Network        Diameter   Bisection Width  Arc Connectivity  Cost (No. of Links)
Crossbar       1          p                1                 p²
Omega Network  log p      p/2              2                 p/2
Dynamic Tree   2 log p    1                2                 p−1

2/11/2003 platforms 62

Cache Coherence In Multiprocessor Systems
• A complicated issue even with uni-processors, particularly when there are multiple levels of cache and caches are refreshed in blocks or lines
• For multiprocessors, the problem is worse
  • As well as multiple copies, there are multiple processors trying to perform reads and writes of the same memory locations
– There are two frequently used protocols to ensure the integrity of the computation
  • Invalidate protocol:
    » Invalidate all locations that are copies of a location when one of the copies is changed. Update the invalid locations only when needed -- this is the most frequently implemented currently
  • Update protocol:
    » Update all locations that are copies of a location when one of the copies is written


2/11/2003 platforms 63

Cache Coherence In Multiprocessor Systems

[Figure: processors P0 and P1 each execute load x and obtain x = 1 from memory, so both caches hold x = 1. P0 then executes write #3, x. Under the invalidate protocol, P0's cache holds x = 3 while memory and P1's copy remain at x = 1 and P1's copy is invalidated. Under the update protocol, memory and P1's copy are updated to x = 3.]

2/11/2003 platforms 64

Property Of The Update Protocol

• Invalidate protocol:
  – Suppose a processor reads a location once (thus placing it in its cache) and never accesses it again
  – Another processor reads and updates that location many times (also in its cache)
    » Behavior: the second processor causes the first processor's cache location to be marked invalid, and it is not necessarily updated in the first processor's cache -- continually updating the first processor's cache, as the update protocol would, would be a waste of effort


2/11/2003 platforms 65

The False Sharing Issue
– Caused by cache lines -- the other addresses that are brought into the cache when any single value is loaded into the processor's cache
  • Sharing of data among processors now occurs when a second processor accesses a data item in the same line, but not the same value
    – Thus a whole block (line) of data is shared and located in two or more caches
      » Now consider that each processor repeatedly writes into a different item in the same cache line
      » It looks like the items are being shared (hence "false") but they are not
      » Updates to one item in the line require the whole line to be invalidated or updated, as the protocol requires, when in fact there is no sharing of the same item
– It turns out the cost for the update protocol is slightly less in this case
– But the tradeoff between communication overheads (updating) and idling (stalling for invalidates) is better for the invalidate protocol

2/11/2003 platforms 66

Maintaining Coherence With The Invalidate Protocol

• For analysis and understanding, consider 3 states for a memory address:
  • Shared state
    » Two or more processors have loaded the memory location
  • Invalid state
    » At least one processor has updated the value, and all other processors mark their copies with this state
  • Dirty state
    » The processor that modifies a value while there are other copies of it in other processors marks its copy with this state
    » The processor holding a value in this state is the source processor for any updates to this value, when needed


2/11/2003 platforms 67

State Diagram For The 3-State Coherence Invalidate Protocol

[State diagram: the three states Shared, Invalid, and Dirty. Processor actions: a read takes Invalid to Shared; a write takes Invalid or Shared to Dirty; a read leaves Shared unchanged, and read/write leave Dirty unchanged. Coherence actions: a C_write (a write observed elsewhere) takes Shared or Dirty to Invalid; a C_read (a read observed elsewhere) takes Dirty to Shared, with a flush. A sketch follows.]

Legend of state changes: transitions are labeled as processor actions or as coherence actions; the flush action may also occur when the processor replaces a dirty item in a cache-replacement action.
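A sketch of the diagram as a transition function in C (the LOCAL_/REMOTE_ event names are illustrative stand-ins for the processor and coherence actions):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, DIRTY } state_t;
    typedef enum { LOCAL_READ, LOCAL_WRITE,
                   REMOTE_READ, REMOTE_WRITE } event_t;

    state_t next_state(state_t s, event_t e) {
        switch (s) {
        case INVALID:
            if (e == LOCAL_READ)  return SHARED;   /* read: fetch a copy    */
            if (e == LOCAL_WRITE) return DIRTY;    /* write: own the line   */
            return INVALID;
        case SHARED:
            if (e == LOCAL_WRITE)  return DIRTY;   /* write, C_write others */
            if (e == REMOTE_WRITE) return INVALID; /* C_write observed      */
            return SHARED;                         /* reads stay shared     */
        case DIRTY:
            if (e == REMOTE_READ)  return SHARED;  /* C_read: flush, share  */
            if (e == REMOTE_WRITE) return INVALID; /* C_write observed      */
            return DIRTY;                          /* local read/write      */
        }
        return s;
    }

    int main(void) {
        state_t s = INVALID;
        s = next_state(s, LOCAL_READ);    /* INVALID -> SHARED */
        s = next_state(s, LOCAL_WRITE);   /* SHARED  -> DIRTY  */
        s = next_state(s, REMOTE_READ);   /* DIRTY   -> SHARED */
        printf("final state: %d\n", s);   /* 1 (SHARED)        */
        return 0;
    }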

2/11/2003 platforms 68

Parallel Program Execution On A 3-State Coherence System Using The Invalidate Protocol

[Table: a trace of the variables and their states (S = shared, I = invalid, D = dirty) in global memory and in the caches of processors 0 and 1, step by step. Processor 0 executes: read x; x = x + 1; read y; x = x + y; x = x + 1. Processor 1 executes: read y; y = y + 1; read x; y = x + y; y = y + 1. Each read enters a variable in the shared state; each write makes the writer's copy dirty and the other copies invalid.]


2/11/2003 platforms 69

Implementation Techniques For Cache Coherence

• Three frequently used methods are:
  • Snoopy systems
  • Directory-based (snoopy) systems
  • Distributed directory-based systems

2/11/2003 platforms 70

Snoopy Cache Systems

• On broadcast interconnection networks with a bus or a ring:
  • Each processor snoops on the bus, looking for transactions that affect its cache
  • The cache has tag bits that specify the cache line state
    – When a processor with a dirty item sees a read request for that item, it takes control of the request and sends the data out
    – When a processor sees a write to an item it owns, it invalidates its copy


2/11/2003 platforms 71

Diagram Of A Snoopy Bus

[Figure: several processor/cache pairs, each with tags and snoop hardware watching a shared address/data bus that also connects to memory; one cache line is marked dirty.]

2/11/2003 platforms 72

Performance Of Snoopy Caches
• Extensively studied and implemented
• Implementation is simple and straightforward
  » Can be easily added to existing bus-based systems
• Good performance properties in the sense that:
  » If different caches access different data (the expected case), it performs well
  » Once a line is designated dirty, the processor with the dirty tag can continue to use the data without penalty
  » Also, computations that only read shared data perform well
• Poor performance in the case that the processors are reading and updating the same data value
  » Generates many coherence operations across the bus
  » Because it is a shared bus (shared between processors and with data movement), the coherence operations saturate the bus


2/11/2003 platforms 73

Directory-Based Systems

• Directory-based systems provide a solution to this problem
  » With snoopy buses, each processor has to continually monitor the bus for updates of interest
  » The solution is to find a way to direct the updates only to the processors that need them
• This is done with a directory associated with each block of memory, specifying who has each block and whether the block is shared

2/11/2003 platforms 74

Centralized Directory-Based System

[Figure: several processor/cache pairs attached to an interconnection network, with the directory and memory on the other side; each data block carries a state field and presence bits. The example entry marks a block in the shared state (S) with the presence bits for processors 0 and 3 set.]


2/11/2003 platforms 75

Typical Scenario For A Directory-Based Snoopy System

• Consider the scenario of slide 63 (a sketch follows)
• Both processors access x (with value 1)
  » x is moved to each cache and is marked shared -- directory entry: S 1 1 0 0, x = 1
• Processor 0 executes a store to x (with value 3)
  » x in the directory is marked as dirty -- directory entry: D 1 0 0 0, x = 1
  » The presence bits for all other processors are turned off
  » Processor 0 can access the changed x at will (memory not yet changed)
• Processor 1 accesses x
  » x has a dirty tag, and the directory shows that processor 0 has x
  » Processor 0 updates the memory block and sends the updated x to processor 1
  » The presence bits for processors 0 and 1 are set and x is now marked as shared -- directory entry: S 1 1 0 0, x = 3
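A sketch of one directory entry and the three steps of this scenario in C (the struct layout is an assumption for illustration):

    #include <stdio.h>
    #define NPROC 4

    typedef struct {
        char state;             /* 'S' shared, 'D' dirty        */
        int  presence[NPROC];   /* which processors hold a copy */
        int  data;              /* the block's value in memory  */
    } dir_entry;

    static void show(const dir_entry *e) {
        printf("%c ", e->state);
        for (int i = 0; i < NPROC; i++) printf("%d ", e->presence[i]);
        printf(" x = %d\n", e->data);
    }

    int main(void) {
        dir_entry x = { 'S', {1, 1, 0, 0}, 1 };   /* both caches read x = 1 */
        show(&x);                                 /* S 1 1 0 0  x = 1 */

        x.state = 'D'; x.presence[1] = 0;         /* processor 0 stores 3;  */
        show(&x);                                 /* D 1 0 0 0  x = 1       */
                                                  /* memory not yet changed */

        x.data = 3; x.state = 'S'; x.presence[1] = 1;  /* processor 1 reads */
        show(&x);                                 /* S 1 1 0 0  x = 3 */
        return 0;
    }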

2/11/2003 platforms 76

Performance Of Directory Based Systems
– Implementation is more complex
• Good performance properties (as before) in the sense that
  » If different caches access different data (the usually expected case), it performs well
  » Once a line is designated dirty, the processor with the dirty tag can continue to use the data without penalty
  » Also, computations that only read shared data perform well
• Poor performance in the case that the processors are reading and updating the same data value
– Encounters two kinds of overhead:
  » Propagation of state and generation of state
  » These result in communication and contention problems, respectively
  » The contention is on the directory -- too many requests to the directory for information
  » The communication is on the bus -- solution: increase its bandwidth
– There is an issue with the directory -- it takes space
  » Reduce the size of the directory by increasing the block length of the cache line -- BUT this may increase the false sharing overhead


2/11/2003 platforms 77

Distributed Directory Schemes
• Contention on the directory of states can be ameliorated by distributing the directory
  • This distribution is performed consistently with the memory distribution
    » The state and presence bits are kept near/in each piece of the distributed memory
    » Each processor maintains the coherence of its own memory
• Behavior:
  » A first read is a request from the user processor to the owner processor for the block, and state information is set in the owner's directory
  » A write by a user propagates an invalidate to the owner, and the owner forwards that state to all other processors sharing the data
• The effect:
  » The directory is distributed, and contention is only for the owner processor -- not the previous situation where several processors are contending for the information in one directory

2/11/2003 platforms 78

Distributed Directory-Based System

[Figure: each node combines a processor/cache with its own piece of memory and the presence bits/state for that memory; the nodes communicate through an interconnection network.]


2/11/2003 platforms 79

Performance Of Distributed Directory Schemes

• Better performance
  – Because this design permits O(p) simultaneous coherence operations

• Scales much better than the simple snoopy or centralized directory-based systems
  – Now the latency and bandwidth of the network become the bottleneck

2/11/2003 platforms 80

Message Passing Costs
• Startup time ts:
  • Time to process the message at the send and receive nodes
    » Time for adding the header, trailer, and error correction information, executing the routing algorithm, and interfacing between node and router
• Per-hop time th:
  • Time for the header of the message (over 1 link) to leave one node and arrive at the next node
    » Also called node latency
    » It is dominated by the latency of determining which output buffer or channel to route the message to
• Per-word transfer time tw:
  • It is 1/r, where r is the channel bandwidth measured in words per second
    » This time includes network and buffering overheads


2/11/2003 platforms 81

Store-Forward Routing Scheme

• To send a message down a link,
  • the entire message is stored in the receiver before further processing of the message can begin at the receiver, or of other messages in the sender
• Total communication time:
  • For a message of size m traversing l links, it is:

      tcomm = ts + (m tw + th) l

  • For typical algorithms, th is small compared with m tw, even for small m, and so th is ignored. That is,

      tcomm = ts + m tw l

2/11/2003 platforms 82

Packet Routing

• Waiting for the entire message is inefficient
• For LANs, WANs, and long-haul networks, the message is broken down into packets
  – Intermediate nodes wait only on small pieces (packets), not the whole message
  – Reduces the overhead for handling an error
    » Only a packet gets resent, not the entire message
  – Allows packets to take different paths
  – Error correction is on smaller pieces and so is more effective
  – BUT there are overheads, because
    » each packet has to have routing, error correction, and sequencing information stored in it
• The advantages outweigh the overhead incurred


2/11/2003 platforms 83

A Cost Model For Packets
• The packet size is r + s, where:
  – r is the piece of the original message carried in the packet
  – s is the size of the additional information needed to deal with packets
• The time to packetize the message is proportional to the size of the message and is m tw1
• Let the number of hops be l
• Let the time to communicate one word over the network be tw2, with latency th
  – Time to receive the first packet of a message: th l + tw2 (r + s)
  – There are m/r − 1 remaining packets to send
  – Thus the time is (simplified):

      tcomm = ts + th l + tw m, where tw = tw1 + tw2 (1 + s/r)

2/11/2003 platforms 84

Cut-Through Routing
• Packets, but
  • the processor-interconnect network is limited in generality, size, and amount of noise or interference
  – Thus, the overheads can be reduced, because:
    » No routing information is needed
    » No sequencing information is needed, as the packets are transmitted and received in order
    » Errors can be associated with the whole message instead of with packets
    » Errors occur less frequently, so simpler schemes can be used
• The term cut-through routing is applied to this simpler packetizing of the messages
  – The packets are of fixed size and are called flow control digits, or flits
    » Smaller than long-haul network packets, as there are no headers


2/11/2003 platforms 85

Cut-Through Routing Continued

• Tracer packets establish the route:
  – A tracer packet is sent ahead to initiate the route for the message -- it sets the path for the flits to follow
  – The flits pass down the path one after the other
  – They are not buffered at each node but are passed on as soon as their arrival is complete, without waiting for the next one -- this reduces the memory and memory bandwidth needed at each node

2/11/2003 platforms 86

Store-Forward And Cut-Through Routing

[Figure: a message traversing several links to processor P4, shown under store-and-forward routing (each node buffers the entire message before forwarding it) and under cut-through routing (the flits are pipelined through the intermediate nodes).]


2/11/2003 platforms 87

A Cost Model For C-T Routing
• Assume the number of links is l
• Assume the number of words in the message is m
• Assume the startup time is ts
• There is clearly time to start up and shut down the flit pipeline, which is proportional to l and must be included in the per-hop time below
• There is also the time to send the tracer packet, which is proportional to l and must be included in the per-hop time
• Assume the per-word transmission time is tw
  • Then the cost for m words is m tw
• Assume the total per-hop time is l th
  • It clearly depends on the number of links in this way, because each processor is simultaneously handling a different flit
• Then, the communication cost is:

      tcomm = ts + l th + m tw

  – Compare this with the store-forward time tcomm = ts + m tw l
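A quick numeric comparison of the two formulas (the parameter values below are invented for illustration):

    #include <stdio.h>

    int main(void) {
        double ts = 100.0, th = 1.0, tw = 0.5;  /* assumed, microseconds */
        double m = 1000.0, l = 10.0;

        double t_sf = ts + (m * tw + th) * l;   /* store-and-forward */
        double t_ct = ts + l * th + m * tw;     /* cut-through       */

        printf("store-forward: %.1f us\n", t_sf);  /* 5110.0 */
        printf("cut-through:   %.1f us\n", t_ct);  /*  610.0 */
        return 0;
    }

The compounding m·l factor in the store-and-forward time is what the next slide calls out.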

2/11/2003 platforms 88

Performance Comparison And Issues
• For C-T routing vs S-F routing:
  • S-F has the compounding factor ml
  • C-T routing is linear in m and l
• Size of flits:
  • Too small implies processing lots of flits, so the processing time must be very fast
  • Too large means more memory and memory bandwidth is needed, or latency is increased
    – Thus, there is a tradeoff in the design of the routers
      » Flits are typically 4 bits to 32 bytes
• Congestion and contention
  • C-T routing can deadlock, particularly when heavily loaded
    – The solution is message buffering and/or careful routing


2/11/2003 platforms 89

Deadlock In Cut-Through Routing

[Figure: the destinations of messages 0, 1, 2, and 3 are processors A, B, C, and D respectively, arranged in a cycle. A flit from message 0 occupies path CB; it cannot progress because the flit from message 3 occupies path BA; and so on around the cycle, so no message can advance.]

2/11/2003 platforms 90

A Simplified Cost Model For C-T Routing
• The cost model for C-T routing is:

      tcomm = ts + l th + m tw

• To minimize this cost, we might design our software so that we:
  – Communicate in bulk (reduce the effect of ts)
    » This reduces the number of messages, so that the number of times we pay for ts is reduced -- this is appropriate because ts is usually large relative to th and tw
  – Minimize the data volume (reduce the term m tw)
    » Reduce the amount of communication
  – Minimize the distance traveled by the messages (reduce l)
    » This reduces the term l th


2/11/2003 platforms 91

A Simplified Cost Model Continued

• On the other hand:
  • Message passing libraries such as MPI give users very little control over the mapping between the programmer's logical processors and the machine's physical processors
  • Message passing implementations will use 2-hop schemes to reduce contention, picking the middle node at random
  • Most switches nowadays have essentially equal link times between any two processors
  • Thus, l th is small compared with the other terms
– For these reasons, we use the simplified cost model: tcomm = ts + m tw
– That is, drop the l th term

2/11/2003 platforms 92

Communication Costs For Shared-Address-Space

• The bottom line is that there is no uniform and simple model except for very special cases and architectures -- we give up on a general treatment
• The issues that cause this are:
  – Memory layout
    » Determined by the system and compiler
  – Size of caches, particularly when the caches are small
  – Details of the invalidate and/or update protocols for the caches
  – Spatial locality of data
    » Cache line sizes vary, and the compiler locates the data
  – Pre-fetching by the compiler
  – False sharing (depends on threading, scheduling, and the compiler)
  – Contention for resources (cache update lines, directories, etc.)
    » Depends on the execution scheduler


2/11/2003 platforms 93

Routing Mechanisms For Interconnection Networks

• Routing mechanism
  • The process whereby the network determines what route or routes to choose for a message, and the way it routes the messages
    » May use the state of the network (how busy parts of the network are) to determine the path
– Minimal routing
  • A path of least length is used -- usually computed directly from the source and destination addresses
  – Minimal routing can lead to congestion, so non-minimal routing is often used
– Non-minimal routing
  • A path of least length is not necessarily chosen
    – To avoid congestion,
      » the route is selected randomly or uses network state information

2/11/2003 platforms 94

Routing Mechanisms Continued

• Routing can be deterministic:
  • Always the same route, given the addresses of the source and destination
• Routing can be adaptive:
  • Tries to avoid congestion and delay, depending on what is currently being used (or may be a 2-hop scheme with the middle node selected at random)
• Dimension ordered routing
  • Uses the dimension properties of the connection network to determine the path -- routes via minimal paths along dimensions
    » For a mesh, it is X-Y routing
    » For a hypercube, it is called E-cube routing


2/11/2003 platforms 95

X-Y Routing For A Mesh

• Send the message along the row of the source processor until it reaches the column (the X value) of the destination processor

• Then, send the message along the column to the destination processor
  – This is a minimal path
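A minimal sketch of the route computation (the coordinates and endpoints are illustrative):

    #include <stdio.h>

    int main(void) {
        int x = 0, y = 0, dx = 3, dy = 2;   /* source (0,0), dest (3,2) */
        while (x != dx) {                   /* first move along the row */
            x += (dx > x) ? 1 : -1;
            printf("-> (%d,%d)\n", x, y);
        }
        while (y != dy) {                   /* then along the column    */
            y += (dy > y) ? 1 : -1;
            printf("-> (%d,%d)\n", x, y);
        }
        return 0;   /* |dx - sx| + |dy - sy| hops: a minimal path */
    }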

2/11/2003 platforms 96

E-Cube Routing

• Let Ps and Pd be the binary labels of nodes s and d
  – Then, the Hamming distance is the number of 1-bits in Ps ⊕ Pd
• Its 1s indicate which dimensions to send the message along
  – The next dimension is computed from the destination address and the address of the current node holding the message; that is,
    » Compute Ps ⊕ Pd and send the message from Ps along the dimension corresponding to the least significant 1-bit
    » The message is now at processor Pq; now form Pq ⊕ Pd
    » Send the message from Pq along the dimension of the least significant 1-bit of Pq ⊕ Pd; the message is now at a new Pq
    » Repeat until the message arrives at its destination
    » NOTE: the assumption is that the message always contains the address of its destination
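A sketch of the loop in C, using the fact that diff & -diff isolates the least significant 1-bit:

    #include <stdio.h>

    int main(void) {
        int cur = 2, dest = 7;        /* 010 -> 111 in a 3-D hypercube  */
        while (cur != dest) {
            int diff = cur ^ dest;    /* bits where cur and dest differ */
            int dim  = diff & -diff;  /* least significant 1-bit        */
            cur ^= dim;               /* traverse that dimension        */
            printf("-> node %d\n", cur);   /* 010 -> 011 -> 111         */
        }
        return 0;
    }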


2/11/2003 platforms 97

E-Cube Routing For 3-D Cube

• Form the exclusive-or of the current and destination labels
  – Send the message from the source processor to the next node by flipping the bit corresponding to the least significant 1-bit of that exclusive-or; repeat from each new node until the destination is reached

2/11/2003 platforms 98

Embedding Other Networks Into A Hypercube Network

• Using all of the processors of a hypercube and an appropriate subset of the connections, can the processors be configured to look like other networks?

• This is an important process, because an algorithm may naturally fit one network configuration while the actual machine uses a different one

• The problem is a general graph problem:
  • Given two graphs G(V,E) and G'(V',E'),
    map each vertex in V onto one or more vertices in V'
    map each edge in E onto one or more edges in E'
  • The vertices are processors and the edges are network links

• Yes, as shown in the following cases and slides


2/11/2003 platforms 99

Example Of A Bad Case

• Take a 4×4 mesh and map the nodes at random to another 4×4 mesh
• The original arrangement was designed to avoid congestion of communication
  – Each link is used just once for all pairs of communication
• The particular random arrangement can congest up to 5 messages on one link
  – k communicates with g, o, and l down the k-h link
  – j communicates with i along the k-h link
  – d communicates with h along the k-h link
    » the text suggests 6 links but I do not see that case

2/11/2003 platforms 100

Diagram Of Bad Case

• Up to 5 paths mapped to the same link, assuming the original code performed only nearest neighbor communication

[Figure: the original 4×4 mesh with nodes a through p, where communication (dotted lines) is only between nearest neighbors, and the same grid with the processors mapped to it at random. In the random arrangement: k communicates with g, o, and l down the k-h link; d communicates with h along the k-h link; and j communicates with i along the k-h link.]


2/11/2003 platforms 101

Properties Of Such Mappings

• The maximum number of edges in E mapped onto a single edge in E'
  • This is called the congestion
  • It measures the amount of traffic required on an edge in G'
• The maximum number of edges in E', joined together, that correspond to a single edge in E
  • This is called the dilation
  • It measures the increased delay in G' caused by traversing multiple links
• The ratio of the number of processors in V' to the number of processors in V
  • This is called the expansion
• The cases described next all have an expansion of 1

2/11/2003 platforms 102

Embedding A Linear Array Into A Hypercube
• Consider a linear array (or ring) of 2^d processors
  • For convenience, the linear array processors are labeled from 0 to 2^d − 1
  – Processor i in the linear array maps to the processor numbered G(i,d) in the hypercube, using the following mapping:

      G(0,1) = 0
      G(1,1) = 1
      G(i, x+1) = G(i, x)                        for i < 2^x
      G(i, x+1) = 2^x + G(2^(x+1) − 1 − i, x)    for i ≥ 2^x

• G is called the binary reflected Gray code (RGC)
• The (d+1)-bit Gray codes are derived from the d-bit Gray codes as follows:
  • For d+1, take two copies of the d-bit codes
  • For one copy of the d-bit codes, prefix a 0 bit
  • For the other copy, reflect (reverse the order of) the d-bit codes and prefix a 1 bit
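A check of the mapping in C, using the standard closed form of the binary reflected Gray code, g(i) = i XOR (i >> 1), which satisfies the recurrence above:

    #include <stdio.h>

    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    int main(void) {
        int d = 3, p = 1 << d;
        /* Consecutive ring positions map to hypercube nodes that differ
           in exactly one bit, so every ring link is a hypercube link. */
        for (int i = 0; i < p; i++)
            printf("ring %d -> cube node %u\n", i, gray(i));
        return 0;  /* 0,1,3,2,6,7,5,4 -- and gray(7)=4 is one bit from gray(0)=0 */
    }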


2/11/2003 platforms 103

A Ring Of 8 Processors Embedded Into a 3-d Hypercube

[Figure: the table of ring position i against hypercube label G(i,3), giving the order in the ring, and the 3-d hypercube with the ring drawn through it; node 0 of the ring is adjacent to the last node (7) of the ring. This mapping has a dilation of 1 and a congestion of 1.]

2/11/2003 platforms 104

A Ring Of 8 Processors Embedded Into a 3-d Hypercube

• Note:
  • Hypercube processors in consecutive rows of the Gray code table differ by one bit -- thus, they are adjacent in the cube
    – Thus, each edge in the linear array maps to one and only one edge in the hypercube
    – Thus, the dilation (the maximum number of edges in E' that an edge in E is mapped to) is 1
    – Also, the congestion (the maximum number of edges in E mapped onto a single edge in E') is 1, by the requirements of G


2/11/2003 platforms 105

Meshes Embedded Into Hypercubes
• We consider the mesh to be a wraparound mesh of size 2^r × 2^s and the hypercube to be of dimension r+s
  – We use the properties of the mapping of a ring to a hypercube to do this -- because each row and each column is a ring -- as follows:
    » For processor (i,j) in the mesh (the processor at the intersection of row i and column j), the binary hypercube processor number is G(i,r) || G(j,s), where || is the concatenation of two binary strings
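A sketch of the concatenated mapping in C (the function names are mine):

    #include <stdio.h>

    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    /* Mesh node (i,j) of a 2^r x 2^s wraparound mesh maps to hypercube
       node G(i,r) || G(j,s): Gray-code each coordinate, concatenate. */
    unsigned mesh_to_cube(unsigned i, unsigned j, unsigned s) {
        return (gray(i) << s) | gray(j);
    }

    int main(void) {
        unsigned r = 2, s = 2;     /* a 4 x 4 wraparound mesh */
        for (unsigned i = 0; i < (1u << r); i++)
            for (unsigned j = 0; j < (1u << s); j++)
                printf("(%u,%u) -> %u\n", i, j, mesh_to_cube(i, j, s));
        return 0;
    }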

2/11/2003 platforms 106

Meshes Embedded Into Cubes


2/11/2003 platforms 107

Embedding A Square p-Processor Mesh Into A p-Processor Ring

• The number of links in a p-processor square wraparound mesh is 2 · √p · √p = 2p
  – √p links in each row, with √p rows
  – Considering the columns, there are just as many links again
  – Thus, there is a total of 2p links
• The number of links in a p-processor ring is p
• Thus, there must be congestion
• The natural mappings from mesh to ring and ring to mesh are shown on the next slide

2/11/2003 platforms 108

Mesh To Linear And Vice Versa

Legend: bold lines are links in the linear array; dotted lines are links in the mesh

• Linear array onto mesh
  – Congestion: 1
  – Dilation: 1
• Mesh onto linear array (pictured for p = 16)
  – Congestion: 5 = √p + 1 (5 crossings)
  – Dilation: 7 = 2√p − 1 (7 bold edges)


2/11/2003 platforms 109

Can We Do Better?

• What is a lower bound for the congestion?
  » Recall that congestion is the maximum number of edges in E mapped onto a single edge in E'
• It turns out to be √p, so it is possible to do better, but not that much better
  – A simple argument, but one that is overly optimistic:
    » The mesh has 2p links; the linear array has p links
    » Therefore, a congestion of 2 seems possible
• Proof (using bisection width, not links):
  – The bisection width of a 2-D mesh is √p
  – The bisection width of a linear array is 1
  – Thus, the congestion is at best √p

2/11/2003 platforms 110

Hypercubes Embedded Into 2-D Meshes

• Assume p nodes in the hypercube, and assume p is an even power of 2
• Treat the hypercube as √p sub-hypercubes, each with √p nodes
  » Let d = log2 p -- d is even by assumption
  » Consider the sub-hypercubes with the least significant d/2 bits varying and the first d/2 bits fixed
• Map each of these sub-hypercubes to a row of a √p × √p mesh; each row has √p nodes and there are √p rows
  – Use the inverse of the mapping from a linear array to a hypercube on slide 102
  – Connect the nodes column-wise, so that the nodes with the same d/2 least significant bits are in the same column
    » Notice that this is a similar arrangement as for the rows


2/11/2003 platforms 111

16-Node Hypercube To 16-Node 2-D Mesh

[Figure: a 16-node hypercube mapped onto a 4×4 mesh. Within each row the nodes' least significant two bits follow the Gray sequence 00, 01, 11, 10; all nodes whose least significant bits are 10 fall in the same column, and similarly for the other columns. See the textbook for a 32-node example, Figure 2.33, page 72.]

2/11/2003 platforms 112

Congestion Is √p/2 And Is Best
• The argument is as follows:
  – The congestion of this mapping is √p/2
  • Proof:
    – The bisection width of a √p-node hypercube is √p/2
    – The bisection width of a row (linear array) is 1
    – The congestion of a row is thus √p/2 (the ratio (√p/2)/1)
    – Similarly, for the columns, the congestion is √p/2
    – Because the row and column mappings affect disjoint sets of links, the overall congestion is the same, namely √p/2
  – The lower bound for the congestion is √p/2
  • Proof:
    – (bisection width of the hypercube)/(bisection width of the mesh) = (p/2)/√p = √p/2
– Thus, because the lower bound equals what the mapping achieves, this is the best mapping of a hypercube onto a 2-D mesh


2/11/2003 platforms 113

Processor-Processor Mapping And Fat Interconnection Networks

• The previous examples were mappings of dense networks to sparse networks
  • The congestion was larger than one
  • If both networks have the same bandwidth on each link, this could be disastrous
  • The denser network usually has higher dimensionality, which is costly
    » Complicated layouts, wire crossings, variable wire lengths
  • However, the sparser network is simpler, and it is easier to make the congested links fatter
– Example:
  – The congestion of the mapping from the hypercube to the 2-D mesh with the same number of processors is √p/2
  – Widen the paths of the mesh by a factor of √p/2
  – Then the two networks have the same bisection bandwidth
  – The disadvantage is that the diameter of the mesh is larger than that of the cube

2/11/2003 platforms 114

Cost Performance Tradeoffs -- The Mesh Is Better For The Same Cost

• Consider a fattened mesh (fattened so as to make the costs of the networks the same) and a hypercube with the same number of processors
• Assume the cost of the network is proportional to the number of wires
  » Increase the links on the p-node wraparound mesh by a factor of (log p)/4 wires, which makes the p-node hypercube and the p-node wraparound mesh have the same cost
• Let's compare the average communication costs
• Let lav be the average distance between any two nodes
  – For the 2-D mesh, this distance is √p/2 links
  – For the hypercube, it is (log p)/2 links


2/11/2003 platforms 115

Cost Performance Tradeoffs Continued
• The average time for a message of length m (over lav hops) is:
  – For the 2-D mesh with cut-through routing: ts + th lav + tw m
    » Because the channel width has been increased by a factor of (log p)/4, the term tw m is decreased by that factor, and lav is replaced by √p/2:

      tav_comm_2D = ts + th √p/2 + 4 tw m/(log p)

  – For the hypercube (with lav = (log p)/2):

      tav_comm_cube = ts + th (log p)/2 + tw m

• Comparison: for large m
  – The average communication time is smaller for the 2-D mesh for p > 16 (see the check below)
    » This is not true for the store-forward protocol on the mesh
    » This comparison is for the case that the cost is determined by the number of links
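A one-line check of the p > 16 claim (for large m the th terms are lower order, so compare the m-terms):

    \frac{4 t_w m}{\log_2 p} < t_w m \iff \log_2 p > 4 \iff p > 16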

2/11/2003 platforms 116

Cost Performance Tradeoffs Continued

• Suppose the cost is determined by the bisection width
• This time, we increase the bandwidth on the mesh links by a factor of √p/4
  » This is the ratio of the bisection width of a hypercube to that of a wraparound mesh, with p processors each -- that is, (p/2)/(2√p) = √p/4 (see slide 58)
• Comparison:
  – For the 2-D mesh with cut-through routing (with lav = √p/2):

      tav_comm_2D = ts + th √p/2 + 4 tw m/√p

  – For the hypercube (with lav = (log p)/2):

      tav_comm_cube = ts + th (log p)/2 + tw m

  – Again, the average communication time is smaller for the 2-D mesh for p > 16