1 Introduction to Parallel Computing



Introduction to Parallel Computing



Multiprocessor Architectures

• Message-Passing Architectures
– Separate address space for each processor.
– Processors communicate via message passing.

• Shared-Memory Architectures
– Single address space shared by all processors.
– Processors communicate by memory read/write.
– SMP or NUMA.
– Cache coherence is an important issue.

• Lots of middle ground and hybrids.
• No clear consensus on terminology.



Message-Passing Architecture

[Figure: nodes connected through an interconnection network.]



Shared-Memory Architecture

[Figure: nodes connected through an interconnection network.]



Shared-Memory Architecture: SMP and NUMA

• SMP = Symmetric Multiprocessor
– All memory is equally close to all processors.
– Typical interconnection network is a shared bus.
– Easier to program, but doesn’t scale to many processors.

• NUMA = Non-Uniform Memory Access
– Each memory is closer to some processors than others.
– a.k.a. “Distributed Shared Memory”.
– Typical interconnection network is a grid or hypercube.
– Harder to program, but scales to more processors.



Shared-Memory Architecture: Cache Coherence

• Effective caching reduces memory contention.
• Processors must see a single consistent memory.
• Many different consistency models.
• Weak consistency is sufficient.
• Snoopy cache coherence for bus-based SMPs.
• Distributed directories for NUMA.
• Many implementation issues: multiple levels, I-D separation, cache line size, update policy, etc.
• Usually don’t need to know all the details.



Example: Quad-Processor Pentium Pro

• SMP, bus interconnection.
• 4 x 200 MHz Intel Pentium Pro processors.
• 8 + 8 KB L1 cache per processor.
• 512 KB L2 cache per processor.
• Snoopy cache coherence.
• Compaq, HP, IBM, NetPower.
• Windows NT, Solaris, Linux, etc.



Beowulf-based cluster of Linux/Intel

• 24 PCs.
• 100 Mbit switch.
• Node config:
– 2 x 500 MHz Pentium III
– 512 MB RAM
– 12-16 GB disk



The first program

• Purpose: Illustrate notation.
• Given
– Length of vectors M.
– Data x_m, y_m, m = 0,1,…,M-1 of real numbers, and two real scalars α and β.
• Compute
– z = αx + βy, i.e., z[m] = α x[m] + β y[m] for m = 0,1,…,M-1.



Program Vector_Sum_1


m: integer;
x,y,z: array[0,1,…,M-1] of real;

<; m: 0 ≤ m < M :: x[m] = x_m, y[m] = y_m>

<|| m: 0 ≤ m < M :: z[m] = α x[m] + β y[m]>


Here || is a concurrent operator. If two operations O_1 and O_2 are separated by ||, i.e. O_1 || O_2, then the two operations can be performed concurrently, independently of each other.

In addition,

<|| m: 0 ≤ m < M :: O_m>

is short for O_0 || O_1 || … || O_{M-1}, meaning that all M operations can be done concurrently.
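As a rough illustration only (not part of the original slides), the quantified concurrent statement can be mimicked in Python. The names `vector_sum`, `alpha` and `beta` are assumptions made for this sketch, and a thread pool is just one way to express that the M operations are independent; it is not meant as a performance recipe.

```python
from concurrent.futures import ThreadPoolExecutor

def vector_sum(alpha, x, beta, y):
    """Compute z[m] = alpha*x[m] + beta*y[m] for every index m."""
    def op(m):
        # Operation O_m reads and writes only index m,
        # so it does not depend on any other O_k.
        return alpha * x[m] + beta * y[m]

    with ThreadPoolExecutor() as pool:
        # The O_m may execute in any order; map returns results in index order.
        return list(pool.map(op, range(len(x))))

z = vector_sum(2.0, [1.0, 2.0, 3.0], 0.5, [4.0, 6.0, 8.0])
print(z)
```

Because no operation reads a value written by another, the result is the same whatever order the pool chooses.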



Sequential assignment

initially
a=1, b=2
assign
a:=b; b:=a

results in a=b=2.

Concurrent assignment

initially
a=1, b=2
assign
a:=b || b:=a

results in a=2, b=1.
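The difference can be reproduced in Python, used here purely as an illustration. Tuple assignment evaluates both right-hand sides before writing anything, which matches the concurrent semantics described above.

```python
# Sequential assignment: the second statement sees the updated a.
a, b = 1, 2
a = b
b = a
sequential = (a, b)

# Concurrent assignment: both right-hand sides are read before any write.
a, b = 1, 2
a, b = b, a
concurrent = (a, b)

print(sequential, concurrent)
```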



A model of a parallel computer

• P processors (nodes); p = 0,1,…,P-1.
• All processors are identical.
• All processors compute sequentially.
• All nodes can communicate with any other node.
• The communication is handled by mechanisms for sending and receiving data at each processor.



Data distribution

• Consider distributing a vector x with M elements x_0,…,x_{M-1} over a collection of P identical computers.

• On each computer define the index set J_p = {0,1,…,I_p-1}, where I_p is the number of indices stored at processor p.

• Assume I_0 + I_1 + … + I_{P-1} = M.

[Figure: the elements of x, labelled from "stored on proc 0" to "stored on proc P-1".]



• A proper data distribution defines a one-to-one mapping μ from a global index m to a local index i on processor p.

• For a global index m, μ(m) = (p,i) gives a unique index i on a unique processor p.

• Similarly, an index i on processor p is mapped to a unique global index m = μ⁻¹(p,i).

• Globally: x = x_0,…,x_{M-1}

• Locally: x_0,…,x_{I_0-1} (proc 0); x_0,…,x_{I_1-1} (proc 1); …; x_0,…,x_{I_{P-1}-1} (proc P-1)
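The slides do not fix a particular mapping. As one possible instance, the sketch below assumes a block distribution, in which each processor holds a contiguous range of global indices. The function names `block_sizes`, `mu` and `mu_inv` are hypothetical.

```python
def block_sizes(M, P):
    """I_p for a block distribution: the first M % P processors get one extra entry."""
    return [M // P + (1 if p < M % P else 0) for p in range(P)]

def mu(m, M, P):
    """Map a global index m to the pair (p, i)."""
    sizes = block_sizes(M, P)
    p = 0
    while m >= sizes[p]:
        m -= sizes[p]   # skip the block owned by processor p
        p += 1
    return p, m

def mu_inv(p, i, M, P):
    """Map local index i on processor p back to the global index m."""
    sizes = block_sizes(M, P)
    return sum(sizes[:p]) + i

print(block_sizes(10, 3), mu(4, 10, 3), mu_inv(2, 2, 10, 3))
```

The round trip `mu_inv(*mu(m, M, P), M, P) == m` holds for every global index, which is the one-to-one property required of a proper distribution.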



• Purpose:
– Derive a multicomputer version of Vector_Sum_1.
• Given
– Length of vectors M.
– Data x_m, y_m, m = 0,1,…,M-1 of real numbers, and two real scalars α and β.
– Number of processors P.
– Set of indices J_p = {0,1,…,I_p-1}, where the number of entries I_p on the p-th processor is given.
– A one-to-one mapping between global and local indices.
• Compute
– z = αx + βy, i.e., z[m] = α x[m] + β y[m] for m = 0,1,…,M-1.



0,…,P-1 || p Program Vector_Sum_2

i: integer;
x,y,z: array[J_p] of real;

<; i: i ∈ J_p :: x[i] = x_{μ⁻¹(p,i)}, y[i] = y_{μ⁻¹(p,i)}>

<|| i: i ∈ J_p :: z[i] = α x[i] + β y[i]>


Notice that we have one program for each processor - all programs being identical.

In each program, the identifier p is known. Also the mapping is assumed to be known.

The result is stored in a distributed vector z.
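A small simulation may make the "one identical program per processor" idea concrete. This is only a sketch: it assumes a cyclic distribution (global index m = i*P + p), which is one valid choice among many, and the names `local_program`, `alpha` and `beta` are illustrative.

```python
def local_program(p, P, alpha, x_global, beta, y_global):
    """What processor p does in Vector_Sum_2 (cyclic distribution assumed)."""
    M = len(x_global)
    I_p = len(range(p, M, P))          # number of entries stored on processor p

    def mu_inv(i):
        return i * P + p               # local index i -> global index m

    # Initialisation: load the local parts of x and y.
    x = [x_global[mu_inv(i)] for i in range(I_p)]
    y = [y_global[mu_inv(i)] for i in range(I_p)]
    # Assignment: every local operation is independent of the others.
    return [alpha * x[i] + beta * y[i] for i in range(I_p)]

P = 3
xg = [float(m) for m in range(7)]
yg = [1.0] * 7
# The same program runs on every processor; only p differs.
z_local = [local_program(p, P, 2.0, xg, 3.0, yg) for p in range(P)]
print(z_local)
```

The result stays distributed: `z_local[p]` holds only the entries owned by processor p, and no processor needed data from another.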



Performance analysis

Let P be the number of processors, and let

T = T(P)

denote the execution time for a program on this multicomputer.

Performance analysis is the study of the properties of T(P).

In order to analyze concurrent algorithms, we have to assume certain properties of the computer. In fact these assumptions are rather strict and thus leave out a lot of existing computers.

On the other hand, without these assumptions the analysis tends to be extremely complicated.




• Let T(1) be the execution time of the fastest possible scalar computation. Then

T(P) ≥ T(1)/P.

This relation bounds how fast a computation can be done on a parallel computer compared with a scalar computer.




• Speed-up: The speed-up of a P-node computation with execution time T(P) is given by

S(P) = T(1)/T(P).

• Efficiency: The efficiency of a P-node computation with speed-up S(P) is given by

η(P) = S(P)/P.
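The two definitions translate directly into code. The sketch below uses made-up timings purely for illustration; the function names are not from the slides.

```python
def speedup(t1, tp):
    """S(P) = T(1) / T(P)."""
    return t1 / tp

def efficiency(t1, tp, P):
    """Efficiency = S(P) / P."""
    return speedup(t1, tp) / P

# Hypothetical measurements: 120 s on one node, 40 s on four nodes.
S = speedup(120.0, 40.0)
E = efficiency(120.0, 40.0, 4)
print(S, E)
```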




• Suppose we are in an optimal situation, i.e., we have

T(P) = T(1)/P.

Then the speed-up is given by

S(P) = T(1)/T(P) = P,

and the efficiency is

η(P) = S(P)/P = 1.



More generally we have T(P) ≥ T(1)/P,

which implies that

S(P) = T(1)/T(P) ≤ P,

η(P) = S(P)/P ≤ 1.

In practical computations we are pleased if we are close to the optimal results. A speed-up close to P, and hence an efficiency close to 1, is very good. Practical details often result in weaker performance than expected from the analysis.



Efficiency modelling

• Goal: estimate how fast a certain algorithm can run on a multicomputer. The models depend on the following parameters:

– A = Arithmetic time; the time of a single arithmetic operation. Integer operations are ignored, and all nodes are assumed equal.

– C(L) = Message exchange time; the time it takes to send a message of length L (in proper units) from one processor to another. We assume that this time is the same for any pair of processors.

– τ_L = Latency; the start-up time for a communication, i.e., the time it takes to send a message of length zero.

– 1/β = Bandwidth; the maximum rate (in proper units) at which messages can be exchanged.



Efficiency modelling

In our efficiency models, we will assume that there is a linear relation between the message exchange time and the length of the message:

C(L) = τ_L + βL,

where τ_L is the latency and 1/β is the bandwidth.
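As a minimal sketch of this linear model, the function below takes the latency and the inverse bandwidth as plain numbers. The parameter names and the sample values are assumptions for illustration, not measurements.

```python
def message_time(length, latency, beta):
    """Linear model: start-up cost plus a per-unit transfer cost.

    latency -- time to send a message of length zero
    beta    -- inverse bandwidth (time per unit of message length)
    """
    return latency + beta * length

# Hypothetical machine: latency 50 time units, 0.5 time units per word.
t_empty = message_time(0, 50.0, 0.5)
t_100 = message_time(100, 50.0, 0.5)
print(t_empty, t_100)
```

For short messages the latency term dominates, which is why sending one number costs far more than one arithmetic operation.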



Analysis of Vector_Sum_2

0,…,P-1 || p Program Vector_Sum_2

i: integer;
x,y,z: array[J_p] of real;

<; i: i ∈ J_p :: x[i] = x_{μ⁻¹(p,i)}, y[i] = y_{μ⁻¹(p,i)}>

<|| i: i ∈ J_p :: z[i] = α x[i] + β y[i]>

Recall that J_p = {0,1,…,I_p-1}, and define I = max_p I_p.

Then a model of the execution time is given by

T(P) = 3 max_p I_p A = 3 I A.

Notice that there are three arithmetic operations (two multiplications and one addition) for each entry of the array.



Load balancing

• We would like to balance the load of the processors, i.e., have each of them perform approximately the same number of operations. (Recall that we assume all processors have the same capacity.)

• In the notation used in the present vector operation, we have load balance if I is as small as possible.

• In the case that M (the number of array entries) is a multiple of P (the number of processors), we have load balance if

I = M/P,

meaning that there are equally many vector entries on each processor.




For this problem the speed-up is

S(P) = T(1)/T(P) = 3MA / 3IA = M/I.

If the problem is load balanced, we have

I = M/P

and thus

S(P) = P

which is optimal.

Notice that we are typically interested in very large values of M, say M = 10^6 to 10^9. The number of processors P is usually below 1000.
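When M is not a multiple of P, the best achievable I is M/P rounded up, and the speed-up M/I falls slightly below P. The sketch below illustrates this under the model above; the function name is hypothetical.

```python
def vector_sum_speedup(M, P):
    """Speed-up M / I of the vector sum with the best possible load balance."""
    I = (M + P - 1) // P    # smallest possible max_p I_p, i.e. ceil(M / P)
    return M / I

balanced = vector_sum_speedup(1000, 10)     # M is a multiple of P
unbalanced = vector_sum_speedup(1001, 10)   # one processor holds an extra entry
print(balanced, unbalanced)
```

For the large M mentioned above, the loss from a single extra entry per processor is negligible.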



The communication cost

In the above example, no communication at all was necessary. In the next example, one real number must be communicated.

This changes the analysis a bit!



The communication cost

• Purpose:
– Derive a multicomputer program for computation of an inner product.
• Given
– Length of vectors M.
– Data x_m, y_m, m = 0,1,…,M-1 of real numbers.
– Number of processors P.
– Set of indices J_p = {0,1,…,I_p-1}, where the number of entries I_p on the p-th processor is given.
– A one-to-one mapping between global and local indices.
• Compute
– σ = (x,y), i.e., σ = x[0] y[0] + x[1] y[1] + … + x[M-1] y[M-1].



Program Inner_Product

0,…,P-1 || p Program Inner_Product

i: integer;
w: array[0,1,…,P-1] of real;
x,y: array[J_p] of real;

<; i: i ∈ J_p :: x[i] = x_{μ⁻¹(p,i)}, y[i] = y_{μ⁻¹(p,i)}>

w[p] = <+ i: i ∈ J_p :: x[i] y[i]>;
send w[p] to all
<; q: 0 ≤ q < P and q ≠ p :: receive w[q] from q>;
σ = <+ q: 0 ≤ q < P :: w[q]>;
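The program can be imitated in a single Python process. This is a sketch only: it assumes a cyclic distribution, represents the all-to-all exchange by simply letting every simulated processor read the full list `w`, and uses the hypothetical name `inner_product`.

```python
def inner_product(x_global, y_global, P):
    """Simulate Inner_Product; returns the value each processor ends up with."""
    M = len(x_global)
    # Step 1: processor p forms its local partial sum w[p]
    # over the global indices p, p+P, p+2P, ... that it owns.
    w = [sum(x_global[m] * y_global[m] for m in range(p, M, P))
         for p in range(P)]
    # Step 2: after the exchange every processor holds all of w
    # and performs the same final summation on its own.
    return [sum(w) for _ in range(P)]

result = inner_product([1, 2, 3, 4], [1, 1, 1, 1], 2)
print(result)
```

Every processor ends with the same value, at the price of P-1 received messages and P-1 redundant additions per processor.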




Performance modelling of Inner_Product

• Recall J_p = {0,1,…,I_p-1} and I = max_p I_p.

• A model of the execution time for Inner_Product is given by

T(P) = (2I-1) A + (P-1) C(1) + (P-1) A

Here the first term arises from the sum of x[i] y[i] over the local i values (I_p multiplications and I_p-1 additions).

The second term arises from the cost of sending one real number from one processor to all others.

The third term arises from summing the P partial results w[q] on each processor (P-1 additions).




Assume I = M/P, i.e., a load-balanced problem.

Assume (as always) P ≪ M, and C(1) = γA (for practical computers γ is quite large, 50-1000).

We then have

T(P) ≈ 2IA + P C(1),

and thus

T(P) ≈ (2M/P + γP) A.
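A short worked step, added here as an illustration: writing γ for the ratio C(1)/A and treating P as a continuous variable, the model has a minimum, beyond which adding processors increases the execution time.

```latex
\begin{align*}
T(P) &\approx \left(\frac{2M}{P} + \gamma P\right) A, \\
\frac{dT}{dP} &= \left(-\frac{2M}{P^{2}} + \gamma\right) A = 0
  \quad\Longrightarrow\quad P^{*} = \sqrt{\frac{2M}{\gamma}}, \\
T(P^{*}) &\approx 2\sqrt{2M\gamma}\, A .
\end{align*}
```

For instance, M = 10^5 with γ = 50 gives P* ≈ 63, while M = 10^7 with γ = 50 gives P* ≈ 632.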



Example I

• Choosing M = 10^5 and γ = C(1)/A = 50, we get

T(P) ≈ (2·10^5/P + 50P) A.

[Figure: plot of T(P) for P up to 1000.]







Example II

• Choosing M = 10^7 and γ = C(1)/A = 50, we get

T(P) ≈ (2·10^7/P + 50P) A.

[Figure: plot of T(P) for P up to 1000.]





For this problem, with γ = C(1)/A, the speed-up is

S(P) = T(1)/T(P) ≈ [(2M + γ) A] / [(2M/P + γP) A]

= P [1 + γ/(2M)] / [1 + γP²/(2M)].

Optimal speed-up is characterized by S(P) ≈ P. For this to be the case we must require

γP²/(2M) ≪ 1.
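The condition can be checked numerically. The sketch below evaluates the approximate model in units of A, using the values M = 10^7 and a ratio of 50 from Example II; the name `gamma` for the ratio C(1)/A and the function name are assumptions of this sketch.

```python
def inner_product_speedup(M, P, gamma):
    """Model speed-up of Inner_Product, with times measured in units of A."""
    t1 = 2 * M + gamma           # model time on one processor
    tp = 2 * M / P + gamma * P   # model time on P processors
    return t1 / tp

M, gamma = 10**7, 50
s_10 = inner_product_speedup(M, 10, gamma)      # gamma*P^2/(2M) = 0.00025
s_1000 = inner_product_speedup(M, 1000, gamma)  # gamma*P^2/(2M) = 2.5
print(s_10, s_1000)
```

With 10 processors the speed-up is nearly optimal, whereas with 1000 processors it falls below 300 because the communication term dominates.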