Parallel Analysis of Algorithms: PRAM + CGM
Outline
Parallel Performance
Parallel Models: Shared Memory (PRAM, SMP), Distributed Memory (BSP, CGM)
Question?
Professor Speedy says he has a parallel algorithm for sorting n arbitrary items in n time using p > 1 processors. Do you believe him?
Performance of a Parallel Algorithm
n : problem size (e.g.: sort n numbers)
p : number of processors
T(p) : parallel time
Ts : sequential time (of the optimal sequential algorithm)
s(p) = Ts / T(p) : speedup (1 ≤ s ≤ p)
[Figure: speedup s vs. p, with super-linear (s > p), linear (s = p), and sub-linear (s < p) regions]
Speedup
linear speedup s(p) = p : optimal
super-linear speedup s(p) > p : impossible
Proof: Assume parallel algorithm A has speedup s > p on p processors, i.e. s = Ts / T > p. Hence Ts > T·p. Simulate A on a sequential, single-processor machine. Then T(1) = T·p < Ts. Hence Ts was not optimal. Contradiction.
Amdahl’s Law
Let f, 0 < f < 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup is s ≤ 1 / [f + (1-f)/p].
Proof: Let Ts be the sequential time. Then T(p) ≥ f·Ts + (1-f)·Ts / p. Hence
s ≤ Ts / [f·Ts + (1-f)·Ts / p] = 1 / [f + (1-f)/p].
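The bound can be checked numerically; a minimal sketch (the function name is illustrative, not from the slides):

```python
def amdahl_speedup(f, p):
    """Maximum obtainable speedup s = 1 / (f + (1 - f) / p)
    for sequential fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

# Even with 1000 processors, a 5% sequential fraction caps speedup below 20.
print(amdahl_speedup(0.05, 1000))
```

As p grows, the bound tends to 1/f, which is why the sequential fraction dominates for large machines.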
Amdahl’s Law
[Figure: (a) one processor: serial section f·ts followed by parallelizable sections (1-f)·ts, total ts; (b) p processors: the parallelizable sections take (1-f)·ts / p, total tp]
Amdahl’s Law
[Figure: execution time vs. number of processors, for P = 1, 5, 10, 1000]
Amdahl’s Law
s(p) ≤ 1 / [f + (1-f)/p]
f → 0 : s(p) → p
f → 1 : s(p) → 1
f = 0.5 : s(p) = 2p/(p+1) ≤ 2
f = 1/k : s(p) = k / [1 + (k-1)/p] ≤ k
[Figure: the speedup bound s as a function of k, for f = 1/k]
Scaled or Relative Speedup
Ts may be unknown (in fact, for most real experiments this is the case).
Relative speedup: s′(p) = T(1) / T(p)
s′(p) ≥ s(p)
Efficiency
e(p) = s(p) / p : efficiency (0 ≤ e ≤ 1)
optimal linear speedup s(p) = p ⇔ e(p) = 1
e′(p) = s′(p) / p : relative efficiency
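These definitions translate directly into code; a small sketch with hypothetical timings (100 s sequential, 16 s on 8 processors are made-up numbers):

```python
def speedup(t_seq, t_par):
    """s(p) = Ts / T(p)."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """e(p) = s(p) / p; equals 1 exactly at linear speedup."""
    return speedup(t_seq, t_par) / p

# Hypothetical measurement: 100 s sequentially, 16 s on p = 8 processors.
print(speedup(100.0, 16.0), efficiency(100.0, 16.0, 8))
```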
Outline
Parallel Analysis of Algorithms
Models: Shared Memory (PRAM, SMP), Distributed Memory (BSP, CGM)
Parallel Random Access Machine (PRAM)
Exclusive-Read (ER) / Concurrent-Read (CR)
Exclusive-Write (EW) / Concurrent-Write (CW)
[Figure: processors 1 … p attached to a shared memory of cells 1 … n]
Shared Memory (PRAM, SMP)
Concurrent-Write (CW) resolution rules:
• Common: all processors must write the same value
• Arbitrary: an arbitrary value “wins”
• Smallest: the smallest value “wins”
• Priority: the processor with the smallest ID number “wins”
Default: CREW (Concurrent Read, Exclusive Write)
p = O(n) : fine grained, massively parallel
Performance of a PRAM Algorithm
Optimal: T = O( Ts / p )
Efficient: T = O( log^k(n) · Ts / p )
NC: T = O( log^k(n) ) for p = polynomial(n)
Example: Multiply n numbers
Input: a1, a2, …, an
Output: a1 * a2 * a3 * … * an
( * : an associative operator)
Algorithm 1
p = n/2 : in each round, pair up the remaining values and combine each pair in parallel; after log n rounds one value (the product) remains.
[Figure: binary combination tree]
Analysis
p = n/2 ⇒ T = O( log n )
Ts = O(n), Ts / p = O(1)
⇒ algorithm is efficient & NC, but not optimal
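Algorithm 1 can be simulated sequentially to check its round structure; a sketch in which each list comprehension stands for one parallel round (function name is illustrative):

```python
def tree_reduce(values, op):
    """Combine adjacent pairs level by level, as the p = n/2
    processors of Algorithm 1 would do, one level per round."""
    vals = list(values)
    while len(vals) > 1:
        # One parallel round: every pair is combined simultaneously.
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # an unpaired element waits for the next round
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a * b))  # prints 40320
```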
Algorithm 2
Make available only p = n / log n processors and execute Algorithm 1 using “rescheduling”: whenever Algorithm 1 has a parallel step where m > n / log n processors are used, simulate this step by a “phase” consisting of m / (n / log n) steps on the n / log n processors.
Analysis
# steps in phase i : (n / 2^i) / (n / log n) = log n / 2^i
T = O( Σ_{1 ≤ i ≤ log n} log n / 2^i ) = O( log n · Σ_{i ≥ 1} 1 / 2^i ) = O( log n )
p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )
⇒ algorithm is efficient & NC & optimal
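The phase bound can be checked by counting simulated steps with integer ceilings; a sketch (the exact constant in front of log n depends on rounding, but stays small):

```python
import math

def rescheduled_steps(n):
    """Total steps to run Algorithm 1's log n rounds on p = n / log n
    processors: round i uses n / 2^i virtual processors, simulated in
    ceil((n / 2^i) / p) steps."""
    rounds = int(math.log2(n))
    p = n // rounds                      # p = n / log n (integer version)
    return sum(-(-(n >> i) // p)         # integer ceiling of (n / 2^i) / p
               for i in range(1, rounds + 1))

# Total steps grow like c * log n, confirming T = O(log n) with optimal work.
print(rescheduled_steps(1024), rescheduled_steps(2 ** 16))
```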
Problem 2: List Ranking
Input: a linked list represented by an array P, where P(i) is the successor of node i (and P(i) = i for the last node).
Output: the distance D(i) of each node to the last node.

Algorithm: Pointer Jumping
Assign processor i to node i.
Initialize (all processors i in parallel):
D(i) := 0 if P(i) = i, 1 otherwise
REPEAT log n TIMES (all processors i in parallel):
D(i) := D(i) + D(P(i))
P(i) := P(P(i))
Analysis
p = n, T = O( log n )
⇒ efficient & NC, but not optimal
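The pointer jumping loop can be simulated with synchronous array updates, all reads happening before any write, matching the PRAM semantics; a sketch assuming the last node points to itself, as in the initialization above:

```python
import math

def list_rank(P):
    """P[i] = successor of node i; the last node satisfies P[i] == i.
    Returns D with D[i] = distance from node i to the last node."""
    n = len(P)
    P = list(P)
    D = [0 if P[i] == i else 1 for i in range(n)]
    for _ in range(math.ceil(math.log2(n)) if n > 1 else 0):
        # One synchronous PRAM step: both updates read the OLD D and P.
        D, P = ([D[i] + D[P[i]] for i in range(n)],
                [P[P[i]] for i in range(n)])
    return D

print(list_rank([2, 0, 4, 1, 4]))  # the list 3 -> 1 -> 0 -> 2 -> 4
```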
Problem 3: Partial Sums
Input: a1, a2, …, an
Output: a1 , a1 + a2 , a1 + a2 + a3 , … , a1 + a2 + … + an
Parallel Recursion
Compute (in parallel): a1 + a2 , a3 + a4 , a5 + a6 , … , an-1 + an
Recursively (all processors together) solve the problem for the n/2 numbers a1 + a2 , a3 + a4 , … , an-1 + an
The result is: (a1+a2), (a1+a2+a3+a4), (a1+…+a6), … , (a1+…+an-2), (a1+…+an), i.e. the partial sums ending at even positions.
Compute each gap (the partial sum ending at an odd position) by combining its predecessor with a single input element.
Analysis
p = n, T(n) = T(n/2) + O(1), T(1) = O(1) ⇒ T(n) = O( log n )
⇒ efficient & NC, but not optimal
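The recursion can be written out directly, simulating the parallel steps sequentially; a sketch for a generic associative operator:

```python
def partial_sums(a, op=lambda x, y: x + y):
    """Prefix computation following the parallel recursion above:
    pair up neighbours, recurse on the n/2 pair results (which yields the
    prefixes ending at even positions), then fill each odd-position gap
    from its predecessor and one input element."""
    n = len(a)
    if n == 1:
        return list(a)
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    sub = partial_sums(pairs, op)          # prefixes ending at positions 2, 4, ...
    out = [None] * n
    out[0] = a[0]
    for i, s in enumerate(sub):
        out[2 * i + 1] = s                 # even-position prefix from recursion
        if 2 * i + 2 < n:
            out[2 * i + 2] = op(s, a[2 * i + 2])   # fill the gap
    return out

print(partial_sums([1, 2, 3, 4, 5]))  # -> [1, 3, 6, 10, 15]
```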
Improving through rescheduling
Set p = n / log n and simulate the previous algorithm.

Analysis
# steps in phase i : (n / 2^i) / (n / log n) = log n / 2^i
T = O( Σ_{1 ≤ i ≤ log n} log n / 2^i ) = O( log n )
p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )
⇒ algorithm is efficient & NC & optimal
Problem 4: Sorting
Input: a1, a2, …, an
Output: a1, a2, …, an permuted into sorted order
Unimodal sequence: 9 10 13 17 21 19 16 15
Bitonic sequence: a cyclic shift of a unimodal sequence, e.g. 16 15 9 10 13 17 21 19
Bitonic Sorting (Batcher)
Properties of bitonic sequences
X = x1 x2 … xn xn+1 xn+2 … x2n bitonic
L(X) = y1 … yn , yi = min {xi, xn+i}
U(X) = z1 … zn , zi = max {xi, xn+i}
(1) L(X) and U(X) are bitonic
(2) every element of L(X) is smaller than every element of U(X)
Bitonic Merge: sorting a bitonic sequence
a bitonic sequence of length n can be sorted in time O(log n) using p=n processors
Sorting an arbitrary sequence a1, a2, …, an:
• split a1, …, an into two sub-sequences a1, …, an/2 and an/2+1, …, an
• recursively, in parallel, sort each sub-sequence using p/2 processors
• merge the two sorted sub-sequences into one sorted sequence using bitonic merge
Note: if X and Y are sorted sequences (increasing order), then X Y^R (Y reversed) is a bitonic sequence.
Analysis
p = n, T(n) = T(n/2) + O( log n ), T(1) = O(1) ⇒ T(n) = O( log² n )
⇒ efficient & NC, but not optimal
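The min/max split property (1)–(2) and the X Y^R construction translate directly into a sequential simulation; a sketch assuming the input length is a power of two:

```python
def bitonic_merge(x):
    """Sort a bitonic sequence of length 2^k:
    split into L(X) and U(X), then recurse on both halves."""
    n = len(x)
    if n == 1:
        return list(x)
    h = n // 2
    lo = [min(x[i], x[i + h]) for i in range(h)]   # L(X): bitonic
    hi = [max(x[i], x[i + h]) for i in range(h)]   # U(X): bitonic, all >= L(X)
    return bitonic_merge(lo) + bitonic_merge(hi)

def bitonic_sort(a):
    """Batcher's scheme: sort both halves, reverse one so that
    X + Y^R is bitonic, then apply bitonic merge."""
    if len(a) == 1:
        return list(a)
    h = len(a) // 2
    return bitonic_merge(bitonic_sort(a[:h]) + bitonic_sort(a[h:])[::-1])

print(bitonic_merge([16, 15, 9, 10, 13, 17, 21, 19]))  # the bitonic example above
```

Each level of `bitonic_merge` is one parallel compare-exchange round, giving the O(log n) merge and O(log² n) sort stated above.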
So what about an SMP machine?
PRAM? EREW? CREW? CRCW? How does OpenMP play into this?
OpenMP/SMP
= CREW PRAM, but coarse grained
T(p) ≥ f·Ts + (1-f)·Ts / p, where f = the sequential fraction
T(n,p) = f·Ts + the sum, over all parallel regions, of the maximum thread time in each fork
[Figure: master thread forking into parallel regions and joining]
Outline
Parallel Analysis of Algorithms
Models: Shared Memory (PRAM, SMP), Distributed Memory (BSP, CGM)
Distributed Memory Models
Parallel Computing
p : number of processors
n : problem size
Ts(n) : sequential time
T(p,n) : parallel time
speedup: S(p,n) = Ts(n) / T(p,n)
Goal: obtain linear speedup S(p,n) = p
Parallel Computers
Examples: Beowulf cluster, Blue Gene/Q, Cray XK7, custom MPP (Tianhe-2)
Parallel Machine Models
How to abstract the machine into a simplified model such that:
• algorithm/application design is not hampered by too many details
• calculated time complexity predictions match the actually observed running times (with sufficient accuracy)
Parallel Machine Models
• PRAM
• Fine grained networks (array, ring, mesh, hypercube)
• Bulk Synchronous Parallelism (BSP), Valiant, 1990
• Coarse Grained Multicomputer (CGM), Dehne, Rau-Chaplin, 1993
• Multithread (CILK), Leiserson, 1995
• many more…
PRAM
p = O(n) processors, massively parallel
[Figure: processors attached to a shared memory of n cells]
Example: PRAM Sort
Sort by repeated list merges:
Bitonic Sort: O(log n) per merge ⇒ O(log² n)
Cole’s merge sort: O(1) per merge ⇒ O(log n)
Fine Grained Networks
p = O(n) processors, massively parallel
[Figure: a mesh of processors]
Example: Mesh Sort
O(n^{1/2}) time, by recursive sub-mesh merges
Back to reality…
Would anyone use a parallel machine with n processors in order to sort n items? Of course NOT…
Typical parallel machines have large ratios n/p (e.g. n/p = 16M).
Brent’s Theorem
Mapping: fine grained ⇒ coarse grained, via virtual processors.
If we simulate n virtual processors on p real processors, then S(p) = S(n) · p/n.
S(n) = O(n) “optimal” ⇒ S(p) = O(p) “optimal”
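A quick numeric check of why a sub-optimal S(n) ruins the simulated S(p); the numbers n = 10^6 and p = 64 are illustrative:

```python
import math

def simulated_speedup(s_n, n, p):
    """Brent-style simulation of n virtual processors on p real ones:
    S(p) = S(n) * p / n."""
    return s_n * p / n

n, p = 10 ** 6, 64
optimal = simulated_speedup(n, n, p)   # S(n) = n (optimal)  =>  S(p) = p
# Mesh sort: T = O(sqrt(n)), Ts = n log n  =>  S(n) = n log n / sqrt(n)
mesh = simulated_speedup(math.sqrt(n) * math.log2(n), n, p)
print(optimal, mesh)   # the mesh-sort speedup on 64 processors is barely above 1
```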
The Problem!
• Fine grained PRAM and fixed network algorithms are VERY slow when implemented on commercial parallel machines.
Why?
[Figure: observed speedup S(p) flattens far below the predicted S(p) = p long before p = n]
Why?
The assumption is not true: in most cases, S(n) is NOT optimal.
[Figure: for mesh sort, S(n) = n log n / n^{1/2}, far below the optimal S(n) = n]
CGM: Coarse Grained Multicomputer (Dehne, Rau-Chaplin, 1993)
[Figure: CGM targets linear speedup S(p) = p in the realistic range p << n]
CGM
• Coarse grained memory
• Coarse grained computation
• Coarse grained communication
Coarse Grained Memory
Ignore small n/p; e.g. assume n/p > p.
[Figure: p processors, each with a local memory of n/p items and a communication unit, connected by a network or shared memory]
Coarse Grained Computation
Compute in supersteps with barrier synchronization (as in BSP).
[Figure: processors advancing through rounds 1, 2, 3 over time]
Coarse Grained Communication
• All communication steps are h-relations, h = O(n/p)
• No individual messages
[Figure: an h-relation between consecutive supersteps]
h-Relation
A communication round in which each processor sends and receives at most h = O(n/p) data items in total.
[Figure: processors exchanging an h-relation through their communication units]
CGM
• Complexity measures:
– number of rounds (e.g. O(1), O(log p), …)
– scalability (e.g. n/p > p)
– local computation
– communication volume
CGM
• Coarse grained memory
• Coarse grained computation
• Coarse grained communication
⇒ practical parallel algorithms, efficient and portable
Det. Sample Sort (CGM Algorithm):
• each processor sorts its n/p items locally and creates a p-sample
[Figure: each processor holds its sorted data plus a p-sample]
• send all p-samples to processor 1
• processor 1: sort all received samples and compute a global p-sample
• broadcast the global p-sample
• bucket locally according to the global p-sample
• send bucket i to processor i
• resort locally
Analysis:
• O(1) rounds, for n/p > p²
• O( (n/p) log n ) local computation
• Goodrich (FOCS’98): O(1) rounds for n/p > p^ε
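The communication rounds above can be simulated sequentially on p virtual processors; a sketch in which the splitter selection is simplified and n/p ≥ p is assumed (function name is illustrative):

```python
import bisect

def cgm_sample_sort(data, p):
    """Simulate the deterministic sample sort rounds on p virtual
    processors. Returns the p sorted buckets; their concatenation
    is the sorted input."""
    n = len(data)
    chunk = n // p
    # Round 1: each processor sorts its n/p items and picks a p-sample.
    local = [sorted(data[i * chunk: n if i == p - 1 else (i + 1) * chunk])
             for i in range(p)]
    samples = []
    for part in local:
        step = max(1, len(part) // p)
        samples.extend(part[::step][:p])
    # Rounds 2-3: "processor 1" sorts all p*p samples and picks
    # p-1 global splitters, which are then broadcast.
    samples.sort()
    step = max(1, len(samples) // p)
    splitters = samples[step::step][:p - 1]
    # Round 4: bucket locally by the splitters, route bucket i to
    # processor i, and resort locally.
    buckets = [[] for _ in range(p)]
    for part in local:
        for x in part:
            buckets[bisect.bisect_left(splitters, x)].append(x)
    return [sorted(b) for b in buckets]
```

Each loop over `local` stands for work done independently on every processor; the single `samples.sort()` is the step concentrated on processor 1.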