Parallel Analysis of Algorithms: PRAM + CGM
Outline
Parallel Performance
Parallel Models: Shared Memory (PRAM, SMP), Distributed Memory (BSP, CGM)
Question?
Professor Speedy says he has a parallel algorithm for sorting n arbitrary items in n time using p > 1 processors. Do you believe him?
Performance of a Parallel Algorithm
n : problem size (e.g.: sort n numbers)
p : number of processors
T(p) : parallel time
Ts : sequential time (of the optimal sequential algorithm)
s(p) = Ts / T(p) : speedup (1 ≤ s ≤ p)
[Figure: speedup s vs. p, with super-linear (s > p), linear (s = p), and sub-linear (s < p) regions]
Speedup
linear speedup s(p) = p : optimal
super-linear speedup s(p) > p : impossible
Proof: Assume parallel algorithm A has speedup s > p on p processors, i.e. s = Ts / T > p. Hence Ts > T·p. Simulate A on a sequential, single-processor machine. Then T(1) = T·p < Ts. Hence Ts was not optimal. Contradiction.
Amdahl’s Law
Let f, 0 < f < 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup is s ≤ 1 / [f + (1-f)/p].
Proof: Let Ts be the sequential time. Then T(p) ≥ f·Ts + (1-f)·Ts / p. Hence
s ≤ Ts / [f·Ts + (1-f)·Ts / p] = 1 / [f + (1-f)/p].
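The bound can be checked numerically; a minimal sketch (the function name is illustrative, not from the slides):

```python
def amdahl_speedup(f, p):
    """Maximum obtainable speedup s = 1 / (f + (1 - f) / p)
    for sequential fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

# Even with 1000 processors, a 5% sequential fraction caps speedup below 20.
print(amdahl_speedup(0.05, 1000))
```

As p grows, the bound tends to 1/f, which is why the sequential fraction dominates for large machines.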
Amdahl’s Law
[Figure: (a) one processor: serial section f·ts followed by parallelizable sections (1-f)·ts, total ts; (b) p processors: the parallelizable sections take (1-f)·ts / p, total tp]
Amdahl’s Law
[Figure: execution time vs. number of processors, for P = 1, 5, 10, 1000]
Amdahl’s Law
s(p) ≤ 1 / [f + (1-f)/p]
f → 0 : s(p) → p
f → 1 : s(p) → 1
f = 0.5 : s(p) = 2p/(p+1) ≤ 2
f = 1/k : s(p) = k / [1 + (k-1)/p] ≤ k
[Figure: the speedup bound s as a function of k, for f = 1/k]
Scaled or Relative Speedup
Ts may be unknown (in fact, for most real experiments this is the case).
Relative speedup: s′(p) = T(1) / T(p)
s′(p) ≥ s(p)
Efficiency
e(p) = s(p) / p : efficiency (0 ≤ e ≤ 1)
optimal linear speedup s(p) = p ⇔ e(p) = 1
e′(p) = s′(p) / p : relative efficiency
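These definitions translate directly into code; a small sketch with hypothetical timings (100 s sequential, 16 s on 8 processors are made-up numbers):

```python
def speedup(t_seq, t_par):
    """s(p) = Ts / T(p)."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """e(p) = s(p) / p; equals 1 exactly at linear speedup."""
    return speedup(t_seq, t_par) / p

# Hypothetical measurement: 100 s sequentially, 16 s on p = 8 processors.
print(speedup(100.0, 16.0), efficiency(100.0, 16.0, 8))
```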
Outline
Parallel Analysis of Algorithms
Models: Shared Memory (PRAM, SMP), Distributed Memory (BSP, CGM)
Parallel Random Access Machine (PRAM)
Exclusive-Read (ER) / Concurrent-Read (CR)
Exclusive-Write (EW) / Concurrent-Write (CW)
[Figure: processors 1 … p attached to a shared memory of cells 1 … n]
Shared Memory (PRAM, SMP)
Concurrent-Write (CW) resolution rules:
• Common: all processors must write the same value
• Arbitrary: an arbitrary value “wins”
• Smallest: the smallest value “wins”
• Priority: the processor with the smallest ID number “wins”
Default: CREW (Concurrent Read, Exclusive Write)
p = O(n) : fine grained, massively parallel
Performance of a PRAM Algorithm
Optimal: T = O( Ts / p )
Efficient: T = O( log^k(n) · Ts / p )
NC: T = O( log^k(n) ) for p = polynomial(n)
Example: Multiply n numbers
Input: a1, a2, …, an
Output: a1 * a2 * a3 * … * an
( * : an associative operator)
Algorithm 1
p = n/2 : in each round, pair up the remaining values and combine each pair in parallel; after log n rounds one value (the product) remains.
[Figure: binary combination tree]
Analysis
p = n/2 ⇒ T = O( log n )
Ts = O(n), Ts / p = O(1)
⇒ algorithm is efficient & NC, but not optimal
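Algorithm 1 can be simulated sequentially to check its round structure; a sketch in which each list comprehension stands for one parallel round (function name is illustrative):

```python
def tree_reduce(values, op):
    """Combine adjacent pairs level by level, as the p = n/2
    processors of Algorithm 1 would do, one level per round."""
    vals = list(values)
    while len(vals) > 1:
        # One parallel round: every pair is combined simultaneously.
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # an unpaired element waits for the next round
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a * b))  # prints 40320
```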
Algorithm 2
Make available only p = n / log n processors and execute Algorithm 1 using “rescheduling”: whenever Algorithm 1 has a parallel step where m > n / log n processors are used, simulate this step by a “phase” consisting of m / (n / log n) steps on the n / log n processors.
Analysis
# steps in phase i : (n / 2^i) / (n / log n) = log n / 2^i
T = O( Σ_{1 ≤ i ≤ log n} log n / 2^i ) = O( log n · Σ_{i ≥ 1} 1 / 2^i ) = O( log n )
p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )
⇒ algorithm is efficient & NC & optimal
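The phase bound can be checked by counting simulated steps with integer ceilings; a sketch (the exact constant in front of log n depends on rounding, but stays small):

```python
import math

def rescheduled_steps(n):
    """Total steps to run Algorithm 1's log n rounds on p = n / log n
    processors: round i uses n / 2^i virtual processors, simulated in
    ceil((n / 2^i) / p) steps."""
    rounds = int(math.log2(n))
    p = n // rounds                      # p = n / log n (integer version)
    return sum(-(-(n >> i) // p)         # integer ceiling of (n / 2^i) / p
               for i in range(1, rounds + 1))

# Total steps grow like c * log n, confirming T = O(log n) with optimal work.
print(rescheduled_steps(1024), rescheduled_steps(2 ** 16))
```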
Problem 2: List Ranking
Input: a linked list represented by an array P, where P(i) is the successor of node i (and P(i) = i for the last node).
Output: the distance D(i) of each node to the last node.

Algorithm: Pointer Jumping
Assign processor i to node i.
Initialize (all processors i in parallel):
D(i) := 0 if P(i) = i, 1 otherwise
REPEAT log n TIMES (all processors i in parallel):
D(i) := D(i) + D(P(i))
P(i) := P(P(i))
Analysis
p = n, T = O( log n )
⇒ efficient & NC, but not optimal
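The pointer jumping loop can be simulated with synchronous array updates, all reads happening before any write, matching the PRAM semantics; a sketch assuming the last node points to itself, as in the initialization above:

```python
import math

def list_rank(P):
    """P[i] = successor of node i; the last node satisfies P[i] == i.
    Returns D with D[i] = distance from node i to the last node."""
    n = len(P)
    P = list(P)
    D = [0 if P[i] == i else 1 for i in range(n)]
    for _ in range(math.ceil(math.log2(n)) if n > 1 else 0):
        # One synchronous PRAM step: both updates read the OLD D and P.
        D, P = ([D[i] + D[P[i]] for i in range(n)],
                [P[P[i]] for i in range(n)])
    return D

print(list_rank([2, 0, 4, 1, 4]))  # the list 3 -> 1 -> 0 -> 2 -> 4
```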
Problem 3: Partial Sums
Input: a1, a2, …, an
Output: a1 , a1 + a2 , a1 + a2 + a3 , … , a1 + a2 + … + an
Parallel Recursion
Compute (in parallel): a1 + a2 , a3 + a4 , a5 + a6 , … , an-1 + an
Recursively (all processors together) solve the problem for the n/2 numbers a1 + a2 , a3 + a4 , … , an-1 + an
The result is: (a1+a2), (a1+a2+a3+a4), (a1+…+a6), … , (a1+…+an-2), (a1+…+an), i.e. the partial sums ending at even positions.
Compute each gap (the partial sum ending at an odd position) by combining its predecessor with a single input element.
Analysis
p = n, T(n) = T(n/2) + O(1), T(1) = O(1) ⇒ T(n) = O( log n )
⇒ efficient & NC, but not optimal
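The recursion can be written out directly, simulating the parallel steps sequentially; a sketch for a generic associative operator:

```python
def partial_sums(a, op=lambda x, y: x + y):
    """Prefix computation following the parallel recursion above:
    pair up neighbours, recurse on the n/2 pair results (which yields the
    prefixes ending at even positions), then fill each odd-position gap
    from its predecessor and one input element."""
    n = len(a)
    if n == 1:
        return list(a)
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    sub = partial_sums(pairs, op)          # prefixes ending at positions 2, 4, ...
    out = [None] * n
    out[0] = a[0]
    for i, s in enumerate(sub):
        out[2 * i + 1] = s                 # even-position prefix from recursion
        if 2 * i + 2 < n:
            out[2 * i + 2] = op(s, a[2 * i + 2])   # fill the gap
    return out

print(partial_sums([1, 2, 3, 4, 5]))  # -> [1, 3, 6, 10, 15]
```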
Improving through rescheduling
Set p = n / log n and simulate the previous algorithm.

Analysis
# steps in phase i : (n / 2^i) / (n / log n) = log n / 2^i
T = O( Σ_{1 ≤ i ≤ log n} log n / 2^i ) = O( log n )
p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )
⇒ algorithm is efficient & NC & optimal
Problem 4: Sorting
Input: a1, a2, …, an
Output: a1, a2, …, an permuted into sorted order
Unimodal sequence: 9 10 13 17 21 19 16 15
Bitonic sequence: a cyclic shift of a unimodal sequence, e.g. 16 15 9 10 13 17 21 19
Bitonic Sorting (Batcher)
Properties of bitonic sequences
X = x1 x2 … xn xn+1 xn+2 … x2n bitonic
L(X) = y1 … yn , yi = min {xi, xn+i}
U(X) = z1 … zn , zi = max {xi, xn+i}
(1) L(X) and U(X) are bitonic
(2) every element of L(X) is smaller than every element of U(X)
Bitonic Merge: sorting a bitonic sequence
a bitonic sequence of length n can be sorted in time O(log n) using p=n processors
Sorting an arbitrary sequence a1, a2, …, an:
• split a1, …, an into two sub-sequences a1, …, an/2 and an/2+1, …, an
• recursively, in parallel, sort each sub-sequence using p/2 processors
• merge the two sorted sub-sequences into one sorted sequence using bitonic merge
Note: if X and Y are sorted sequences (increasing order), then X Y^R (Y reversed) is a bitonic sequence.
Analysis
p = n, T(n) = T(n/2) + O( log n ), T(1) = O(1) ⇒ T(n) = O( log² n )
⇒ efficient & NC, but not optimal
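The min/max split property (1)–(2) and the X Y^R construction translate directly into a sequential simulation; a sketch assuming the input length is a power of two:

```python
def bitonic_merge(x):
    """Sort a bitonic sequence of length 2^k:
    split into L(X) and U(X), then recurse on both halves."""
    n = len(x)
    if n == 1:
        return list(x)
    h = n // 2
    lo = [min(x[i], x[i + h]) for i in range(h)]   # L(X): bitonic
    hi = [max(x[i], x[i + h]) for i in range(h)]   # U(X): bitonic, all >= L(X)
    return bitonic_merge(lo) + bitonic_merge(hi)

def bitonic_sort(a):
    """Batcher's scheme: sort both halves, reverse one so that
    X + Y^R is bitonic, then apply bitonic merge."""
    if len(a) == 1:
        return list(a)
    h = len(a) // 2
    return bitonic_merge(bitonic_sort(a[:h]) + bitonic_sort(a[h:])[::-1])

print(bitonic_merge([16, 15, 9, 10, 13, 17, 21, 19]))  # the bitonic example above
```

Each level of `bitonic_merge` is one parallel compare-exchange round, giving the O(log n) merge and O(log² n) sort stated above.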
So what about an SMP machine?
PRAM? EREW? CREW? CRCW? How does OpenMP play into this?
OpenMP/SMP
= CREW PRAM, but coarse grained
T(p) ≥ f·Ts + (1-f)·Ts / p, where f = the sequential fraction
T(n,p) = f·Ts + the sum, over all parallel regions, of the maximum thread time in each fork
[Figure: master thread forking into parallel regions and joining]
Outline
Parallel Analysis of Algorithms
Models: Shared Memory (PRAM, SMP), Distributed Memory (BSP, CGM)
Distributed Memory Models
Parallel Computing
p : number of processors
n : problem size
Ts(n) : sequential time
T(p,n) : parallel time
speedup: S(p,n) = Ts(n) / T(p,n)
Goal: obtain linear speedup S(p,n) = p
Parallel Computers
Examples: Beowulf cluster, Blue Gene/Q, Cray XK7, custom MPP (Tianhe-2)
Parallel Machine Models
How to abstract the machine into a simplified model such that:
• algorithm/application design is not hampered by too many details
• calculated time complexity predictions match the actually observed running times (with sufficient accuracy)
Parallel Machine Models
• PRAM
• Fine grained networks (array, ring, mesh, hypercube)
• Bulk Synchronous Parallelism (BSP), Valiant, 1990
• Coarse Grained Multicomputer (CGM), Dehne, Rau-Chaplin, 1993
• Multithread (CILK), Leiserson, 1995
• many more…
PRAM
p = O(n) processors, massively parallel
[Figure: processors attached to a shared memory of n cells]
Example: PRAM Sort
Sort by repeated list merges:
Bitonic Sort: O(log n) per merge ⇒ O(log² n)
Cole’s merge sort: O(1) per merge ⇒ O(log n)
Fine Grained Networks
p = O(n) processors, massively parallel
[Figure: a mesh of processors]
Example: Mesh Sort
O(n^{1/2}) time, by recursive sub-mesh merges
Back to reality…
Would anyone use a parallel machine with n processors in order to sort n items? Of course NOT…
Typical parallel machines have large ratios n/p (e.g. n/p = 16M).
Brent’s Theorem
Mapping: fine grained ⇒ coarse grained, via virtual processors.
If we simulate n virtual processors on p real processors, then S(p) = S(n) · p/n.
S(n) = O(n) “optimal” ⇒ S(p) = O(p) “optimal”
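A quick numeric check of why a sub-optimal S(n) ruins the simulated S(p); the numbers n = 10^6 and p = 64 are illustrative:

```python
import math

def simulated_speedup(s_n, n, p):
    """Brent-style simulation of n virtual processors on p real ones:
    S(p) = S(n) * p / n."""
    return s_n * p / n

n, p = 10 ** 6, 64
optimal = simulated_speedup(n, n, p)   # S(n) = n (optimal)  =>  S(p) = p
# Mesh sort: T = O(sqrt(n)), Ts = n log n  =>  S(n) = n log n / sqrt(n)
mesh = simulated_speedup(math.sqrt(n) * math.log2(n), n, p)
print(optimal, mesh)   # the mesh-sort speedup on 64 processors is barely above 1
```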
The Problem!
• Fine grained PRAM and fixed network algorithms are VERY slow when implemented on commercial parallel machines.
Why?
[Figure: observed speedup S(p) flattens far below the predicted S(p) = p long before p = n]
Why?
The assumption is not true: in most cases, S(n) is NOT optimal.
[Figure: for mesh sort, S(n) = n log n / n^{1/2}, far below the optimal S(n) = n]
CGM: Coarse Grained Multicomputer (Dehne, Rau-Chaplin, 1993)
[Figure: CGM targets linear speedup S(p) = p in the realistic range p << n]
CGM
• Coarse grained memory
• Coarse grained computation
• Coarse grained communication
Coarse Grained Memory
Ignore small n/p; e.g. assume n/p > p.
[Figure: p processors, each with a local memory of n/p items and a communication unit, connected by a network or shared memory]
Coarse Grained Computation
Compute in supersteps with barrier synchronization (as in BSP).
[Figure: processors advancing through rounds 1, 2, 3 over time]
Coarse Grained Communication
• All communication steps are h-relations, h = O(n/p)
• No individual messages
[Figure: an h-relation between consecutive supersteps]
h-Relation
A communication round in which each processor sends and receives at most h = O(n/p) data items in total.
[Figure: processors exchanging an h-relation through their communication units]
CGM
• Complexity measures:
– number of rounds (e.g. O(1), O(log p), …)
– scalability (e.g. n/p > p)
– local computation
– communication volume
CGM
• Coarse grained memory
• Coarse grained computation
• Coarse grained communication
⇒ practical parallel algorithms, efficient and portable
Det. Sample Sort (CGM Algorithm):
• each processor sorts its n/p items locally and creates a p-sample
[Figure: each processor holds its sorted data plus a p-sample]
• send all p-samples to processor 1
• processor 1: sort all received samples and compute a global p-sample
• broadcast the global p-sample
• bucket locally according to the global p-sample
• send bucket i to processor i
• resort locally
Analysis:
• O(1) rounds, for n/p > p²
• O( (n/p) log n ) local computation
• Goodrich (FOCS’98): O(1) rounds for n/p > p^ε
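The communication rounds above can be simulated sequentially on p virtual processors; a sketch in which the splitter selection is simplified and n/p ≥ p is assumed (function name is illustrative):

```python
import bisect

def cgm_sample_sort(data, p):
    """Simulate the deterministic sample sort rounds on p virtual
    processors. Returns the p sorted buckets; their concatenation
    is the sorted input."""
    n = len(data)
    chunk = n // p
    # Round 1: each processor sorts its n/p items and picks a p-sample.
    local = [sorted(data[i * chunk: n if i == p - 1 else (i + 1) * chunk])
             for i in range(p)]
    samples = []
    for part in local:
        step = max(1, len(part) // p)
        samples.extend(part[::step][:p])
    # Rounds 2-3: "processor 1" sorts all p*p samples and picks
    # p-1 global splitters, which are then broadcast.
    samples.sort()
    step = max(1, len(samples) // p)
    splitters = samples[step::step][:p - 1]
    # Round 4: bucket locally by the splitters, route bucket i to
    # processor i, and resort locally.
    buckets = [[] for _ in range(p)]
    for part in local:
        for x in part:
            buckets[bisect.bisect_left(splitters, x)].append(x)
    return [sorted(b) for b in buckets]
```

Each loop over `local` stands for work done independently on every processor; the single `samples.sort()` is the step concentrated on processor 1.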