Parallel and Distributed Algorithms
Eric Vidal
Reference: R. Johnsonbaugh and M. Schaefer, Algorithms (International Edition). Pearson Education, 2004.
Outline
• Introduction (case study: maximum element)
  – Work-optimality
• The Parallel Random Access Machine
  – Shared memory modes
  – Accelerated cascading
• Other Parallel Architectures (case study: sorting)
  – Circuits
  – Linear processor networks
  – (Mesh processor networks)
• Distributed Algorithms
  – Message-optimality
  – Broadcast and echo
  – (Leader election)
Introduction
Why use parallelism?
• p steps on 1 printer, 1 step on p printers
• p = speed-up factor (best case)
• Given a sequential algorithm, how can we parallelize it?
  – Some are inherently sequential (P-complete)
Case Study: Maximum Element
In: a[]
Out: maximum element in a

sequential_maximum(a) {
  n = a.length
  max = a[0]
  for i = 1 to n – 1 {
    if (a[i] > max)
      max = a[i]
  }
  return max
}
a = 21 11 23 17 48 33 22 41
Running maximum after each comparison: 21, 23, 23, 48, 48, 48, 48
O(n)
Parallel Maximum
• Idea: Use ⌈n / 2⌉ processors
• Note idle processors after the first step!
Pairwise reduction: (21 11 23 17 48 33 22 41) → (21 23 48 41) → (23 48) → (48)
O(lg n)
Work-Optimality
• Work = number of algorithmic steps × number of processors
• Work of the parallelized maximum algo = O(lg n) steps × (n / 2) processors = O(n lg n)
• Not work-optimal! The sequential algo's work is only O(n)
  – Workaround: accelerated cascading…
Formal Algorithm for Parallel Maximum
• But first!...
The Parallel Random Access Machine
The Parallel Random Access Machine (PRAM)
• New construct: parallel loop
for i = 1 to n in parallel { … }
• Assumption 1: use n processors to execute this loop (processors are synchronized)
• Assumption 2: memory shared across all processors
Example: Parallel Search
In: a[], x
Out: true if x is in a, false otherwise

parallel_search(a, x) {
  n = a.length
  found = false
  for i = 0 to n – 1 in parallel {
    if (a[i] == x)
      found = true
  }
  return found
}
Is this work-optimal?
Shared memory modes:
• Exclusive Read (ER)
• Concurrent Read (CR)
• Exclusive Write (EW)
• Concurrent Write (CW)
Real-world systems are most commonly CREW
parallel_search runs on what type?
Formal Algorithm for Parallel Maximum
In: a[]Out: maximum element in a
parallel_maximum(a) {
  n = a.length
  for i = 0 to ⌈lg n⌉ – 1 {
    for j = 0 to ⌈n / 2^(i+1)⌉ – 1 in parallel {
      if (j × 2^(i+1) + 2^i < n) // boundary check
        a[j × 2^(i+1)] = max(a[j × 2^(i+1)], a[j × 2^(i+1) + 2^i])
    }
  }
  return a[0]
}
Theorem: parallel_maximum is CREW and finds the maximum element in parallel time O(lg n) and work O(n lg n)
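The doubling reduction can be sanity-checked on an ordinary machine. The sketch below simulates the pseudocode sequentially in Python; the inner loop stands in for the `in parallel` loop (on a real PRAM its iterations would run simultaneously):

```python
import math

def parallel_maximum(a):
    """Sequential simulation of the CREW PRAM doubling reduction."""
    a = list(a)                          # work on a copy
    n = len(a)
    for i in range(math.ceil(math.log2(n)) if n > 1 else 0):
        # conceptually, all j iterations run in parallel
        for j in range(math.ceil(n / 2 ** (i + 1))):
            lo = j * 2 ** (i + 1)
            hi = lo + 2 ** i
            if hi < n:                   # boundary check
                a[lo] = max(a[lo], a[hi])
    return a[0]

print(parallel_maximum([21, 11, 23, 17, 48, 33, 22, 41]))  # 48
```

After round i, every position that is a multiple of 2^(i+1) holds the maximum of its 2^(i+1)-element block, so after ⌈lg n⌉ rounds a[0] holds the overall maximum.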
Accelerated Cascading
• Phase 1: Use sequential_maximum on blocks of lg n elements
  – We use n / lg n processors
  – O(lg n) sequential steps per processor
  – Total work = O(lg n) steps × (n / lg n) processors = O(n)
• Phase 2: Use parallel_maximum on the resulting n / lg n elements
  – lg (n / lg n) parallel steps = lg n – lg lg n = O(lg n)
  – Total work = O(lg n) steps × ((n / lg n) / 2) processors = O(n)
Formal Algorithm for Optimal Maximum
In: a[]Out: maximum element in a
optimal_maximum(a) {
  n = a.length
  block_size = ⌈lg n⌉
  block_count = ⌈n / block_size⌉
  create array block_results[block_count]
  for i = 0 to block_count – 1 in parallel {
    start = i × block_size
    end = min(n – 1, start + block_size – 1)
    block_results[i] = sequential_maximum(a[start .. end])
  }
  return parallel_maximum(block_results)
}
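A sequential Python sketch of the two phases (here `pairwise_maximum` plays the role of parallel_maximum, written as a compact doubling loop):

```python
from math import ceil, log2

def sequential_maximum(block):
    m = block[0]
    for x in block[1:]:
        if x > m:
            m = x
    return m

def pairwise_maximum(a):
    """The O(lg n) doubling reduction; each inner loop is conceptually parallel."""
    a = list(a)
    step = 1
    while step < len(a):
        for lo in range(0, len(a) - step, 2 * step):
            a[lo] = max(a[lo], a[lo + step])
        step *= 2
    return a[0]

def optimal_maximum(a):
    n = len(a)
    if n <= 2:
        return sequential_maximum(a)
    block_size = ceil(log2(n))
    # phase 1: each (conceptual) processor scans one lg n-sized block
    block_results = [sequential_maximum(a[i:i + block_size])
                     for i in range(0, n, block_size)]
    # phase 2: O(lg n)-time parallel reduction over the block maxima
    return pairwise_maximum(block_results)

print(optimal_maximum([15, 4, 10, 6, 1, 5, 7, 11, 12, 14, 13, 8, 9, 16, 2, 3]))  # 16
```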
Some Notes
• All CR algorithms can be converted to ER algorithms!
  – “Broadcasting” an ER variable to all processors for concurrent access takes O(lg n) parallel time
• maximum is a “semigroup algorithm”
  – Semigroup = a set of elements + an associative binary operation (max, min, +, ×, etc.)
  – The same accelerated-cascading method can be applied to min-element, summation, product of n numbers, etc.!
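Because only associativity is used, the cascade generalizes to any semigroup operation. The sketch below (the name `cascaded_reduce` is mine, not from the reference) parameterizes the two phases over the operation:

```python
import operator
from math import ceil, log2

def cascaded_reduce(a, op):
    """Accelerated cascading for an arbitrary associative operation op."""
    n = len(a)
    block_size = ceil(log2(n)) if n > 1 else 1
    # phase 1: sequential fold inside each lg n-sized block
    blocks = []
    for i in range(0, n, block_size):
        acc = a[i]
        for x in a[i + 1:i + block_size]:
            acc = op(acc, x)
        blocks.append(acc)
    # phase 2: pairwise (conceptually parallel) reduction over block results
    step = 1
    while step < len(blocks):
        for lo in range(0, len(blocks) - step, 2 * step):
            blocks[lo] = op(blocks[lo], blocks[lo + step])
        step *= 2
    return blocks[0]

print(cascaded_reduce([1, 2, 3, 4, 5], operator.add))  # 15
print(cascaded_reduce([3, 1, 4, 1, 5], min))           # 1
```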
Other Parallel Architectures
PRAM may not be the best model
• Shared memory = expensive!
  – Some algorithms require communication between processors (= memory locking issues)
  – Better to use channels!
• Extreme case: very simple processors with no shared memory (just communication channels)
Circuits
• Each processor is a gate with a specialized function (e.g., comparator gate)
• Circuit = a layout of gates to perform a full task (e.g., sorting)
Comparator gate: inputs x and y; outputs min(x, y) and max(x, y)
Sorting circuit for 4 elements (depth of network = 3)
(Figure: the inputs 17, 42, 23, 7 flow through three comparator steps:
17 42 23 7 → 17 42 7 23 → 7 23 17 42 → 7 17 23 42, emerging sorted.)
Sorting circuit for n elements?
• Simpler problem: max element
• Idea: Add as many of these diagonals as needed
Odd-Even Transposition Network
• Theorem: The odd-even transposition network sorts n numbers in n steps using O(n²) comparators
Input:  18 42 31 56 12 11 19 34
Step 1: 18 42 31 56 11 12 19 34
Step 2: 18 31 42 11 56 12 19 34
Step 3: 18 31 11 42 12 56 19 34
Step 4: 18 11 31 12 42 19 56 34
Step 5: 11 18 12 31 19 42 34 56
Step 6: 11 12 18 19 31 34 42 56
Steps 7–8: 11 12 18 19 31 34 42 56 (no further exchanges)
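The network maps directly to code: round r fires the comparators whose left input has index parity r mod 2. A minimal Python sketch (the comparators within one round touch disjoint pairs, so on real hardware they would fire simultaneously):

```python
def odd_even_transposition_sort(a):
    """n rounds; round r compares adjacent pairs starting at index r % 2.
    All comparators in a round act on disjoint pairs (conceptually parallel)."""
    a = list(a)
    n = len(a)
    for r in range(n):
        for i in range(r % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([18, 42, 31, 56, 12, 11, 19, 34]))
# [11, 12, 18, 19, 31, 34, 42, 56]
```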
Zero-One Principle of Sorting Networks
• Lemma: If a sorting network works correctly on all inputs consisting of only 0’s and 1’s, it works for any arbitrary input
  – Assume there is a network that sorts all 0-1 sequences but fails on some arbitrary input a0 .. an–1
  – Let b0 .. bn–1 be the output of the network on that input
  – Then there must exist s < t such that bs > bt
  – Label every ai < bs with 0 and all others with 1
  – If we run a0 .. an–1 through the network with their labels, bs’s label will be 1 and bt’s label will be 0
  – Contradiction: the network was assumed to sort 0-1 sequences properly but did not do so here!
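The lemma makes correctness testing cheap: a network on n wires needs checking only against the 2^n binary inputs rather than all n! orderings. A sketch, representing a network as a list of (i, j) comparator pairs (the 4-element network here is the standard depth-3 one):

```python
from itertools import product

def run_network(network, values):
    """Apply a comparison network given as a list of (i, j) comparators."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

# the depth-3 sorting network for 4 inputs
net4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

# zero-one principle: checking all 2^4 binary inputs suffices
ok = all(run_network(net4, bits) == sorted(bits)
         for bits in product([0, 1], repeat=4))
print(ok)  # True
```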
Correctness of the Odd-Even Transposition Network
• Assume a binary sequence a0 .. an–1
• Let ai = the first 0 in the sequence
• Two cases: i is odd or even
• To sort a0 .. ai, we need i steps (worst case)
• Induction: given that a0 .. ak (where k ≥ i) sorts in k steps, will a0 .. ak+1 get sorted in k + 1 steps?
(Figure: traces of 0-1 sequences moving through the network, illustrating the induction step)
Better Sorting Networks
• Batcher’s Bitonic Sorter (1968)
  – Depth O(lg² n), size O(n lg² n)
  – Idea: sort 2 groups (recursively), then merge using a network that can sort bitonic sequences
• AKS Network (1983)
  – Ajtai, Komlós and Szemerédi
  – Depth O(lg n), size O(n lg n)
  – Not practical! Hides a very large constant c in the c·n lg n size
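Batcher's recursion is easy to sketch sequentially: sort the two halves in opposite directions to form a bitonic sequence, then bitonic-merge. This Python sketch mirrors the network's structure (it assumes a power-of-two input size, as the classic construction does):

```python
def bitonic_merge(a, ascending):
    """Sort a bitonic sequence; one comparator column, then recurse on halves."""
    n = len(a)
    if n == 1:
        return list(a)
    half = n // 2
    a = list(a)
    for i in range(half):                      # one column of comparators
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (bitonic_merge(a[:half], ascending)
            + bitonic_merge(a[half:], ascending))

def bitonic_sort(a, ascending=True):
    """Sort the halves in opposite directions, then bitonic-merge."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([18, 42, 31, 56, 12, 11, 19, 34]))
# [11, 12, 18, 19, 31, 34, 42, 56]
```

The recursion depth of the merge is lg n and it is invoked at lg n levels of the sort, which is where the O(lg² n) network depth comes from.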
More Intelligent Processors:Processor Networks
• Star — diameter = 2
• Linear/Ring — diameter = n – 1 (or n – 2)
• Completely-connected — diameter = 1
• Mesh
Sorting on Linear Networks
• Emulate an odd-even transposition network!
• O(n) steps, work is O(n²)
  – We can’t expect better on a linear network
18 42 31 56 12 11 19
18 42 31 56 11 12 19
18 31 42 11 56 12 19
18 31 11 42 12 56 19
18 11 31 12 42 19 56
11 18 12 31 19 42 56
11 12 18 19 31 42 56
11 12 18 19 31 42 56
Sorting on Mesh Networks: Shearsort
• Arrange numbers in “boustrophedon” order
• Sort rows, sort columns, repeat

a = { 15, 4, 10, 6, 1, 5, 7, 11, 12, 14, 13, 8, 9, 16, 2, 3 }

Initial arrangement:
15  4 10  6
11  7  5  1
12 14 13  8
 9 16  2  3

After row phase:
 4  6 10 15
11  7  5  1
 8 12 13 14
16  9  3  2

After column phase:
 4  6  3  1
 8  7  5  2
11  9 10 14
16 12 13 15

After row phase:
 1  3  4  6
 8  7  5  2
 9 10 11 14
16 15 13 12

After column phase:
 1  3  4  2
 8  7  5  6
 9 10 11 12
16 15 13 14

After row phase:
 1  2  3  4
 8  7  6  5
 9 10 11 12
16 15 14 13

Done! Reading the rows in boustrophedon order gives 1 .. 16.
Sorting on Mesh Networks: Shearsort
• Theorem: Shearsort sorts n² elements in O(n lg n) steps on an n × n mesh
• We can use the Zero-One Principle!
  – Only because the algorithm is comparison-exchange
    • Can be implemented using comparators only
  – and oblivious
    • The outcome of a comparator does not influence comparisons made later on
  – (Disclaimer: the reference is actually very unclear about this)
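A compact Python sketch of shearsort (rows and columns sorted with Python's built-in sort standing in for the per-row/per-column parallel sorters; odd-index rows sort descending to maintain the boustrophedon order):

```python
from math import ceil, log2

def shearsort(grid):
    """Shearsort on an n x n grid; returns the grid in boustrophedon order."""
    n = len(grid)
    g = [list(row) for row in grid]
    for _ in range(ceil(log2(n)) + 1):
        for r in range(n):                       # row phase
            g[r].sort(reverse=(r % 2 == 1))      # odd rows descend
        for c in range(n):                       # column phase
            col = sorted(g[r][c] for r in range(n))
            for r in range(n):
                g[r][c] = col[r]
    for r in range(n):                           # final row phase
        g[r].sort(reverse=(r % 2 == 1))
    return g

grid = [[15, 4, 10, 6], [11, 7, 5, 1], [12, 14, 13, 8], [9, 16, 2, 3]]
for row in shearsort(grid):
    print(row)
```

Running it on the slides' 4 × 4 example reproduces the final arrangement 1 2 3 4 / 8 7 6 5 / 9 10 11 12 / 16 15 14 13.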
Correctness of Shearsort
0 1 0 0 1 0 0 1
0 1 1 1 0 1 1 1
0 0 1 0 1 0 0 1
1 0 0 1 0 0 1 0
1 1 1 0 1 0 1 0
0 0 0 1 1 1 0 1
0 0 1 1 1 1 1 1
1 1 0 0 1 1 1 1
After a row phase, each adjacent (ascending, descending) pair of rows is guaranteed to contribute at least one full row of 0’s or of 1’s in the next column phase:

0 0 0 0 0 1 1 1    } 1 full row of 1’s
1 1 1 1 1 1 0 0
0 0 0 0 0 1 1 1    } 1 full row of 0’s
1 1 1 0 0 0 0 0
0 0 0 1 1 1 1 1    } 1 full row of 1’s
1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1    } 1 full row of 1’s
1 1 1 1 1 1 0 0
• lg(n) × 2 phases; each phase takes n steps
• The unsorted region is guaranteed to be halved after every 2 phases:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 1 1 0 1 0 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
Distributed Algorithms
Different concerns altogether…
• Problems usually easy to parallelize
• Main problems:
  – Inherently asynchronous
  – How to broadcast data and ensure every node gets it
  – How to minimize bandwidth usage
  – What to do when nodes go down (decentralization)
  – (Do we trust the results given by the nodes?)
Examples: distributed prime searches (2, 3, 5, 7, 13, …; 2^42643801 – 1, 2^43112609 – 1, …), DES (56-bit), SETI@Home
Message-Optimality
• New language constructs:
send <M> to p receive <M> from p terminate
• Message-complexity = number of messages sent by a distributed algorithm (also uses O-notation)
Broadcast
• Initiators vs. noninitiators
• Simple case: ring network w/ one initiator
init_ring_broadcast() {
  send token to successor
  receive token from predecessor
  terminate
}

ring_broadcast() {
  receive token from predecessor
  send token to successor
  terminate
}
Theorem: init_ring_broadcast + ring_broadcast broadcasts to n machines using time and message complexity O(n)
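The theorem can be checked by simulation: the token makes exactly one trip around the ring, so hops and machines reached are both n. A small sketch (node 0 is the initiator):

```python
def ring_broadcast(n):
    """Count messages as a token travels the ring once.
    Returns (messages_sent, nodes_reached); both equal n."""
    messages = 0
    reached = {0}                # node 0 is the initiator
    node = 0
    while True:
        successor = (node + 1) % n
        messages += 1            # send token to successor
        reached.add(successor)
        node = successor
        if node == 0:            # token back at the initiator: terminate
            break
    return messages, len(reached)

print(ring_broadcast(6))  # (6, 6)
```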
Broadcast on a tree network

init_broadcast() {
  N = { q | q is a child neighbor of p }
  for each q ∈ N
    send token to q
  terminate
}

broadcast() {
  receive token from parent
  N = { q | q is a child neighbor of p }
  for each q ∈ N
    send token to q
  terminate
}
Note: no acknowledgment!
(Figure: example tree network with nodes 1–6)
Echo
• Creates a spanning tree out of any connected network

init_echo() {
  N = { q | q is a neighbor of p }
  for each q ∈ N
    send token to q
  counter = 0
  while (counter < |N|) {
    receive token
    counter = counter + 1
  }
  terminate
}

echo() {
  receive token from parent
  N = { q | q is a neighbor of p } – { parent }
  for each q ∈ N
    send token to q
  counter = 0
  while (counter < |N|) {
    receive token
    counter = counter + 1
  }
  send token to parent
  terminate
}

(Figure: a 6-node example network; a sequence of slides steps through each node’s counter as tokens spread outward and echo back to the initiator)
Theorem: init_echo + echo has time complexity O(diameter) and message complexity O(edges)
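The O(edges) message bound is visible in a simulation: every node sends one token over each incident edge (to all non-parent neighbors, plus the echo back to its parent), for 2·|E| tokens total. This sequential sketch stands in for the truly concurrent processes, and the 6-node graph is a hypothetical example (the slides' figure is not reproduced here):

```python
def echo(adj, root=0):
    """Build a spanning tree of an undirected graph (adjacency lists),
    counting one message per token sent. Returns (parent map, messages)."""
    parent = {root: None}
    messages = 0
    stack = [root]
    while stack:
        p = stack.pop()
        for q in adj[p]:
            messages += 1            # p sends one token to q
            if q not in parent:      # q's first token: adopt p as parent
                parent[q] = p
                stack.append(q)
    return parent, messages

# hypothetical 6-node network with 7 edges
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4],
       3: [1, 5], 4: [2, 5], 5: [3, 4]}
tree, msgs = echo(adj)
print(msgs)  # 14 messages for 7 edges
```

The parent map returned is one valid spanning tree; with real concurrency the tree shape depends on message timing, but the message count is always 2·|E|.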
Leader Election (for ring networks)

init_election() {
  send token, p.ID to successor
  min = p.ID
  receive token, token_id
  while (p.ID != token_id) {
    if token_id < min
      min = token_id
    send token, token_id to successor
    receive token, token_id
  }
  if (p.ID == min)
    i_am_the_leader = true
  else
    i_am_the_leader = false
  terminate
}

election() {
  i_am_the_leader = false
  do {
    receive token, token_id
    send token, token_id to successor
  } while (true)
}

Theorem: init_election + election runs in n steps with message complexity O(n²)
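The O(n²) bound comes from the worst case where every node initiates: each node's ID circulates the whole ring (n hops) before returning home, so n tokens cost n² messages. A sketch counting exactly that (it assumes distinct IDs, as the algorithm does):

```python
def ring_election(ids):
    """Worst case: every node initiates. Each token makes a full trip
    around the ring; every initiator learns the minimum ID.
    Returns (leader_id, total_messages)."""
    n = len(ids)
    messages = 0
    for start in range(n):             # each initiator launches its own token
        pos = start
        while True:
            pos = (pos + 1) % n        # forward token to successor
            messages += 1
            if pos == start:           # token is back at its originator
                break
    leader = min(ids)                  # the ID every initiator computes as min
    return leader, messages

print(ring_election([4, 7, 2, 9]))  # (2, 16)
```

With 4 nodes the count is 4 × 4 = 16 messages, matching the O(n²) bound.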