Parallel and Distributed Algorithms
Eric Vidal
Reference: R. Johnsonbaugh and M. Schaefer, Algorithms (International Edition). Pearson Education, 2004.
Outline
• Introduction (case study: maximum element)
  – Work-optimality
• The Parallel Random Access Machine
  – Shared memory modes
  – Accelerated cascading
• Other Parallel Architectures (case study: sorting)
  – Circuits
  – Linear processor networks
  – (Mesh processor networks)
• Distributed Algorithms
  – Message-optimality
  – Broadcast and echo
  – (Leader election)
Introduction
Why use parallelism?
• p steps on 1 printer, 1 step on p printers
• p = speed-up factor (best case)
• Given a sequential algorithm, how can we parallelize it?
  – Some are inherently sequential (P-complete)
Case Study: Maximum Element
In: a[]
Out: maximum element in a

sequential_maximum(a) {
  n = a.length
  max = a[0]
  for i = 1 to n – 1 {
    if (a[i] > max)
      max = a[i]
  }
  return max
}
a = 21 11 23 17 48 33 22 41
Running maximum after each comparison: 21, 23, 23, 48, 48, 48, 48
O(n)
Parallel Maximum
• Idea: Use ⌈n / 2⌉ processors
• Note idle processors after the first step!
Pairwise reduction: (21 11 23 17 48 33 22 41) → (21 23 48 41) → (23 48) → (48)
O(lg n)
Work-Optimality
• Work = number of algorithmic steps × number of processors
• Work of the parallelized maximum algo = O(lg n) steps × (n / 2) processors = O(n lg n)
• Not work-optimal! The sequential algo's work is only O(n)
  – Workaround: accelerated cascading…
Formal Algorithm for Parallel Maximum
• But first!...
The Parallel Random Access Machine
The Parallel Random Access Machine (PRAM)
• New construct: parallel loop
for i = 1 to n in parallel { … }
• Assumption 1: use n processors to execute this loop (processors are synchronized)
• Assumption 2: memory shared across all processors
Example: Parallel Search
In: a[], x
Out: true if x is in a, false otherwise

parallel_search(a, x) {
  n = a.length
  found = false
  for i = 0 to n – 1 in parallel {
    if (a[i] == x)
      found = true
  }
  return found
}
Is this work-optimal?
Shared memory modes:
• Exclusive Read (ER)
• Concurrent Read (CR)
• Exclusive Write (EW)
• Concurrent Write (CW)
Real-world systems are most commonly CREW
parallel_search runs on what type?
Formal Algorithm for Parallel Maximum
In: a[]Out: maximum element in a
parallel_maximum(a) {
  n = a.length
  for i = 0 to ⌈lg n⌉ – 1 {
    for j = 0 to ⌈n / 2^(i+1)⌉ – 1 in parallel {
      if (j × 2^(i+1) + 2^i < n) // boundary check
        a[j × 2^(i+1)] = max(a[j × 2^(i+1)], a[j × 2^(i+1) + 2^i])
    }
  }
  return a[0]
}
Theorem: parallel_maximum is CREW and finds the maximum element in parallel time O(lg n) and work O(n lg n)
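The doubling reduction can be sanity-checked on an ordinary machine. The sketch below simulates the pseudocode sequentially in Python; the inner loop stands in for the `in parallel` loop (on a real PRAM its iterations would run simultaneously):

```python
import math

def parallel_maximum(a):
    """Sequential simulation of the CREW PRAM doubling reduction."""
    a = list(a)                          # work on a copy
    n = len(a)
    for i in range(math.ceil(math.log2(n)) if n > 1 else 0):
        # conceptually, all j iterations run in parallel
        for j in range(math.ceil(n / 2 ** (i + 1))):
            lo = j * 2 ** (i + 1)
            hi = lo + 2 ** i
            if hi < n:                   # boundary check
                a[lo] = max(a[lo], a[hi])
    return a[0]

print(parallel_maximum([21, 11, 23, 17, 48, 33, 22, 41]))  # 48
```

After round i, every position that is a multiple of 2^(i+1) holds the maximum of its 2^(i+1)-element block, so after ⌈lg n⌉ rounds a[0] holds the overall maximum.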
Accelerated Cascading
• Phase 1: Use sequential_maximum on blocks of lg n elements
  – We use n / lg n processors
  – O(lg n) sequential steps per processor
  – Total work = O(lg n) steps × (n / lg n) processors = O(n)
• Phase 2: Use parallel_maximum on the resulting n / lg n elements
  – lg (n / lg n) parallel steps = lg n – lg lg n = O(lg n)
  – Total work = O(lg n) steps × ((n / lg n) / 2) processors = O(n)
Formal Algorithm for Optimal Maximum
In: a[]Out: maximum element in a
optimal_maximum(a) {
  n = a.length
  block_size = ⌈lg n⌉
  block_count = ⌈n / block_size⌉
  create array block_results[block_count]
  for i = 0 to block_count – 1 in parallel {
    start = i × block_size
    end = min(n – 1, start + block_size – 1)
    block_results[i] = sequential_maximum(a[start .. end])
  }
  return parallel_maximum(block_results)
}
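A sequential Python sketch of the two phases (here `pairwise_maximum` plays the role of parallel_maximum, written as a compact doubling loop):

```python
from math import ceil, log2

def sequential_maximum(block):
    m = block[0]
    for x in block[1:]:
        if x > m:
            m = x
    return m

def pairwise_maximum(a):
    """The O(lg n) doubling reduction; each inner loop is conceptually parallel."""
    a = list(a)
    step = 1
    while step < len(a):
        for lo in range(0, len(a) - step, 2 * step):
            a[lo] = max(a[lo], a[lo + step])
        step *= 2
    return a[0]

def optimal_maximum(a):
    n = len(a)
    if n <= 2:
        return sequential_maximum(a)
    block_size = ceil(log2(n))
    # phase 1: each (conceptual) processor scans one lg n-sized block
    block_results = [sequential_maximum(a[i:i + block_size])
                     for i in range(0, n, block_size)]
    # phase 2: O(lg n)-time parallel reduction over the block maxima
    return pairwise_maximum(block_results)

print(optimal_maximum([15, 4, 10, 6, 1, 5, 7, 11, 12, 14, 13, 8, 9, 16, 2, 3]))  # 16
```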
Some Notes
• All CR algorithms can be converted to ER algorithms!
  – “Broadcasting” an ER variable to all processors for concurrent access takes O(lg n) parallel time
• maximum is a “semigroup algorithm”
  – Semigroup = a set of elements + an associative binary operation (max, min, +, ×, etc.)
  – The same accelerated-cascading method can be applied to min-element, summation, product of n numbers, etc.!
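Because only associativity is used, the cascade generalizes to any semigroup operation. The sketch below (the name `cascaded_reduce` is mine, not from the reference) parameterizes the two phases over the operation:

```python
import operator
from math import ceil, log2

def cascaded_reduce(a, op):
    """Accelerated cascading for an arbitrary associative operation op."""
    n = len(a)
    block_size = ceil(log2(n)) if n > 1 else 1
    # phase 1: sequential fold inside each lg n-sized block
    blocks = []
    for i in range(0, n, block_size):
        acc = a[i]
        for x in a[i + 1:i + block_size]:
            acc = op(acc, x)
        blocks.append(acc)
    # phase 2: pairwise (conceptually parallel) reduction over block results
    step = 1
    while step < len(blocks):
        for lo in range(0, len(blocks) - step, 2 * step):
            blocks[lo] = op(blocks[lo], blocks[lo + step])
        step *= 2
    return blocks[0]

print(cascaded_reduce([1, 2, 3, 4, 5], operator.add))  # 15
print(cascaded_reduce([3, 1, 4, 1, 5], min))           # 1
```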
Other Parallel Architectures
PRAM may not be the best model
• Shared memory = expensive!
  – Some algorithms require communication between processors (= memory locking issues)
  – Better to use channels!
• Extreme case: very simple processors with no shared memory (just communication channels)
Circuits
• Each processor is a gate with a specialized function (e.g., comparator gate)
• Circuit = a layout of gates to perform a full task (e.g., sorting)
Comparator gate: inputs x and y; outputs min(x, y) and max(x, y)
Sorting circuit for 4 elements (depth of network = 3)
(Figure: the inputs 17, 42, 23, 7 flow through three comparator steps:
17 42 23 7 → 17 42 7 23 → 7 23 17 42 → 7 17 23 42, emerging sorted.)
Sorting circuit for n elements?
• Simpler problem: max element
• Idea: Add as many of these diagonals as needed
Odd-Even Transposition Network
• Theorem: The odd-even transposition network sorts n numbers in n steps using O(n²) comparators
Input:  18 42 31 56 12 11 19 34
Step 1: 18 42 31 56 11 12 19 34
Step 2: 18 31 42 11 56 12 19 34
Step 3: 18 31 11 42 12 56 19 34
Step 4: 18 11 31 12 42 19 56 34
Step 5: 11 18 12 31 19 42 34 56
Step 6: 11 12 18 19 31 34 42 56
Steps 7–8: 11 12 18 19 31 34 42 56 (no further exchanges)
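The network maps directly to code: round r fires the comparators whose left input has index parity r mod 2. A minimal Python sketch (the comparators within one round touch disjoint pairs, so on real hardware they would fire simultaneously):

```python
def odd_even_transposition_sort(a):
    """n rounds; round r compares adjacent pairs starting at index r % 2.
    All comparators in a round act on disjoint pairs (conceptually parallel)."""
    a = list(a)
    n = len(a)
    for r in range(n):
        for i in range(r % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([18, 42, 31, 56, 12, 11, 19, 34]))
# [11, 12, 18, 19, 31, 34, 42, 56]
```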
Zero-One Principle of Sorting Networks
• Lemma: If a sorting network works correctly on all inputs consisting of only 0’s and 1’s, it works for any arbitrary input
  – Assume there is a network that sorts all 0-1 sequences but fails on some arbitrary input a0 .. an–1
  – Let b0 .. bn–1 be the output of the network on that input
  – Then there must exist s < t such that bs > bt
  – Label every ai < bs with 0 and all others with 1
  – If we run a0 .. an–1 through the network with their labels, bs’s label will be 1 and bt’s label will be 0
  – Contradiction: the network was assumed to sort 0-1 sequences properly but did not do so here!
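The lemma makes correctness testing cheap: a network on n wires needs checking only against the 2^n binary inputs rather than all n! orderings. A sketch, representing a network as a list of (i, j) comparator pairs (the 4-element network here is the standard depth-3 one):

```python
from itertools import product

def run_network(network, values):
    """Apply a comparison network given as a list of (i, j) comparators."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

# the depth-3 sorting network for 4 inputs
net4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

# zero-one principle: checking all 2^4 binary inputs suffices
ok = all(run_network(net4, bits) == sorted(bits)
         for bits in product([0, 1], repeat=4))
print(ok)  # True
```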
Correctness of the Odd-Even Transposition Network
• Assume a binary sequence a0 .. an–1
• Let ai = the first 0 in the sequence
• Two cases: i is odd or even
• To sort a0 .. ai, we need i steps (worst case)
• Induction: given that a0 .. ak (where k ≥ i) sorts in k steps, will a0 .. ak+1 get sorted in k + 1 steps?
(Figure: traces of 0-1 sequences moving through the network, illustrating the induction step)
Better Sorting Networks
• Batcher’s Bitonic Sorter (1968)
  – Depth O(lg² n), size O(n lg² n)
  – Idea: sort 2 groups (recursively), then merge using a network that can sort bitonic sequences
• AKS Network (1983)
  – Ajtai, Komlós and Szemerédi
  – Depth O(lg n), size O(n lg n)
  – Not practical! Hides a very large constant c in the c·n lg n size
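Batcher's recursion is easy to sketch sequentially: sort the two halves in opposite directions to form a bitonic sequence, then bitonic-merge. This Python sketch mirrors the network's structure (it assumes a power-of-two input size, as the classic construction does):

```python
def bitonic_merge(a, ascending):
    """Sort a bitonic sequence; one comparator column, then recurse on halves."""
    n = len(a)
    if n == 1:
        return list(a)
    half = n // 2
    a = list(a)
    for i in range(half):                      # one column of comparators
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (bitonic_merge(a[:half], ascending)
            + bitonic_merge(a[half:], ascending))

def bitonic_sort(a, ascending=True):
    """Sort the halves in opposite directions, then bitonic-merge."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([18, 42, 31, 56, 12, 11, 19, 34]))
# [11, 12, 18, 19, 31, 34, 42, 56]
```

The recursion depth of the merge is lg n and it is invoked at lg n levels of the sort, which is where the O(lg² n) network depth comes from.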
More Intelligent Processors:Processor Networks
• Star — diameter = 2
• Linear/Ring — diameter = n – 1 (or n – 2)
• Completely-connected — diameter = 1
• Mesh
Sorting on Linear Networks
• Emulate an odd-even transposition network!
• O(n) steps, work is O(n²)
  – We can’t expect better on a linear network
18 42 31 56 12 11 19
18 42 31 56 11 12 19
18 31 42 11 56 12 19
18 31 11 42 12 56 19
18 11 31 12 42 19 56
11 18 12 31 19 42 56
11 12 18 19 31 42 56
11 12 18 19 31 42 56
Sorting on Mesh Networks: Shearsort
• Arrange numbers in “boustrophedon” order
• Sort rows, sort columns, repeat

a = { 15, 4, 10, 6, 1, 5, 7, 11, 12, 14, 13, 8, 9, 16, 2, 3 }

Initial arrangement:
15  4 10  6
11  7  5  1
12 14 13  8
 9 16  2  3

After row phase:
 4  6 10 15
11  7  5  1
 8 12 13 14
16  9  3  2

After column phase:
 4  6  3  1
 8  7  5  2
11  9 10 14
16 12 13 15

After row phase:
 1  3  4  6
 8  7  5  2
 9 10 11 14
16 15 13 12

After column phase:
 1  3  4  2
 8  7  5  6
 9 10 11 12
16 15 13 14

After row phase:
 1  2  3  4
 8  7  6  5
 9 10 11 12
16 15 14 13

Done! Reading the rows in boustrophedon order gives 1 .. 16.
Sorting on Mesh Networks: Shearsort
• Theorem: Shearsort sorts n² elements in O(n lg n) steps on an n × n mesh
• We can use the Zero-One Principle!
  – Only because the algorithm is comparison-exchange
    • Can be implemented using comparators only
  – and oblivious
    • The outcome of a comparator does not influence comparisons made later on
  – (Disclaimer: the reference is actually very unclear about this)
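A compact Python sketch of shearsort (rows and columns sorted with Python's built-in sort standing in for the per-row/per-column parallel sorters; odd-index rows sort descending to maintain the boustrophedon order):

```python
from math import ceil, log2

def shearsort(grid):
    """Shearsort on an n x n grid; returns the grid in boustrophedon order."""
    n = len(grid)
    g = [list(row) for row in grid]
    for _ in range(ceil(log2(n)) + 1):
        for r in range(n):                       # row phase
            g[r].sort(reverse=(r % 2 == 1))      # odd rows descend
        for c in range(n):                       # column phase
            col = sorted(g[r][c] for r in range(n))
            for r in range(n):
                g[r][c] = col[r]
    for r in range(n):                           # final row phase
        g[r].sort(reverse=(r % 2 == 1))
    return g

grid = [[15, 4, 10, 6], [11, 7, 5, 1], [12, 14, 13, 8], [9, 16, 2, 3]]
for row in shearsort(grid):
    print(row)
```

Running it on the slides' 4 × 4 example reproduces the final arrangement 1 2 3 4 / 8 7 6 5 / 9 10 11 12 / 16 15 14 13.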
Correctness of Shearsort
0 1 0 0 1 0 0 1
0 1 1 1 0 1 1 1
0 0 1 0 1 0 0 1
1 0 0 1 0 0 1 0
1 1 1 0 1 0 1 0
0 0 0 1 1 1 0 1
0 0 1 1 1 1 1 1
1 1 0 0 1 1 1 1
After a row phase, each adjacent (ascending, descending) pair of rows is guaranteed to contribute at least one full row of 0’s or of 1’s in the next column phase:

0 0 0 0 0 1 1 1    } 1 full row of 1’s
1 1 1 1 1 1 0 0
0 0 0 0 0 1 1 1    } 1 full row of 0’s
1 1 1 0 0 0 0 0
0 0 0 1 1 1 1 1    } 1 full row of 1’s
1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1    } 1 full row of 1’s
1 1 1 1 1 1 0 0
• lg(n) × 2 phases; each phase takes n steps
• The unsorted region is guaranteed to be halved after every 2 phases:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 1 1 0 1 0 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
Distributed Algorithms
Different concerns altogether…
• Problems usually easy to parallelize
• Main problems:
  – Inherently asynchronous
  – How to broadcast data and ensure every node gets it
  – How to minimize bandwidth usage
  – What to do when nodes go down (decentralization)
  – (Do we trust the results given by the nodes?)
Examples: distributed prime searches (2, 3, 5, 7, 13, …; 2^42643801 – 1, 2^43112609 – 1, …), DES (56-bit), SETI@Home
Message-Optimality
• New language constructs:
send <M> to p receive <M> from p terminate
• Message-complexity = number of messages sent by a distributed algorithm (also uses O-notation)
Broadcast
• Initiators vs. noninitiators
• Simple case: ring network w/ one initiator
init_ring_broadcast() {
  send token to successor
  receive token from predecessor
  terminate
}

ring_broadcast() {
  receive token from predecessor
  send token to successor
  terminate
}
Theorem: init_ring_broadcast + ring_broadcast broadcasts to n machines using time and message complexity O(n)
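The theorem can be checked by simulation: the token makes exactly one trip around the ring, so hops and machines reached are both n. A small sketch (node 0 is the initiator):

```python
def ring_broadcast(n):
    """Count messages as a token travels the ring once.
    Returns (messages_sent, nodes_reached); both equal n."""
    messages = 0
    reached = {0}                # node 0 is the initiator
    node = 0
    while True:
        successor = (node + 1) % n
        messages += 1            # send token to successor
        reached.add(successor)
        node = successor
        if node == 0:            # token back at the initiator: terminate
            break
    return messages, len(reached)

print(ring_broadcast(6))  # (6, 6)
```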
Broadcast on a tree network

init_broadcast() {
  N = { q | q is a child neighbor of p }
  for each q ∈ N
    send token to q
  terminate
}

broadcast() {
  receive token from parent
  N = { q | q is a child neighbor of p }
  for each q ∈ N
    send token to q
  terminate
}
Note: no acknowledgment!
(Figure: example tree network with nodes 1–6)
Echo
• Creates a spanning tree out of any connected network

init_echo() {
  N = { q | q is a neighbor of p }
  for each q ∈ N
    send token to q
  counter = 0
  while (counter < |N|) {
    receive token
    counter = counter + 1
  }
  terminate
}

echo() {
  receive token from parent
  N = { q | q is a neighbor of p } – { parent }
  for each q ∈ N
    send token to q
  counter = 0
  while (counter < |N|) {
    receive token
    counter = counter + 1
  }
  send token to parent
  terminate
}

(Figure: a 6-node example network; a sequence of slides steps through each node’s counter as tokens spread outward and echo back to the initiator)
Theorem: init_echo + echo has time complexity O(diameter) and message complexity O(edges)
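The O(edges) message bound is visible in a simulation: every node sends one token over each incident edge (to all non-parent neighbors, plus the echo back to its parent), for 2·|E| tokens total. This sequential sketch stands in for the truly concurrent processes, and the 6-node graph is a hypothetical example (the slides' figure is not reproduced here):

```python
def echo(adj, root=0):
    """Build a spanning tree of an undirected graph (adjacency lists),
    counting one message per token sent. Returns (parent map, messages)."""
    parent = {root: None}
    messages = 0
    stack = [root]
    while stack:
        p = stack.pop()
        for q in adj[p]:
            messages += 1            # p sends one token to q
            if q not in parent:      # q's first token: adopt p as parent
                parent[q] = p
                stack.append(q)
    return parent, messages

# hypothetical 6-node network with 7 edges
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4],
       3: [1, 5], 4: [2, 5], 5: [3, 4]}
tree, msgs = echo(adj)
print(msgs)  # 14 messages for 7 edges
```

The parent map returned is one valid spanning tree; with real concurrency the tree shape depends on message timing, but the message count is always 2·|E|.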
Leader Election (for ring networks)

init_election() {
  send token, p.ID to successor
  min = p.ID
  receive token, token_id
  while (p.ID != token_id) {
    if token_id < min
      min = token_id
    send token, token_id to successor
    receive token, token_id
  }
  if (p.ID == min)
    i_am_the_leader = true
  else
    i_am_the_leader = false
  terminate
}

election() {
  i_am_the_leader = false
  do {
    receive token, token_id
    send token, token_id to successor
  } while (true)
}

Theorem: init_election + election runs in n steps with message complexity O(n²)
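The O(n²) bound comes from the worst case where every node initiates: each node's ID circulates the whole ring (n hops) before returning home, so n tokens cost n² messages. A sketch counting exactly that (it assumes distinct IDs, as the algorithm does):

```python
def ring_election(ids):
    """Worst case: every node initiates. Each token makes a full trip
    around the ring; every initiator learns the minimum ID.
    Returns (leader_id, total_messages)."""
    n = len(ids)
    messages = 0
    for start in range(n):             # each initiator launches its own token
        pos = start
        while True:
            pos = (pos + 1) % n        # forward token to successor
            messages += 1
            if pos == start:           # token is back at its originator
                break
    leader = min(ids)                  # the ID every initiator computes as min
    return leader, messages

print(ring_election([4, 7, 2, 9]))  # (2, 16)
```

With 4 nodes the count is 4 × 4 = 16 messages, matching the O(n²) bound.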