
A Taste of Parallel Algorithms
A. Broumandnia, Broumandnia@gmail.com


Learn about the nature of parallel algorithms and complexity:
• By implementing 5 building-block parallel computations
• On 4 simple parallel architectures (20 combinations)

Topics in This Chapter

2.1 Some Simple Computations

2.2 Some Simple Architectures

2.3 Algorithms for a Linear Array

2.4 Algorithms for a Binary Tree

2.5 Algorithms for a 2D Mesh

2.6 Algorithms with Shared Variables


2.1. SOME SIMPLE COMPUTATIONS

• In this section, we define five fundamental building-block computations:
1. Semigroup (reduction, fan-in) computation
2. Parallel prefix computation
3. Packet routing
4. Broadcasting, and its more general version, multicasting
5. Sorting records in ascending/descending order of their keys



Semigroup Computation
• Let ⊗ be an associative binary operator; i.e., (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z) for all x, y, z ∈ S. A semigroup is simply a pair (S, ⊗), where S is a set of elements on which ⊗ is defined. Semigroup (also known as reduction or fan-in) computation is defined as: Given a list of n values x0, x1, . . . , xn–1, compute x0 ⊗ x1 ⊗ . . . ⊗ xn–1. Common examples for the operator ⊗ include +, ×, ∧, ∨, ⊕, ∩, ∪, max, min. The operator ⊗ may or may not be commutative, i.e., it may or may not satisfy x ⊗ y = y ⊗ x (all of the above examples are, but the carry computation, e.g., is not). This last point is important; while the parallel algorithm can compute chunks of the expression using any partitioning scheme, the chunks must eventually be combined in left-to-right order. Figure 2.1 depicts a semigroup computation on a uniprocessor.



Semigroup Computation

Fig. 2.1 Semigroup computation on a uniprocessor (starting from the identity element, the values x0, x1, . . . , xn–1 are combined one per time step, t = 1 through t = n).


Parallel Semigroup Computation

Semigroup computation viewed as a tree or fan-in computation: s = x0 ⊗ x1 ⊗ . . . ⊗ xn–1 is obtained in log2 n levels.
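As a concrete illustration of the two views above, here is a minimal sketch (Python, added for illustration; it is not part of the original slides): a sequential left-to-right reduction as in Fig. 2.1, and a tree-style (fan-in) reduction that combines adjacent pairs level by level, using about log2 n levels while preserving left-to-right combining order for a non-commutative operator.

```python
from math import ceil, log2

def reduce_sequential(values, op, identity):
    """Semigroup (reduction) computation as in Fig. 2.1: one combining step per time step."""
    s = identity
    for x in values:                      # t = 1, 2, ..., n
        s = op(s, x)
    return s

def reduce_tree(values, op):
    """Tree (fan-in) reduction: adjacent pairs are combined level by level, so an
    associative but non-commutative operator is still applied in left-to-right
    order; about ceil(log2(n)) levels are needed."""
    level = list(values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                # odd element out: carry it to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

data = [5, 2, 8, 6, 3, 7, 9, 1, 4]        # the example values used later in this chapter
assert reduce_sequential(data, lambda a, b: a + b, 0) == reduce_tree(data, lambda a, b: a + b) == 45
print(reduce_tree(data, max), "computed in about", ceil(log2(len(data))), "levels")
```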


Parallel Prefix Computation

• With the same assumptions as in the preceding paragraph, a parallel prefix computation is defined as simultaneously evaluating all of the prefixes of the expression x0 ⊗ x1 ⊗ . . . ⊗ xn–1; i.e., x0, x0 ⊗ x1, x0 ⊗ x1 ⊗ x2, . . . , x0 ⊗ x1 ⊗ . . . ⊗ xn–1. Note that the ith prefix expression is si = x0 ⊗ x1 ⊗ . . . ⊗ xi. The comment about commutativity, or lack thereof, of the binary operator ⊗ applies here as well. The graph representing the prefix computation on a uniprocessor is similar to Fig. 2.1, but with the intermediate values also output.
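As a minimal numeric illustration of this definition (a Python sketch added here, not part of the slides; itertools.accumulate performs exactly this left-to-right prefix computation):

```python
from itertools import accumulate
import operator

def prefixes(values, op):
    """Return s0, s1, ..., sn-1, where si = x0 (op) x1 (op) ... (op) xi."""
    return list(accumulate(values, op))

data = [5, 2, 8, 6, 3, 7, 9, 1, 4]
print(prefixes(data, operator.add))   # prefix sums:   [5, 7, 15, 21, 24, 31, 40, 41, 45]
print(prefixes(data, max))            # prefix maxima: [5, 5, 8, 8, 8, 8, 9, 9, 9]
```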



Parallel Prefix Computation

Prefix computation on a uniprocessor: inputs x0, x1, . . . , xn–1; outputs s0, s1, . . . , sn–1.
The parallel version is much trickier than that of semigroup computation and requires a minimum of log2 n levels to compute s = x0 ⊗ x1 ⊗ x2 ⊗ . . . ⊗ xn–1.


Routing

• A packet of information resides at Processor i and must be sent to Processor j. The problem is to route the packet through intermediate processors, if needed, such that it gets to the destination as quickly as possible. The problem becomes more challenging when multiple packets reside at different processors, each with its own destination. In this case, the packet routes may interfere with one another as they go through common intermediate processors. When each processor has at most one packet to send and one packet to receive, the packet routing problem is called one-to-one communication or 1–1 routing.


Broadcasting
• Given a value a known at a certain processor i, disseminate it to all p processors as quickly as possible, so that at the end, every processor has access to, or “knows,” the value. This is sometimes referred to as one-to-all communication. The more general case of this operation, i.e., one-to-many communication, is known as multicasting. From a programming viewpoint, we make the assignments xj := a for 1 ≤ j ≤ p (broadcasting) or for j ∈ G (multicasting), where G is the multicast group and xj is a local variable in processor j.



Sorting

• Rather than sorting a set of records, each with a key and data elements, we focus on sorting a set of keys for simplicity. Our sorting problem is thus defined as: Given a list of n keys x0, x1, . . . , xn–1, and a total order ≤ on key values, rearrange the n keys as xi0, xi1, . . . , xin–1, such that xi0 ≤ xi1 ≤ . . . ≤ xin–1. We consider only sorting the keys in nondescending order. Any algorithm for sorting values in nondescending order can be converted, in a straightforward manner, to one for sorting the keys in nonascending order or for sorting records.



2.2. SOME SIMPLE ARCHITECTURES

• In this section, we define four simple parallel architectures:

1. Linear array of processors

2. Binary tree of processors

3. Two-dimensional mesh of processors

4. Multiple processors with shared variables



Linear Array

• Figure 2.2 shows a linear array of nine processors, numbered 0 to 8. The diameter of a p-processor linear array, defined as the longest of the shortest distances between pairs of processors, is D = p – 1. The (maximum) node degree, defined as the largest number of links or communication channels associated with a processor, is d = 2. The ring variant, also shown in Fig. 2.2, has the same node degree of 2 but a smaller diameter of D = ⌊p/2⌋.


Linear Array

Fig. 2.2 A linear array of nine processors and its ring variant.
Max node degree d = 2; network diameter D = p – 1 (ring: ⌊p/2⌋); bisection width B = 1 (ring: 2).


Binary Tree
• Figure 2.3 shows a binary tree of nine processors. This binary tree is balanced in that the leaf levels differ by at most 1. If all leaf levels are identical and every nonleaf processor has two children, the binary tree is said to be complete. The diameter of a p-processor complete binary tree is 2 log2(p + 1) – 2. More generally, the diameter of a p-processor balanced binary tree architecture is 2⌊log2 p⌋ or 2⌊log2 p⌋ – 1, depending on the placement of leaf nodes at the last level. Unlike the linear array, several different p-processor binary tree architectures may exist. This is usually not a problem, as we almost always deal with complete binary trees. The (maximum) node degree in a binary tree is d = 3.



Binary Tree

Fig. 2.3 A balanced (but incomplete) binary tree of nine processors.
Complete binary tree: 2^q – 1 nodes, 2^(q–1) leaves. Balanced binary tree: leaf levels differ by at most 1.
Max node degree d = 3; network diameter D = 2⌊log2 p⌋ (or 2⌊log2 p⌋ – 1); bisection width B = 1.


2D Mesh
• Figure 2.4 shows a square 2D mesh of nine processors. The diameter of a p-processor square mesh is 2√p – 2. More generally, the mesh does not have to be square. The diameter of a p-processor r × (p/r) mesh is D = r + p/r – 2. Again, multiple 2D meshes may exist for the same number p of processors, e.g., 2 × 8 or 4 × 4. Square meshes are usually preferred because they minimize the diameter. The torus variant, also shown in Fig. 2.4, has end-around or wraparound links for rows and columns. The node degree for both meshes and tori is d = 4, but a p-processor r × (p/r) torus has a smaller diameter of D = ⌊r/2⌋ + ⌊p/(2r)⌋.



2D Mesh

Max node degree d = 4; network diameter D = 2√p – 2 (torus: ≈ √p); bisection width B = √p (torus: 2√p).
Fig. 2.4 2D mesh of 9 processors and its torus variant.


Shared Memory
• A shared-memory multiprocessor can be modeled as a complete graph, in which every node is connected to every other node, as shown in Fig. 2.5 for p = 9. In the 2D mesh of Fig. 2.4, Processor 0 can send/receive data directly to/from P1 and P3. However, it has to go through an intermediary to send/receive data to/from P4, say. In a shared-memory multiprocessor, every piece of data is directly accessible to every processor (we assume that each processor can simultaneously send/receive data over all of its p – 1 links). The diameter D = 1 of a complete graph is an indicator of this direct access. The node degree d = p – 1, on the other hand, indicates that such an architecture would be quite costly to implement if no restriction is placed on data accesses.


Shared Memory

Max node degree d = p – 1; network diameter D = 1; bisection width B = ⌊p/2⌋ ⌈p/2⌉.
Costly to implement and not scalable, but conceptually simple and easy to program.
Fig. 2.5 A shared-variable architecture modeled as a complete graph.
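The degree, diameter, and bisection figures quoted on the preceding slides can be collected in one small sketch (a hypothetical Python helper, added for illustration; the square-mesh and torus entries assume p is a perfect square, and the tree diameter may be one less depending on leaf placement):

```python
from math import ceil, floor, isqrt, log2

def topology_metrics(p):
    """Max node degree d, network diameter D, and bisection width B for the
    architectures of Section 2.2, using the formulas quoted on the slides."""
    r = isqrt(p)                          # side of a square mesh (assumes p = r * r)
    return {
        "linear array":   dict(d=2,     D=p - 1,              B=1),
        "ring":           dict(d=2,     D=p // 2,             B=2),
        "binary tree":    dict(d=3,     D=2 * floor(log2(p)), B=1),
        "square mesh":    dict(d=4,     D=2 * r - 2,          B=r),
        "square torus":   dict(d=4,     D=2 * (r // 2),       B=2 * r),
        "complete graph": dict(d=p - 1, D=1,                  B=(p // 2) * ceil(p / 2)),
    }

for name, m in topology_metrics(9).items():
    print(f"{name:14s} d={m['d']:2d}  D={m['D']:2d}  B={m['B']:2d}")
```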



Architecture/Algorithm Combinations

Each of the five computations (semigroup, parallel prefix, packet routing, broadcasting, sorting) is implemented on each of the four architectures (linear array, binary tree, 2D mesh, shared variables).
We will spend more time on the linear array and binary tree, and less time on the mesh and shared memory (studied later).


2.3. ALGORITHMS FOR A LINEAR ARRAY

• Semigroup Computation. Let us consider first a special case of semigroup computation, namely, that of maximum finding. Each of the p processors holds a value initially and our goal is for every processor to know the largest of these values. A local variable, max-thus-far, can be initialized to the processor’s own data value. In each step, a processor sends its max-thus-far value to its two neighbors. Each processor, on receiving values from its left and right neighbors, sets its max-thus-far value to the largest of the three values, i.e., max(left, own, right). Figure 2.6 depicts the execution of this algorithm for p = 9 processors. The dotted lines in Fig. 2.6 show how the maximum value propagates from P6 to all other processors. Had there been two maximum values, say in P2 and P6, the propagation would have been faster. In the worst case, p – 1 communication steps (each involving sending a processor’s value to both neighbors), and the same number of three-way comparison steps, are needed. This is the best one can hope for, given that the diameter of a p-processor linear array is D = p – 1 (diameter-based lower bound).
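The maximum-finding procedure can be simulated step by step (a Python sketch added for illustration; each list entry models one processor's max-thus-far register):

```python
def linear_array_max(values):
    """Maximum finding on a linear array (Fig. 2.6): in every step, each
    processor exchanges its max-thus-far value with both neighbors and keeps
    max(left, own, right); p - 1 steps are needed in the worst case."""
    p = len(values)
    max_thus_far = list(values)
    for step in range(1, p):
        max_thus_far = [
            max(max_thus_far[max(i - 1, 0)], max_thus_far[i], max_thus_far[min(i + 1, p - 1)])
            for i in range(p)
        ]
        print(f"after step {step}: {max_thus_far}")
    return max_thus_far

linear_array_max([5, 2, 8, 6, 3, 7, 9, 1, 4])    # all entries converge to 9
```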


Semigroup Computation

Fig. 2.6 Maximum-finding on a linear array of nine processors (initial values 5, 2, 8, 6, 3, 7, 9, 1, 4; after each step every processor holds the largest value seen so far, until all identify 9).
For general semigroup computation:
Phase 1: Partial result is propagated from left to right.
Phase 2: Result obtained by processor p – 1 is broadcast leftward.


Parallel Prefix Computation
• Let us assume that we want the ith prefix result to be obtained at the ith processor, 0 ≤ i ≤ p – 1. The general semigroup algorithm described in the preceding paragraph in fact performs a semigroup computation first and then does a broadcast of the final value to all processors. Thus, we already have an algorithm for parallel prefix computation that takes p – 1 communication/combining steps. A variant of the parallel prefix computation, in which Processor i ends up with the prefix result up to the (i – 1)th value, is sometimes useful. This diminished prefix computation can be performed just as easily if each processor holds onto the value received from the left rather than the one it sends to the right. The diminished prefix sum results for the example of Fig. 2.7 would be 0, 5, 7, 15, 21, 24, 31, 40, 41.
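The same sweep can be sketched as follows (hypothetical Python, added here; the value forwarded to the right is the ordinary prefix, while the value received from the left is the diminished prefix):

```python
def linear_array_prefix(values, op, identity):
    """Prefix computation on a linear array: the running result sweeps from left
    to right in p - 1 communication/combining steps. Processor i keeps the value
    it receives from the left (diminished prefix) and forwards op(received, own)
    to the right (ordinary prefix)."""
    prefix, diminished = [], []
    from_left = identity
    for x in values:
        diminished.append(from_left)       # value received from the left neighbor
        from_left = op(from_left, x)       # value sent on to the right neighbor
        prefix.append(from_left)
    return prefix, diminished

vals = [5, 2, 8, 6, 3, 7, 9, 1, 4]
print(linear_array_prefix(vals, lambda a, b: a + b, 0))
# ([5, 7, 15, 21, 24, 31, 40, 41, 45], [0, 5, 7, 15, 21, 24, 31, 40, 41])
```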



Parallel Prefix Computation

Fig. 2.7 Computing prefix sums on a linear array of nine processors (initial values 5, 2, 8, 6, 3, 7, 9, 1, 4; final results 5, 7, 15, 21, 24, 31, 40, 41, 45).
Diminished parallel prefix computation: the ith processor obtains the result up to element i – 1.


Parallel Prefix Computation
• Thus far, we have assumed that each processor holds a single data item. Extension of the semigroup and parallel prefix algorithms to the case where each processor initially holds several data items is straightforward. Figure 2.8 shows a parallel prefix sum computation with each processor initially holding two data items. The algorithm consists of each processor doing a prefix computation on its own data set of size n/p (this takes n/p – 1 combining steps), then doing a diminished parallel prefix computation on the linear array as above (p – 1 communication/combining steps), and finally combining the local prefix result from this last computation with the locally computed prefixes (n/p combining steps). In all, 2n/p + p – 2 combining steps and p – 1 communication steps are required.
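The three phases can be sketched as follows (hypothetical Python, added for illustration, using the two-items-per-processor values of Fig. 2.8):

```python
from itertools import accumulate

def prefix_with_chunks(chunks, op, identity):
    """Prefix computation with n/p items per processor:
    phase 1 - local prefixes within each chunk (n/p - 1 combining steps),
    phase 2 - diminished prefix over the chunk totals (p - 1 steps),
    phase 3 - combine the incoming value with every local prefix (n/p steps)."""
    local = [list(accumulate(c, op)) for c in chunks]                   # phase 1
    incoming, running = [], identity
    for lp in local:                                                    # phase 2
        incoming.append(running)
        running = op(running, lp[-1])
    return [[op(d, v) for v in lp] for d, lp in zip(incoming, local)]   # phase 3

chunks = [(5, 1), (2, 6), (8, 3), (6, 2), (3, 5), (7, 3), (9, 6), (1, 7), (4, 5)]
print(prefix_with_chunks(chunks, lambda a, b: a + b, 0))
# [[5, 6], [8, 14], [22, 25], [31, 33], [36, 41], [48, 51], [60, 66], [67, 74], [78, 83]]
```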



Parallel Prefix Computation

Fig. 2.8 Computing prefix sums on a linear array with two items per processor (the figure shows the initial values, local prefixes, diminished prefixes, and final results).


Packet Routing
• To send a packet of information from Processor i to Processor j on a linear array, we simply attach a routing tag with the value j – i to it. The sign of a routing tag determines the direction in which it should move (+ = right, – = left), while its magnitude indicates the action to be performed (0 = remove the packet, nonzero = forward the packet). With each forwarding, the magnitude of the routing tag is decremented by 1. Multiple packets originating at different processors can flow rightward and leftward in lockstep, without ever interfering with each other.
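A sketch of the routing-tag mechanism (hypothetical Python; the packet payloads and processor indices are illustrative only):

```python
def route_packets(packets):
    """Route (source, destination, payload) packets on a linear array.
    Each packet carries the tag j - i: its sign gives the direction of travel
    and its magnitude the remaining distance. All packets advance one link per
    step, in lockstep, and a packet is removed when its tag reaches zero."""
    delivered = {dst: data for src, dst, data in packets if src == dst}
    in_transit = [(src, dst - src, data) for src, dst, data in packets if src != dst]
    steps = 0
    while in_transit:
        steps += 1
        still_moving = []
        for pos, tag, data in in_transit:
            pos += 1 if tag > 0 else -1        # forward the packet one link
            tag += -1 if tag > 0 else 1        # decrement the tag's magnitude
            if tag == 0:
                delivered[pos] = data          # destination reached: remove the packet
            else:
                still_moving.append((pos, tag, data))
        in_transit = still_moving
    return delivered, steps

print(route_packets([(0, 5, "a"), (7, 2, "b"), (3, 8, "c")]))
# ({5: 'a', 2: 'b', 8: 'c'}, 5) -- the step count equals the longest source-destination distance
```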



Broadcasting
• If Processor i wants to broadcast a value a to all processors, it sends an rbcast(a) (read r-broadcast) message to its right neighbor and an lbcast(a) message to its left neighbor. Any processor receiving an rbcast(a) message simply copies the value a and forwards the message to its right neighbor (if any). Similarly, receiving an lbcast(a) message causes a to be copied locally and the message forwarded to the left neighbor. The worst-case number of communication steps for broadcasting is p – 1.
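A compact sketch of the idea (hypothetical Python; it simply tracks how far the two message wavefronts have traveled and counts the steps):

```python
def broadcast(p, source, a):
    """Broadcast value a from a source processor on a p-processor linear array by
    sending an rbcast(a) message to the right and an lbcast(a) message to the
    left; each message is copied locally and forwarded one link per step."""
    x = [None] * p
    x[source] = a
    right, left, steps = source, source, 0
    while right < p - 1 or left > 0:
        steps += 1
        if right < p - 1:                  # rbcast moves one link to the right
            right += 1
            x[right] = a
        if left > 0:                       # lbcast moves one link to the left
            left -= 1
            x[left] = a
    return x, steps

print(broadcast(9, 3, "a"))   # 5 steps here; the worst case (source at an end) takes p - 1 = 8
```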



Linear Array Routing and Broadcasting

Routing and broadcasting on a linear array of nine processors.
To route from processor i to processor j: compute j – i to determine the distance and direction; right-moving and left-moving packets travel independently.
To broadcast from processor i: send a left-moving and a right-moving broadcast message.


Sorting
• We consider two versions of sorting on a linear array: with and without I/O. Figure 2.9 depicts a linear-array sorting algorithm when p keys are input, one at a time, from the left end. Each processor, on receiving a key value from the left, compares the received value with the value stored in its local register (initially, all local registers hold the value +∞). The smaller of the two values is kept in the local register and the larger value is passed on to the right. Once all p inputs have been received, we must allow p – 1 additional communication cycles for the key values that are in transit to settle into their respective positions in the linear array. If the sorted list is to be output from the left, the output phase can start immediately after the last key value has been received. In this case, an array half the size of the input list would be adequate and we effectively have zero-time sorting, i.e., the total sorting time is equal to the I/O time.
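A lockstep simulation of the input phase (hypothetical Python, added for illustration; each register starts at +∞, one key enters from the left per cycle, and a displaced key moves one link per cycle):

```python
import math

def sort_with_input_from_left(keys):
    """Sorting on a linear array with the keys input sequentially from the left
    (Fig. 2.9): a processor that receives a key keeps the smaller of that key and
    its stored value and passes the larger to the right, where it arrives in the
    following cycle. 2p - 1 cycles suffice."""
    p = len(keys)
    registers = [math.inf] * p
    incoming = [None] * p                      # key arriving at each processor this cycle
    stream = list(keys)
    for _ in range(2 * p - 1):                 # p input cycles + (p - 1) settling cycles
        if stream:
            incoming[0] = stream.pop(0)        # next key enters at the left end
        passed_on = [None] * p
        for i in range(p):
            if incoming[i] is None:
                continue
            registers[i], larger = min(registers[i], incoming[i]), max(registers[i], incoming[i])
            if larger != math.inf and i + 1 < p:
                passed_on[i + 1] = larger      # the larger key moves one link right
        incoming = passed_on
    return registers

print(sort_with_input_from_left([5, 2, 8, 6, 3, 7, 9, 1, 4]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```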


Linear Array Sorting (Externally Supplied Keys)

Fig. 2.9 Sorting on a linear array with the keys input sequentially from the left (input sequence 5, 2, 8, 6, 3, 7, 9, 1, 4).


Sorting
• If the key values are already in place, one per processor, then an algorithm known as odd–even transposition can be used for sorting. A total of p steps are required. In an odd-numbered step, odd-numbered processors compare values with their even-numbered right neighbors. The two processors exchange their values if they are out of order. Similarly, in an even-numbered step, even-numbered processors compare–exchange values with their right neighbors (see Fig. 2.10). In the worst case, the largest key value resides in Processor 0 and must move all the way to the other end of the array. This needs p – 1 right moves. One step must be added because no movement occurs in the first step. Of course one could use even–odd transposition, but this will not affect the worst-case time complexity of the algorithm for our nine-processor linear array.
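The algorithm translates almost directly into code (a Python sketch; with 0-based indexing, odd steps compare–exchange the pairs starting at position 1 and even steps the pairs starting at position 0):

```python
def odd_even_transposition_sort(keys):
    """Odd-even transposition sort on a linear array (Fig. 2.10).
    In odd-numbered steps, odd-numbered processors compare-exchange with their
    right neighbors; in even-numbered steps, even-numbered processors do.
    A total of p steps suffices for p keys."""
    a = list(keys)
    p = len(a)
    for step in range(1, p + 1):
        start = 1 if step % 2 == 1 else 0          # odd step or even step
        for i in range(start, p - 1, 2):           # these pairs all act in parallel
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]    # exchange if out of order
        print(f"after step {step}: {a}")
    return a

odd_even_transposition_sort([5, 2, 8, 6, 3, 7, 9, 1, 4])
```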




Linear Array Sorting (Internally Stored Keys)

Fig. 2.10 Odd-even transposition sort on a linear array (initial values 5, 2, 8, 6, 3, 7, 9, 1, 4).
In odd steps (1, 3, 5, . . .), odd-numbered processors compare–exchange values with their right neighbors; in even steps, even-numbered processors do.


Performance Evaluation
• Note that the odd–even transposition algorithm uses p processors to sort p keys in p compare–exchange steps. How good is this algorithm? Let us evaluate the odd–even transposition algorithm with respect to the various measures introduced in Section 1.6. The best sequential sorting algorithms take on the order of p log p compare–exchange steps to sort a list of size p. Let us assume, for simplicity, that they take exactly p log2 p steps. Then, we have T(1) = W(1) = p log2 p, T(p) = p, W(p) ≈ p^2/2, S(p) = log2 p (Minsky’s conjecture?), E(p) = (log2 p)/p, R(p) = p/(2 log2 p), U(p) = 1/2, and Q(p) = 2(log2 p)^3/p^2.
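These figures can be reproduced numerically with a short sketch (hypothetical Python; the formulas follow the usual definitions of Section 1.6: speedup S = T(1)/T(p), efficiency E = S/p, redundancy R = W(p)/W(1), utilization U = W(p)/(pT(p)), and quality Q = T(1)^3/(pT(p)^2 W(p))):

```python
from math import log2

def odd_even_measures(p):
    """Performance measures of odd-even transposition sort, one key per processor,
    assuming the best sequential sort takes exactly p * log2(p) steps."""
    T1 = W1 = p * log2(p)          # sequential time and work
    Tp = p                         # parallel time: p compare-exchange steps
    Wp = p * p / 2                 # parallel work: about p/2 comparisons in each of p steps
    return {
        "S(p)": T1 / Tp,                       # speedup     = log2 p
        "E(p)": T1 / (p * Tp),                 # efficiency  = (log2 p)/p
        "R(p)": Wp / W1,                       # redundancy  = p/(2 log2 p)
        "U(p)": Wp / (p * Tp),                 # utilization = 1/2
        "Q(p)": T1 ** 3 / (p * Tp ** 2 * Wp),  # quality     = 2(log2 p)^3 / p^2
    }

print(odd_even_measures(9))    # e.g. S(9) = log2(9) ≈ 3.17 and U(9) = 0.5
```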


Odd–Even Transposition Algorithm

• In most practical situations, the number n of keys to be sorted (the problem size) is greater than the number p of processors (the machine size). The odd–even transposition sort algorithm with n/p keys per processor is as follows. First, each processor sorts its list of size n/p using any efficient sequential sorting algorithm. Let us say this takes (n/p) log2(n/p) compare–exchange steps. Next, the odd–even transposition sort is performed as before, except that each compare–exchange step is replaced by a merge–split step in which the two communicating processors merge their sublists of size n/p into a single sorted list of size 2n/p and then split the list down the middle, one processor keeping the smaller half and the other, the larger half. For example, if P0 is holding (1, 3, 7, 8) and P1 has (2, 4, 5, 9), a merge–split step will turn the lists into (1, 2, 3, 4) and (5, 7, 8, 9), respectively. Because the sublists are sorted, the merge–split step requires n/p compare–exchange steps. Thus, the total time of the algorithm is (n/p) log2(n/p) + n. Note that the first term (local sorting) will be dominant if p < log2 n, while the second term (array merging) is dominant for p > log2 n. For p ≥ log2 n, the time complexity of the algorithm is linear in n; hence, the algorithm is more efficient than the one-key-per-processor version.
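A sketch of this n/p-keys-per-processor variant (hypothetical Python; heapq.merge plays the role of the n/p-step merge of two sorted sublists):

```python
from heapq import merge

def merge_split_sort(chunks):
    """Odd-even transposition sort with n/p keys per processor: each processor
    first sorts its own sublist, then p merge-split steps are performed in the
    odd-even pattern; in each step two neighbors merge their sorted sublists and
    split the result, the left processor keeping the smaller half."""
    chunks = [sorted(c) for c in chunks]                     # local sort: (n/p) log2(n/p) steps
    p = len(chunks)
    for step in range(1, p + 1):
        start = 1 if step % 2 == 1 else 0
        for i in range(start, p - 1, 2):
            merged = list(merge(chunks[i], chunks[i + 1]))   # merge: n/p compare-exchange steps
            half = len(chunks[i])
            chunks[i], chunks[i + 1] = merged[:half], merged[half:]
    return chunks

# example from the text: P0 holds (1, 3, 7, 8) and P1 holds (2, 4, 5, 9)
print(merge_split_sort([[1, 3, 7, 8], [2, 4, 5, 9]]))   # [[1, 2, 3, 4], [5, 7, 8, 9]]
```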
