
Algorithmica (1990) 5:43-64    Algorithmica © 1990 Springer-Verlag New York Inc.

Efficient Parallel Algorithms for Graph Problems 1

Clyde P. Kruskal,2 Larry Rudolph,3 and Marc Snir4

Abstract. We present an efficient technique for parallel manipulation of data structures that avoids memory access conflicts. That is, this technique works on the Exclusive Read/Exclusive Write (EREW) model of computation, which is the weakest shared memory, MIMD machine model. It is used in a new parallel radix sort algorithm that is optimal for keys whose values are over a small range. Using the radix sort and known results for parallel prefix on linked lists, we develop parallel algorithms that efficiently solve various computations on trees and "unicycular graphs." Finally, we develop parallel algorithms for connected components, spanning trees, minimum spanning trees, and other graph problems. All of the graph algorithms achieve linear speedup for all but the sparsest graphs.

Key Words. Biconnected components, Connected components, Minimum spanning trees, Parallel algorithms, Parallel processing, PRAM, Radix sort, Spanning trees, Tree computations.

1. Introduction. We present new parallel algorithms for several important graph problems such as spanning forests, connected components, biconnected components, and (undirected) minimum cost spanning trees (MST). All of the algorithms are improvements over previous results in several respects: they assume the weakest model of shared memory parallel computation, i.e., the EREW (Exclusive Read, Exclusive Write) model; they are deterministic; they are fast on all graph densities. In addition, the input is not required to be in any prearranged order, such as adjacency lists; the input can consist of an unordered list of edges. For graphs that are not extremely sparse (i.e., as long as n = o(e)), and if the problem size is large relative to the number of processors, the algorithms achieve a linear speedup, i.e., they have optimal processor-time product. A basic component of our graph algorithms is a parallel version of radix sort that is of independent interest since it is optimal as long as the range of the items to be sorted is at most polynomially larger than the number of processors.

The algorithms employ a set of useful techniques for efficient parallel manipulation of data structures that avoid memory-access conflicts between processors. The most powerful and original technique is a recursive two-step, block-based

1 Part of this work was done while the first author was at the University of Illinois, Urbana-Champaign, the second author was at Carnegie-Mellon University, and the third author was at the Hebrew University and the Courant Institute of Mathematical Sciences, New York University. A preliminary version of this work was presented at the 1986 International Conference on Parallel Processing. 2 Computer Science Department, University of Maryland, College Park, MD 20742, USA. 3 Institute of Mathematics and Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel. 4 IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA.

Received April, 1986; revised December, 1987, and June, 1988. Communicated by F. Thomson Leighton.


algorithm that ensures exclusive memory access while at the same time distributing work so that all the processors are always performing useful actions.

The overall structure of the paper consists of two parts: Sections 2-5 describe basic building blocks and only in Section 6 are the graph algorithms described. Section 2 reviews the model and related research on solving linked-list problems in parallel. In Section 3 we describe the recursive two-step, block-based technique and its use in the radix-sort algorithm. In Section 4 we show how a graph can be converted from an edge-list representation to an adjacency-list representation efficiently in parallel. Section 5 shows how tree problems can be solved efficiently in parallel. These basic building blocks are then used in Section 6 to obtain parallel algorithms for finding spanning trees, connected components, biconnected components, and minimum spanning trees.

2. Preliminaries. We assume the PRAM computation model: a PRAM consists of p autonomous processors, all having access to a common memory. At each step, each processor performs one operation from its instruction stream. Instructions accessing shared memory are also assumed to be accomplished in one cycle. Borrowing the notation of Snir [Sn], we distinguish three variants of the PRAM family: In a Concurrent Read, Concurrent Write (CRCW) model, processors may simultaneously access the same memory location. There are various schemes for resolving write conflicts. In a Concurrent Read, Exclusive Write (CREW) model, several processors may simultaneously read the value of a memory location, but exclusive access is required for writes. Finally, in the Exclusive Read, Exclusive Write (EREW) model, a memory location cannot be simultaneously accessed by more than one processor. Our algorithms can be implemented on the weakest of these three models: the EREW PRAM.

The product computation problem is to compute the product a_0 ∘ a_1 ∘ ··· ∘ a_{n-1}, given n items a_0, a_1, ..., a_{n-1} and a binary operation, denoted ∘. The initial prefix problem is to compute all n initial prefixes a_0, a_0 ∘ a_1, ..., a_0 ∘ a_1 ∘ ··· ∘ a_{n-1}. The initial prefix problem, when solved in parallel, is known as the parallel prefix problem. In this paper we consider only associative operations (such as ordinary and Boolean addition and multiplication). When the items are arranged in consecutive memory locations, there are well-known methods to solve both problems in time O(n/p + log p) (see [LF] or [Sc]).
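To make the array case concrete, the following is a small Python sketch (ours, not from the paper) of one standard scheme of this kind: an up-sweep that combines pairs at growing strides, followed by a down-sweep that fills in the remaining prefixes. The function name and the power-of-two length restriction are assumptions made for brevity. Each inner loop is a set of independent updates, which is exactly the work the p processors would share, giving the O(n/p + log p) bound.

    import operator

    def prefix_scan(a, op=operator.add):
        # Inclusive prefix of a under op; each level's loop would run in parallel.
        n = len(a)
        assert n and n & (n - 1) == 0, "power-of-two length, for simplicity"
        s = a[:]
        stride = 1
        while stride < n:                                    # up-sweep
            for i in range(2 * stride - 1, n, 2 * stride):   # parallel for
                s[i] = op(s[i - stride], s[i])
            stride *= 2
        stride = n // 2
        while stride >= 2:                                   # down-sweep
            for i in range(stride - 1 + stride // 2, n, stride):  # parallel for
                s[i] = op(s[i - stride // 2], s[i])
            stride //= 2
        return s

    print(prefix_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]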

The problem is harder to solve when the items are ordered by a linear linked list. All the items are accessible but it is not known which item is which. Only the location of the first item is given, along with a map Succ from the ith item to the (i+1)st item, and a map Pred from the ith item to the (i-1)st item (the second mapping need not be given since it can be computed from the first mapping in O(n/p) time).

It is easy to solve the product and initial prefix problems sequentially in O(n) time by starting at the first item and following the links. Wyllie [W] showed that this problem can be solved with p < n processors in O((n log n)/p) time, using a recursive doubling technique. His algorithm never achieves linear speedup.
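The recursive-doubling idea can be sketched as follows (a sequential Python simulation; the dictionary representation, names, and the use of Succ rather than Pred are our assumptions). Every node repeatedly jumps its successor pointer while accumulating values, so each of the O(log n) synchronous rounds performs O(n) work; this is where the O((n log n)/p) bound, and the lack of linear speedup, come from. The sketch computes, for each item, the product from that item to the tail; running the same rounds with the Pred map yields the initial prefixes.

    def wyllie_suffix(succ, val):
        # succ[u] is the next node or None; each while-iteration is one
        # synchronous parallel round, so reads use the previous round's state.
        nxt, s = dict(succ), dict(val)
        while any(v is not None for v in nxt.values()):   # O(log n) rounds
            nxt_old, s_old = dict(nxt), dict(s)
            for u in nxt:                                 # parallel for
                v = nxt_old[u]
                if v is not None:
                    s[u] = s_old[u] + s_old[v]            # accumulate, then
                    nxt[u] = nxt_old[v]                   # jump the pointer
        return s

    succ = {0: 1, 1: 2, 2: 3, 3: None}
    print(wyllie_suffix(succ, {0: 1, 1: 2, 2: 3, 3: 4}))  # {0: 10, 1: 9, 2: 7, 3: 4}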


Subsequently, various parallel algorithms achieving linear speedup for solving the product and initial prefix problems on linked lists have been suggested. In particular, Vishkin [V] presents several probabilistic algorithms. Kruskal et al. [KRS] present a deterministic, EREW algorithm requiring parallel time O((n log n)/(p log(2n/p))), which thus achieves linear speedup for p ≤ n^{1-ε}, ε > 0. Cole and Vishkin [CV1], [CV2] present several algorithms, one of which requires parallel time O(n/p + log p) on the EREW PRAM model. It is asymptotically optimal and achieves linear speedup for n = Ω(p log p). All these algorithms share a common structure: In a first phase the list is compacted, repeatedly replacing pairs of elements by their product, until one item is left. In the second phase the list is expanded back, and all missing partial products are computed.

Product and initial prefix computations on linked lists are used as basic building blocks in most of the algorithms in this paper. Although any of the results just cited can be used, there are tradeoffs. We use the asymptotically best algorithms of Cole and Vishkin [CV2] for our analyses, although they have an inordinate amount of overhead, making them impractical. The more practical algorithms of Kruskal et al. [KRS] can be used in our graph algorithms with no loss of efficiency by decreasing the number of processors used from p to p^{1-ε}, ε > 0. Some of the probabilistic algorithms have small overhead, making them very attractive for practical use.

Linked-list parallel prefix can be used to solve the linked-list packing problem: given a list of n items, a subset of which are active, pack the active items into contiguous locations of memory and update the links accordingly. We solve this problem by finding the rank of each active item within the linked list. Assign active items the value 1 and inactive items the value 0, and find all of the partial sums of the assigned values using a linked-list parallel prefix algorithm. The partial sum for each active item is its rank. The active items can then be placed into an array by simply using the rank of each active item as its index. As a by-product, the items are packed in order so that the ith item is contiguous to the (i-1)st and (i+1)st items. This allows a linked list of the items to be recreated, consisting only of active items in their original order.
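A small Python sketch of the packing routine (names ours; a plain loop stands in for the linked-list parallel prefix): active items get rank equal to the prefix sum of the 0/1 values, and each one is then written to its rank, a single exclusive write per item.

    def pack_active(succ, head, active):
        ranks, u, count = {}, head, 0
        while u is not None:            # stands in for linked-list parallel prefix
            count += active[u]
            if active[u]:
                ranks[u] = count - 1    # rank = prefix sum of 1s, zero-based
            u = succ[u]
        packed = [None] * count
        for u, r in ranks.items():      # parallel scatter: one write per item
            packed[r] = u
        return packed                   # contiguous, original order preserved

    succ = {'a': 'b', 'b': 'c', 'c': 'd', 'd': None}
    print(pack_active(succ, 'a', {'a': 1, 'b': 0, 'c': 1, 'd': 1}))  # ['a', 'c', 'd']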

A slight variation of the parallel prefix algorithm can be used to broadcast the product of all the node values to every node in the list, within the above-stated time bound.5 We call this the product-broadcast algorithm. Product broadcast can be used to broadcast to all items in the list the value of the last item, or the value of the least item in the list. Items can be arranged into groups by marking the head item of each group, and the prefix computations can be performed separately on each group (see [Sc]). Some applications are to label each node in a list with the name of the first node in the list, or with the name of the least node in the list. The product-broadcast algorithm can also be applied to a linked circular list; the result is unique provided that ∘ is a commutative operation.

5 Apply the compaction phase of parallel prefix in the normal way. As the list is expanded assign the product to all items.


If the parallel prefix algorithm is applied to a union of disjoint lists, it will compute initial prefixes in parallel in each list individually. Similarly, if the product-broadcast algorithm is applied to a union of disjoint lists (circular lists), it will assign to each node the product of the items in the appropriate list.

3. Radix Sort. Our graph algorithms make heavy use of sorting lists of nodes and edges. The nodes in a graph can be uniquely identified by numbers in the range 0, ..., n-1, and edges by pairs of numbers each in the same range 0, ..., n-1. Because this range is so small, radix sorting can be used efficiently. Unlike the usual sorting algorithms, radix sort requires linear sequential time (in terms of the number of items and the range of values); we present an efficient parallel version of the sequential radix-sort algorithm.

Assume the n items to be sorted have keys that are restricted to the range 0, ..., m-1. The sequential algorithm creates m buckets, one for each integer in the range 0, ..., m-1. The items are placed into the buckets one at a time: an item with key value i is placed into bucket i. The nonempty buckets are then concatenated to produce the sorted list. It takes O(n) time to place the items into the buckets and then O(m) time to concatenate the buckets, for a total time of O(n + m).
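For reference, a direct Python rendering of this sequential algorithm (the function name is ours):

    def bucket_sort(items, key, m):
        # One bucket per key value; insert, then concatenate: O(n + m).
        buckets = [[] for _ in range(m)]
        for x in items:                # O(n) insertions
            buckets[key(x)].append(x)
        out = []
        for b in buckets:              # O(m) concatenation
            out.extend(b)
        return out

    print(bucket_sort([3, 1, 4, 1, 5], key=lambda x: x, m=6))  # [1, 1, 3, 4, 5]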

The parallel version of this algorithm is essentially the same, except we wish to place p items into the buckets at each time step and then traverse the buckets p at a time. The difficulty is to avoid having more than one processor attempt to place items into the same bucket at the same time. We present a recursive algorithm for solving this problem. Assume without loss of generality that p and n are powers of 2, and that p ≤ n/2.

Let r = n/p be the ratio of the problem size to the number of processors. The items are arranged in an r × p two-dimensional array A[0..r-1, 0..p-1]. The algorithm consists of two phases. In the first phase rows are processed: items are inserted into buckets so that two items in the same row with the same value are in the same bucket. In the second phase columns are processed: buckets representing the same key are merged together, resulting in one bucket per key value. The processors sweep sequentially through the rows, processing each row in parallel. Each bucket is stored as a linked list; the bucket header has pointers to the head and tail of this list.

We now describe the first phase in detail. If r ≥ p (i.e., p ≤ √n) then each processor is allocated r/p rows; each processor uses a distinct set of m buckets. The processor traverses its set of rows sequentially, inserting each item into the appropriate bucket. Since each processor accesses a different set of items and buckets there are no memory-access conflicts.

Prior to the insertions the buckets must be initialized to empty (by properly initializing the bucket header). If r is much less than m, it is important that each processor does not spend O(m) time initializing its buckets to empty. Also, we should prevent the use of uninitialized pointers, as this could lead to conflicting accesses. The standard method used in sequential computation to avoid initializing an entire array (see Problem 2.12 of [AHU]) performs indirect accesses using


uninitialized pointers; it is therefore not suitable for the EREW model. Our method is to initialize correctly only those buckets that are going to be used. There are two passes over the data. In the first pass no insertions are made; for each item, the bucket that the item is to be inserted into is initialized to empty (the same bucket may be initialized several times). In the second pass the items are actually linked into buckets. This creates a structure in which nonempty buckets have a valid format, whereas empty buckets may contain garbage. This phase requires O(n/p) time and O(pm + n) space.
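A Python sketch of the two-pass trick (ours; uninitialized memory is simulated by a garbage placeholder): the first pass initializes exactly those bucket headers that will be used, so the second pass never follows an uninitialized pointer, and untouched buckets are simply never read.

    def fill_buckets(items, key, m):
        garbage = object()
        head = [garbage] * m           # stands for uninitialized memory
        link = {}
        for x in items:                # pass 1: initialize only the used buckets
            head[key(x)] = None        # a bucket may be re-initialized; harmless
        for x in items:                # pass 2: actually link the items in
            link[x] = head[key(x)]
            head[key(x)] = x
        return head, link              # unused buckets still contain garbage

    head, link = fill_buckets([7, 2, 12], key=lambda x: x % 5, m=5)
    # bucket 2 now holds the chain 12 -> 2 -> 7; buckets 0, 1, 3, 4 hold garbage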

If p > r then each row is allocated p/r processors. The parallel algorithm is called recursively for each row, so as to insert the items in this row into m buckets.

In the second phase the lists are merged together into a global bucket structure, i.e., for each key value the min(r, p) buckets for that value are linked together. The rows of A are processed one at a time with all the processors working on the same row simultaneously. Each processor picks an item; if this item is at the tail of its bucket (i.e., has a nil pointer) the processor links the entire bucket to the corresponding bucket in the global structure. Since items in the same row with the same key value are in the same bucket, no memory conflicts occur. The two-pass method described in the first phase is used to correctly initialize nonempty buckets. Each row is processed in constant time, and the total time for the second phase is O(r) = O(n/p).

At this point, there is one bucket structure with each item in its proper bucket; empty buckets may contain garbage.

Let T_p(n) be the running time of the algorithm. Then

    T_p(n) = O(n/p)                 if p ≤ r,
    T_p(n) = T_{p/r}(n/r) + O(r)    if p > r.

The ratio r between problem size and number of processors stays constant across the successive levels of recursion; the number of collaborating processors decreases by a factor of r at each level. Thus, the number of levels of recursion is log_r p, and each level takes O(r) time. The total time for the algorithm is

    T_p(n) = O(r log_r p + n/p)
           = O(r (log p)/(log r) + n/p)
           = O((n log p)/(p log(n/p)) + n/p).

There are at most p tasks running in parallel, each using an array of size m (in addition to the space used to store the items). Thus, the space used is O(pm + n).


The last pass of the recursive algorithm can be modified so that all empty buckets are initialized correctly, with an overhead of O(m/p). At the end of the computation, the nonempty buckets can be linked together in time O(m/p + log p), creating a sorted linked list of the items. The parallel prefix algorithm [CV2] can then be used to rank the items and store them in a sorted array in time O(n/p + log p).

We have proved the following lemma.

LEMMA 3.1. A list of n items with keys in the range 0, ..., m-1 can be sorted with p ≤ n/2 processors in time O((n log p)/(p log(n/p)) + m/p) and space O(pm + n).

COROLLARY 3.2. A list of n items with keys in the range 0, ..., m^k - 1 can be sorted with p ≤ n/2 processors in time O(k((n log p)/(p log(n/p)) + m/p)) and space O(pm + n).

PROOF. As with the sequential version of radix sort, we iterate through the k radix-m digits of the keys. Using the above algorithm, the sort for each digit takes time O((n log p)/(p log(n/p)) + m/p) and space O(pm + n). Furthermore, each iteration is stable: during the execution within a block, each processor sorts sequentially using radix sort, which is stable. After that, the processors traverse the blocks, concatenating buckets representing the same digit; as long as the blocks are traversed in order, this is also stable. The result follows. □
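The iteration used in this proof is ordinary least-significant-digit radix sort; a compact sequential Python sketch (ours), in which each stable pass stands in for the parallel bucket sort of Lemma 3.1:

    def radix_sort(items, key, m, k):
        # k stable passes over the base-m digits, least significant first.
        for d in range(k):
            buckets = [[] for _ in range(m)]
            for x in items:                     # stable: preserves current order
                buckets[(key(x) // m**d) % m].append(x)
            items = [x for b in buckets for x in b]
        return items

    print(radix_sort([170, 45, 75, 90, 2, 24], key=lambda x: x, m=10, k=3))
    # [2, 24, 45, 75, 90, 170]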

COROLLARY 3.3. A list of n numbers in the range 0, ..., R-1 can be sorted with p ≤ n/2 processors in time O((n log R)/(p log(n/p))) and space O(pn^ε + n) for any constant ε > 0.

PROOF. We use the previous corollary, with R = m^k. Substituting k = log R/log m yields a sorting algorithm using time O((log R/log m)((n log p)/(p log(n/p)) + m/p)) and space O(pm + n). The time is minimized (up to a constant factor) for m = n^ε for any constant 0 < ε < 1. Thus, the sort takes time O((n log R)/(p log(n/p))) and the space is O(pn^ε + n). □

The sort algorithm has optimal running time O(n/p) when R ≤ (n/p)^{O(1)}.

4. Graph Adjacency-List Construction and Map Composition. The adjacency-list representation of a (directed) graph is particularly useful for many parallel graph algorithms. Unfortunately, this is not always the input format of a graph. Sometimes the input is a list of the edges of the graph (presented as pairs of nodes), stored in an arbitrary order in an array. This format arises naturally when a graph


is constructed during the execution of programs that use graphs for intermediate operations. The graph adjacency-list construction problem is to construct an adjacency-list representation of the graph, i.e., an array L[0..n-1] of linked lists, where L[i] is the list of nodes adjacent to node i, given an unordered list of edges.

Constructing the adjacency lists for a graph with n nodes and e edges requires only O(e + n) serial time and O(e + n) space. The array L[0..n-1] is initialized to nil in O(n) time. The adjacency lists are then constructed in O(e) time by inserting edges successively into the appropriate adjacency lists.

The parallel version using p processors is not so simple. When the graph is dense, i.e., e = Θ(n²), a matrix can be used to avoid interference. When the graph is sparse we must be careful to ensure that multiple processors do not try to insert items into the same adjacency list at the same time. We use sorting to avoid the conflicts.

THEOREM 4.1. The edge-list to adjacency-list conversion problem can be solved on an EREW model with p ≤ e/2 processors in time O((e log n)/(p log(e/p))) and space O(pe^ε + e) for any constant ε > 0.

PROOF. Each edge is represented by an ordered pair of node numbers, both in the range 0, ..., n-1. Initialize the adjacency list L[0..n-1] to nil. Sort the edges by their first component. All of the edges belonging to the same adjacency list are now in contiguous locations of the array. Traverse the edges p at a time: link each edge to the edge in the succeeding location of the array, if the two edges have the same first component; link the first edge in each list (i.e., edges with no links to them) into the appropriate location of L[0..n-1]. By Corollary 3.3 this takes time O((e log n)/(p log(e/p))) and space O(pe^ε + e). □

Note. The space requirement is proportional to the number of edges for p ≤ e^{1-ε}. If, also, e ≥ n^δ for some constant δ > 0 (which we certainly expect), the speedup is linear.
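A sequential Python sketch of the conversion (ours; the built-in sort stands in for the parallel radix sort of Corollary 3.3): once the edges are sorted by first endpoint, the edges of each adjacency list are contiguous, so linking consecutive equal-endpoint edges is free of access conflicts.

    def to_adjacency_lists(n, edges):
        edges = sorted(edges)          # stands in for the parallel radix sort
        L = [[] for _ in range(n)]
        for u, v in edges:             # each run of equal u forms one list;
            L[u].append(v)             # linking within a run is conflict-free
        return L

    print(to_adjacency_lists(4, [(2, 3), (0, 1), (2, 0)]))  # [[1], [], [0, 3], []]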

COROLLARY 4.2. Given a graph with e edges and n nodes (represented by an edge list or an adjacency list), the adjacency-list representation of the reversed graph (i.e., the graph where the direction of each edge is reversed) can be computed on an EREW model with p processors in time O((e log n)/(p log(e/p))) and space O(pe^ε + e) for any constant ε > 0.

PROOF. The proof is the same as for the previous theorem, using the second component of each edge in place of the first component. □

These results can also be used to compute the composition of mappings. Let f be a function with domain 0, ..., n-1 and range 0, ..., m-1, and let g be a function with domain 0, ..., m-1. Assume that f is given as a (not necessarily sorted) list of pairs (i, f(i)), and g is represented by an array of values, with the


ith entry storing g(i). We compute a representation of f ∘ g as a list of pairs (i, g(f(i))) using the following steps:

(1) The pairs (i, f(i)) are stored in bucket lists, indexed by the value of the second coordinate f(i).
(2) The first item of each list is marked.
(3) Each marked item (i, f(i)) performs a look-up for the value of g(f(i)), and stores this value.
(4) The value of g(f(i)) is broadcast to all items in the bucket, using the product-broadcast algorithm.
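A sequential Python sketch of these four steps (ours): the point is that g is read exactly once per distinct value f(i), one look-up per bucket, with the value then broadcast inside the bucket; this is what makes an EREW implementation possible.

    def compose(f_pairs, g):
        # f given as unsorted pairs (i, f(i)); g given as an array.
        buckets = {}
        for i, fi in f_pairs:          # step (1): bucket the pairs by f(i)
            buckets.setdefault(fi, []).append(i)
        out = []
        for fi, members in buckets.items():
            gfi = g[fi]                # steps (2)-(3): one look-up per bucket
            for i in members:          # step (4): broadcast within the bucket
                out.append((i, gfi))
        return out

    print(compose([(0, 2), (1, 2), (2, 0)], g=[9, 8, 7]))  # [(0, 7), (1, 7), (2, 9)]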

Empty buckets are never accessed, so that it is not necessary to initialize them. We have proved the following theorem.

THEOREM 4.3. The composition problem can be solved on an EREW model with p processors in time O((n log n)/(p log(n/p))) and space O(pn^ε + n) for any constant ε > 0.

If the function g is also represented by an unsorted list of pairs, the composition can be computed in time O(((m + n) log(m + n))/(p log((m + n)/p))), using the composition algorithm given by Schwartz [Sc].

5. Tree Problems

5.1. Tree Recursions. The parallel prefix algorithm can be extended to "parallel prefix" computations on a tree. This has been independently noted by other authors [AH], [CV3], [MR], [St], [TV].

Let T be a tree represented by its adjacency list. For each node u of T let u.val be a value stored at that node. Let

F(u) = Σ_{v descendant of u} v.val

(where a node is taken to be a descendant of itself). The addition is an arbitrary associative and commutative operation. The function F can also be defined by the recursion

F(u) = Σ_{v child of u} F(v) + u.val.

The tree-recursion problem is to compute the value of F at each node of the tree. The parallel prefix problem is a particular case of this general problem, when the tree degenerates into a linked list. Interesting instances of tree recursion are:

(1) Computing the number d(u) of descendants of each node, including itself. We take u.val = 1, and + is usual addition.


General Parallel Tree-Recursion Algorithm

if Tree contains more than one item then begin
    Pick a set S of nodes such that
        (1) no two nodes in S are adjacent or have the same parent;
        (2) each node of S has at most one child;
    for each node u ∈ S do begin
        /* add the value of u to the value of parent(u) */
        parent(u).val ← parent(u).val + u.val;
        delete u;
        if u has a child then
            parent(child(u)) ← parent(u);
    end;
    execute algorithm recursively;
    for each node u ∈ S do begin
        link u to parent(u);
        if u has a child then begin
            /* add the value of child(u) to u and link child(u) to u */
            u.val ← u.val + child(u).val;
            parent(child(u)) ← u;
        end
    end
end

Fig. 1

(2) Computing the height of each node in the tree. To compute one plus the height of each node, we take u.val = 1, and + is the maximum operation.

An efficient parallel algorithm for solving tree recursions works in a manner similar to a linked-list parallel prefix algorithm. The general form of such an algorithm is shown in Figure 1. It is easy to see that this algorithm solves the tree-recursion problem correctly. We have to show that a set S containing a fixed fraction of the nodes can be picked at each iteration, while spending only constant time per node, and that the EREW requirements are satisfied.

Assume that the tree T has degree bounded by two. We shall divide each iteration in the execution of the algorithm into two stages, one where left children are active, the second where right children are active. This guarantees that children of the same parent do not conflict.

Let SL be the set of nodes in the tree that have no right child. Let SL′ be the set of nodes that are in SL or that have a child in SL. Let SR and SR′ be similarly defined. Then SL′ (SR′) is the disjoint union of chains; each node in such a chain, with the possible exception of its head, is in SL (SR); each chain has length at least two. Note, too, that |SL′ ∪ SR′| ≥ |T|/2.

The general parallel tree-recursion algorithm, when restricted to nodes in SL, performs the same computation as a parallel prefix algorithm running on a set of linked lists. Thus, it is possible to recurse on SL until each chain has been reduced to one item, thereby deleting all nodes in SL, in time O(n/p + log p);


and similarly for SR. It follows that the tree size is reduced from n to at most n/2, in time O(n/p + log p). We thus obtain that the recursion can be solved with p processors for a binary tree with n nodes in time O(n/p + log² p) in the EREW model.

Now let T be a general tree. The tree T can be mapped onto a binary tree Tb, as described in Section 2.3.2 of [K]. The mapping is readily computed from the adjacency-list representation of T, with edges going from parent to child: v := left(u) if v is the first child in the adjacency list of u; v := right(u) if v follows u in the adjacency list of their common parent.

Let Fb be the function computed on the binary tree Tb by solving the tree-recursion problem on this tree. Then Fb(u) is the sum of v.val, taken over all descendants of u and descendants of younger siblings of u. We have

    F(u) = Fb(left(u)) + u.val.

Thus, F can easily be computed from Fb. We have proved the following theorem.

THEOREM 5.1. Given an adjacency-list representation of a valued tree with n nodes, the tree-recursion problem for the tree can be solved on an EREW model with p processors in time O(n/p + log² p).

This result can be improved to O(n/p + log p), which is optimal, using a more elaborate algorithm [CV3]. If the "+" operation is associative and "invertible" (as is ordinary addition), then a simple technique will obtain time O(n/p + log p), as we will see in the next section.

5.2. Tree Traversals. Given a tree T with n nodes, represented by an adjacency list (with edges going from parent to child), a linear linked list of the nodes arranged in preorder can be produced by p processors in time O(n/p). The algorithm requires the construction of an auxiliary list Taux; a similar construction is given by Wyllie [W] for binary trees.

Taux contains twice as many nodes as tree T: for each node u in T, Taux contains uL and uR. The order of these nodes is given by a mapping Next, defined as follows:

    uL.Next = vL            if v is the first child of u,
    uL.Next = uR            if u has no children;

    uR.Next = vL            if v follows u in the adjacency list of parent(u),
    uR.Next = parent(u)R    if u is the last child of parent(u).

This mapping corresponds to a traversal of the tree in the order Root-Children-Root (uL represents the first traversal of u, and uR represents the second traversal). This is a depth-first search (dfs) traversal of the tree. Figure 2 presents an example.

The linked list defined by the relation Next can be created by p processors in the time it takes to execute parallel prefix on a linked list. Each node entry u has the information required to compute uL.Next. Each node (root excepted) has one incoming edge, so that it occurs exactly once in an adjacency list. For each entry that terminates an adjacency list, a pointer is computed to its parent (using product broadcast by groups). Each entry then has the information required to compute uR.Next.


Fig. 2. Example of the order traversal of a tree. (i) The input tree. (ii) The preorder linked list.

A preorder list of the nodes can be obtained in time O(n/p + log p) from this list by deleting all the "uR" nodes, and shrinking the list accordingly. Applying parallel prefix to this list will provide the preorder number of each node, with linear speedup for p ≤ O(n/log n).

Applying this algorithm to a forest results in a preorder list for each tree in the forest, and computes the preorder number of each node in its tree. Postorder on trees or forests, and inorder on binary trees, can be handled in a similar manner. Note too that the adjacency list of a tree can be recreated from the list defined by the Next relation in time O(n/p).

If the "+" operation is associative and "invertible," then we can solve tree recursions in time O(n/p + log p). Form Taux, set the value of uL to 0 and the value of uR to u.val, and perform parallel prefix on the linked list. Now F(u) is simply the partial sum at uR minus the partial sum at uL.
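A sequential Python sketch of this trick for ordinary addition (the recursive tour construction and names are ours): build the Taux order, assign 0 to uL and u.val to uR, take prefix sums along the list, and subtract.

    def subtree_sums(children, val, root):
        tour = []                      # the Taux order: uL on entry, uR on exit
        def visit(u):
            tour.append(('L', u))
            for v in children.get(u, []):
                visit(v)
            tour.append(('R', u))
        visit(root)
        s, prefix = 0, {}
        for side, u in tour:           # stands in for linked-list parallel prefix
            s += val[u] if side == 'R' else 0
            prefix[(side, u)] = s
        # F(u) = partial sum at uR minus partial sum at uL
        return {u: prefix[('R', u)] - prefix[('L', u)] for u in val}

    children = {'a': ['b', 'c'], 'b': ['d', 'e']}
    val = {u: 1 for u in 'abcde'}            # with val = 1, F(u) = d(u)
    print(subtree_sums(children, val, 'a'))  # {'a': 5, 'b': 3, 'c': 1, 'd': 1, 'e': 1}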

5.3. Unicycular Graphs. In this section we show how a "unicycular" graph can be converted into a forest with the same connected components. Although algorithms on unicycular graphs are not by themselves particularly important, this routine is useful in later sections. Informally, a unicycular graph consists of a cycle with trees attached to the nodes of the cycle. Formally, a unicycular graph is a connected graph in which each node has indegree one. Since each node in a unicycular graph has a unique parent (its unique predecessor), we can define on such a graph the mapping Next used in the last section. Let Gaux be the graph created from a graph G as Taux was created from T: the nodes of Gaux consist of two copies of each node of G, and uv is an edge of Gaux iff u.Next = v. The following lemma focuses on graphs consisting of one connected component and characterizes the corresponding auxiliary graph.

LEMMA 5.2. Let C be a unicycular graph consisting of one connected component and let Caux be defined as above. Then:

(1) The graph Caux consists of exactly two cycles.
(2) The nodes uL and uR occur in distinct cycles of Caux iff u is on the cycle of C.

PROOF. It is easy to see that the Next relation is well defined on all nodes of Caux, and that it is one-to-one. It follows that Caux is a union of cycles.

The graph C can be transformed into a tree by deleting one edge, say uv, from its cycle. From the definition of Next, such a deletion causes the deletion of the


edge outgoing vR and the change of the edge incoming vL in Caux, thereby producing a linear graph. Such a transformation is only possible if Caux contains exactly two cycles.

A simple path in Caux corresponds to a sequence of steps in a depth-first search in C (the search follows edges in the reverse direction, i.e., moves from a parent to its children); the occurrence of uL in such a path corresponds to a forward move to u in the search, and an occurrence of uR corresponds to a backtrack move to u in the search. If u is on a cycle of C then a depth-first search started at u reaches u again in a forward move; the path starting at uL in Caux reaches uL again without encountering uR. Conversely, if u is not on a cycle then u is the root of a tree in C. A depth-first search that starts at u eventually returns to u in a backtrack move; the path in Caux that starts at uL reaches uR. Thus, a simple path connects uL to uR in Caux iff u is not on a cycle of C. □

Let G be a graph that is the union of disjoint unicycular components. The graph G' is a forest decomposition of G if G' is obtained from G by deleting one edge from each cycle, thereby replacing each unicycular component with a tree.

THEOREM 5.3. Let G be a graph with n nodes that is the disjoint union of unicycular components, represented by its adjacency list. The adjacency list of a forest decomposition of G can be computed in the EREW model with p processors in time O(n/p + log p).

PROOF. The following algorithm suffices. The idea is to identify a single edge on each cycle that can be deleted. We choose the smallest node uL whose companion node uR is on a different cycle. The following makes this precise.

(1) Construct the linked structure Gaux defined by the Next relation.
(2) Assign a distinct label to each circular list (e.g., the least name, in lexicographic order, of an item in the list); label each item with the label of the list to which it belongs.
(3) Mark those items uL such that uR occurs in a list distinct from the list containing uL.
(4) Mark on each list the least item uL that was marked in (3).
(5) Delete from the original adjacency list the edge leading to u, for each item uL that was marked in (4).

Each of these phases can be implemented to run within the required time. []
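The effect of these five steps can be illustrated by a sequential Python sketch (ours; it finds one deletable edge per cycle by walking parent pointers, and is a stand-in for, not an implementation of, the parallel method above):

    def forest_decomposition(parent):
        # parent[u] is u's unique predecessor; cut one incoming edge per cycle.
        color = {u: 0 for u in parent}   # 0 unseen, 1 on current walk, 2 done
        cut = []
        for s in parent:
            path, u = [], s
            while color[u] == 0:         # walk toward the cycle of s's component
                color[u] = 1
                path.append(u)
                u = parent[u]
            if color[u] == 1:            # the walk closed a new cycle at u
                cut.append((parent[u], u))
            for v in path:
                color[v] = 2
        return cut                       # edges to delete from the adjacency lists

    parent = {0: 1, 1: 2, 2: 0, 3: 0}    # a 3-cycle on {0, 1, 2}; 3 hangs off it
    print(forest_decomposition(parent))  # one cycle edge, here [(1, 0)]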

6. Graph Problems. In this section we use the techniques described above to solve various graph problems.

6.1. Spanning Forest and Connected Components. The connected-components problem is to label all the nodes in an undirected graph with an identifier so that any two nodes in the same component have the same identifier and any two nodes in different components have different identifiers. The spanning-forest problem is to find a spanning tree for each connected component. Connected


components can be derived from a spanning forest by creating a preorder list for each tree, and marking each node with the label of the first node in its list. We thus present an algorithm for finding a spanning forest.

Our algorithm for finding a spanning forest is in the spirit of most of the previous algorithms for this problem: (super)nodes are continually combined into larger supernodes. When no more combining is possible, each supernode will represent a connected component and a spanning tree will have been created for each component.

We divide the algorithm into two phases. The first phase is efficient when n and e are larger than p and the graph is relatively sparse. It produces an intermediate graph that is dense with few (super)nodes to which we apply the second phase.

6.1.1. Phase I. Let G be an (undirected) graph with e edges and n nodes. Two auxiliary graphs are used: the graph Gf is a forest that is a subgraph of G; and the graph Gs defines the supernodes of G. Each connected component (tree) of Gf is represented by one supernode in Gs. Two supernodes are adjacent if the corresponding trees are connected by an edge in G. Initially Gf contains all the nodes of G and no edges; Gs is initially identical to G. At each iteration supernodes that are connected in Gs are collapsed into a single new supernode; edges are added to Gf to merge the corresponding trees into a new tree.

The graph Gf is represented by marking the adjacency-list entries for G that belong to Gf. An explicit representation of Gs would take too long to update. Instead, for each supernode S, we keep a list of edges uv in G such that u ∈ S. The graph represented by these adjacency lists may have multiple edges and self-loops (i.e., the adjacency list of S may have several edges that connect S to the same supernode S′, as well as edges uv such that u, v ∈ S).

For each supernode, a membership list is maintained of the nodes that belong to this supernode. For quick, parallel access, these lists are stored in contiguous locations. In addition, each supernode has a weight count, which is the number of nodes belonging to it. Finally, the supernode to which a node belongs can be determined from an inverse directory. Initially, this structure is the identity map and it is updated whenever supernodes are combined into other supernodes.

Supernodes S1, S2, ..., St are combined or merged into a supernode S′ by appending the membership lists of the Si to the membership list of S′, updating the inverse directory for every node on the membership lists of the Si, and appending the adjacency lists of the Si to the adjacency list of S′.

In order to limit updates to the inverse directory, we merge supernodes having shorter membership lists into those with larger lists, and limit the number of times a list can be appended. The former is enforced by maintaining a weight or count of the size of a membership list, and the latter requirement is enforced by inactivating supernodes whose weight is above a threshold w. Phase II is then applied to these inactive supernodes.

The basic iteration first chooses an edge from each active supernode; then forms groups of supernodes that are connected by these edges; and finally, for each group, merges all the supernodes into the supernode with the largest weight.


(See Figure 3.) Note that it is possible for active supernodes to be merged into inactive ones but since there can be at most one inactive supernode in a group and it has the largest weight, inactive supernodes cannot get merged into anything else. Note also that supernodes with empty adjacency lists are also inactive; their members represent whole components that are disconnected from the rest of the graph.

Let w be the threshold weight that determines when a supernode becomes inactive. Phase I terminates when there are no more than x active supernodes. A list of the active supernodes is stored in contiguous locations in an array and is repacked at the end of each iteration of Phase I.

Each iteration of Phase I performs the following steps until there are less than x active supernodes:

(a) Remove the first edge from the adjacency list of each active supernode of Gs. Call the set of removed edges E. Note, these edges are really (supernode, node) pairs.

(b) Convert edges in E to (supernode, supernode) pairs by performing a lookup on the second component; delete self-loops. The graph defined by the remaining edges consists of a union of components, of which some are trees and the others are unicycular graphs.

(c) Create a generating forest F for the graph, by deleting edges from E (Theorem 5.3). Form a preorder list of the supernodes in each tree in F.

(d) Mark the edges of G that correspond to edges in F as belonging to Gf.
(e) In each tree of F select the supernode of largest weight, breaking ties by node name (this is the new supernode), and label all the supernodes in the tree with the name of the selected supernode. Mark all supernodes from E that change name as inactive.

(f) Update the membership list and inverse directory for supernodes. If the size of the new membership list is greater than w, mark the supernode as inactive.

(g) Link the adjacency lists of supernodes inactivated in (f) to the adjacency list of the new supernode to which they belong.

(h) Pack the list of active supernodes.

6.1.2. Phase II. Eckstein [E] has shown that a spanning forest can be found in O(e/p + n) time using either depth-first or breadth-first search on a CREW machine. The algorithm visits the nodes sequentially and visits the edges in parallel. It can be extended to run in time O(e/p + n log(e/n)) in the EREW model. This is optimal when the number of processors is small relative to the average degree of the graph, i.e., p log p ≤ O(e/n). We use this algorithm as the base case in our algorithm when there are few supernodes.

LEMMA 6.1. A spanning tree can be built from a graph containing e edges and n nodes, in the EREW model with p processors, in time O(e/p + n log(e/n)).

PROOF. We assume that the adjacency list of each node is stored in consecutive locations (this can be achieved in time O((e + n)/p + log(e + n))). Perform a depth-first search on the graph by maintaining a queue.


Fig. 3. Example of connected components. (i) The input graph. (ii) A unicyclic forest is created by choosing the first edge from each supernode. (iii) One edge is deleted from each unicycular tree. (iv) Supernodes and their edges that are left to be processed in the next iteration.


The queue initially contains a single node. One processor deletes a node, say u, from the queue, computes the number of nodes adjacent to it, say d_u, and then broadcasts the node number to min(d_u, p) processors. Each active processor then looks at ⌈d_u/p⌉ of the nodes adjacent to u. Each such node that is not marked gets marked and added to the queue. This algorithm thus takes O(Σ_u (⌈d_u/p⌉ + log d_u)) parallel time. This achieves a maximum when all the d_u are equal. Since Σ d_u = 2e, let d_u = 2e/n. This yields the desired result. □
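A sequential Python sketch of this search (ours; it uses a FIFO queue, and the inner scan over the adjacency list is the part that the min(d_u, p) processors would share):

    from collections import deque

    def spanning_tree(adj, root):
        marked, tree, q = {root}, [], deque([root])
        while q:
            u = q.popleft()            # one processor dequeues u, broadcasts it
            for v in adj[u]:           # scanned in parallel chunks of size p
                if v not in marked:    # unmarked neighbors are marked and queued
                    marked.add(v)
                    tree.append((u, v))
                    q.append(v)
        return tree

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(spanning_tree(adj, 0))       # [(0, 1), (0, 2), (2, 3)]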

6.1.3. Analysis. We must now calculate the time required for both phases of the algorithm. In doing so, we must carefully choose values for w and x that give the proper balance between the two phases.

The data structures can be efficiently initialized. The initial adjacency list for Gs can be created in O(e/p) steps; it can be packed in time O(n/p + log p). The membership lists and inverse directory for supernodes can be created in O(n/p) time, as can the modification of the data structures from the results of Phase I to those required for Phase II.

We set x to be equal to n/w, which is the (greatest) lower bound on the number of active supernodes. Let y be the number of nonisolated inactive supernodes at the end of Phase I; each of these has a membership list of size greater than w, so that

    y ≤ n/w.

Phase II is applied to at most x + y ≤ 2n/w supernodes. Let n_i be the number of active supernodes at the start of iteration i. Then

(6.1)    n_i ≥ x = n/w.

We first compute the time required for each step in the basic iteration, except for step (f). The edges picked in step (a) are (supernode, node) pairs. They are converted to (supernode, supernode) pairs by composing the mapping represented by the pairs picked in (a) with the mapping represented by the supernode inverse directory. By Theorem 4.3, this requires time

    O((n_i log n)/(p log(n_i/p)))

per iteration, and space O(pn_i^ε + n_i). By Theorem 5.3, each iteration of step (c) can be done in this time bound, and from the techniques of Section 2, the marking and packing of steps (d), (e), (g), and (h) can be done in this time also.

At each iteration one edge is deleted per active supernode; we thus have

(6.2)    Σ_i n_i ≤ e.


The total time for all iterations of all steps excluding step (f) is

    Total time_{¬f} = O(Σ_i (n_i log n)/(p log(n_i/p))).

Using the bounds in (6.1) and (6.2) we obtain

    Total time_{¬f} = O((e log n)/(p log(n/(wp)))).

We now estimate the time spent in step (f) in updating membership lists and inverse directories for supernodes. Let m_i be the number of nodes that change supernode membership at iteration i. These nodes can be copied into new supernode membership lists, and their inverse directories can be updated, in time O(m_i/p + log p), for a total of

(6.3)    Total time_f = O(Σ_i (m_i/p + log p)).

Since we always merge lists with fewer nodes onto lists with a larger number of nodes, and the maximum size of a list to be merged is at most w, we can bound the total number of updates done throughout the computation:

Σ_i m_i ≤ n log w.

Also, since at least n/w edges are deleted at each iteration, the total number k of iterations is bounded by

    k ≤ e/(n/w) = ew/n.

The maximum of the right-hand side of equation (6.3) under these two constraints is achieved when all the m_i are equal, and so

    Total time_f = O((n/p) log w + (ew/n) log p).

During Phase II, there are at most 2n/w nodes and e edges, and so

    Total time_PhaseII = O(e/p + (n/w) log(ew/n)).

Hence the total time of the algorithm (both phases) is bounded by

(6.4)    O((e log n)/(p log(n/(wp))) + (n/p) log w + (ew/n) log p + e/p + (n/w) log(ew/n)).


THEOREM 6.2. A spanning forest for a graph with n nodes and e edges (e ≥ 2p² log p) can be computed by an EREW model with p processors in time

    O((e log n)/(p log(e/(p² log p))) + (n/p) log p)

and space O(pn^ε + e) for any constant ε > 0.

PROOF. The result follows from choosing a correct value of w in equation (6.4). Let

    w = (np log p)/e.

The above timing result then follows.

A naive implementation of this algorithm would require space Ω(nw), as an array of size w has to be set aside for each membership list. A more complex memory management scheme can reduce that to O(np^{1/k}), for any fixed k: the arrays are stored as B-trees of degree O(p^{1/k}). □

The algorithm achieves a linear speedup for e = Ω(n log p) and e/(p² log p) = n^{Ω(1)}. This latter condition simplifies to e = Ω(p^{2+ε}) for any constant ε > 0.

COROLLARY 6.3. The connected components of a graph with n nodes and e edges (e ≥ 2p² log p) can be computed in the EREW model with p processors in time

    O((e log n)/(p log(e/(p² log p))) + (n/p) log p)

and space O(pn^ε + e) for any constant ε > 0.

6.2. Biconnected Components. Tarjan and Vishkin [TV] developed a parallel algorithm to compute the biconnected components of a connected graph, by reducing the problem to connected components. Given our algorithms to compute the spanning forest of a graph as well as the ability to construct preorder and postorder numberings and lists of a tree, we can use their method but improve the result:

COROLLARY 6.4. The biconnected components of a graph with n nodes and e edges (e ≥ 2p² log p) can be computed in the EREW model with p processors in time

    O((e log n)/(p log(e/(p² log p))) + (n/p) log p)

and space O(pn^ε + e) for any constant ε > 0.

PROOF. The algorithm of Tarjan and Vishkin requires the computation of a spanning tree for the graph, the computation of the preorder number of each node in this spanning tree, and the solution of several tree recurrences. We can perform each of these operations within the claimed time bound. □

6.3. Minimum Spanning Tree. The spanning-forest algorithm just presented can be modified to give a minimum spanning-tree algorithm.


General Parallel Algorithm for MST

Gf := (V, ∅); Gs := G;
while Gs has more than one node do begin
    1. pick the least-cost outgoing edge from each node of Gs; let U be the subgraph of G containing these edges (U is the disjoint union of unicycular graphs);
    2. delete an edge from each cycle of U (note that all edges of a cycle have the same cost); let F be the resulting graph (F is a forest);
    3. add the edges of F to Gf;
    4. combine supernodes of Gs that belong to the same tree of F into one supernode
end

Fig. 4

A sequential algorithm for minimum spanning tree merges supernodes iteratively. At each iteration a supernode is chosen; the least-cost edge exiting this supernode is used to merge it with another supernode. The order in which the supernodes are chosen is arbitrary (see [CT]).

A parallel version of this general strategy merges supernodes in parallel, provided that the merging is consistent with some serial execution of the above algorithm. Let Gs, the supernode graph, and Gf, the spanning-forest graph, be defined as in the previous algorithm. A general parallel algorithm that finds a minimum-cost spanning tree for a connected graph G = (V, E) has the form shown in Figure 4. The number of supernodes is at least halved at each iteration, so that at most log n iterations are performed.

This general algorithm is implemented using data structures similar to those used for the spanning-tree algorithm: The spanning forest Gf is represented by marking edges in the adjacency list of G. Each edge is represented by a quintuple (node1, supernode1, node2, supernode2, weight). The supernode graph Gs is represented by maintaining for each supernode a list of pointers to incident edges, along with a flag indicating which of the two endpoints belongs to the supernode. We assume that p ≤ n, e.

(a) A least-cost edge is picked in each adjacency list of Gs (breaking ties by node number) by running a product-broadcast algorithm on the lists of edges. This requires time O(e/p + log p). Let U be the unicycular forest defined by these edges.

(b) An edge is deleted from each cycle of U to create a forest F. (When each node picks the minimum-weight edge, breaking ties by node number, the cycles can only be of length two and can be broken using only a constant number of local operations.)

(c) The new edges are added to Gf in time O(e/p).
(d) All the supernodes in each tree are ordered into a list and their incidence lists concatenated. The supernode fields of the incident edges are then updated. All this requires time O(e/p + log p).

(e) Self-loops are deleted from the set of edges and from the incidence lists in time O(e/p).

Each iteration requires time O(e/p + log p). Since there are at most log n iterations, we obtain the following theorem.


Table 1

Problem                     Model   Time                                            Linear speedup     Reference

Prefix (linked list),       EREW    (n log n)/p                                     --                 [W]
  tree computations         EREW    (n log n)/(p log(2n/p))                         p ≤ n^{1-ε}        [KRS]

Radix sort,                 EREW    (n log R)/(p log(n/p))                          p ≤ n^{1-ε}        This paper
  range {1, ..., R}

Graph conversion            EREW    (n log n)/(p log(2n/p))                         p ≤ n^{1-ε}        This paper

Spanning forest,            CRCW    (e log n)/p                                     --                 [AS]
  connected components      CREW    (e log n)/p + (log n)(log p)                    --                 [KR]
                            EREW    (e log n)/(p log(e/(p² log p))) + (n/p) log p   p ≤ e^{1/2-ε},     This paper
                                                                                      log p ≤ e/n

Biconnected components      CRCW    (e log n)/p                                     --                 [TV]
                            EREW    (e log n)/(p log(e/(p² log p))) + (n/p) log p   p ≤ e^{1/2-ε},     This paper
                                                                                      log p ≤ e/n

Minimum spanning tree       CRCW    (e log n)/p                                     p ≤ e              [AS]
                            CREW    (e log n)/p + (log n)(log p)                    p ≤ e/log e        [KR]
                            EREW    (e log n)/p + (log n)(log p)                    p ≤ e/log e        This paper


THEOREM 6.5. A minimum-cost spanning tree of a connected graph can be computed in an EREW model with p ≤ e, n processors in time

    O((e log n)/p + log p log n).

Note. This algorithm is efficient relative to Kruskal's algorithm provided that p ≤ O(e/log e).
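The general strategy of Figure 4 is, in essence, Borůvka's algorithm. A compact sequential Python sketch (ours; the pointer-chasing stand-in for the inverse directory and the distinct-weight assumption are simplifications):

    def boruvka_mst(n, edges):
        # edges: (weight, u, v) triples; weights assumed distinct (the paper
        # breaks ties by node number, which has the same effect).
        comp = list(range(n))                  # simplified inverse directory
        def find(u):
            while comp[u] != u:
                u = comp[u]
            return u
        mst = []
        while True:
            best = {}                          # least-cost edge per supernode
            for w, u, v in edges:
                cu, cv = find(u), find(v)
                if cu != cv:                   # skip self-loops
                    for c in (cu, cv):
                        if c not in best or best[c] > (w, u, v):
                            best[c] = (w, u, v)
            if not best:
                break
            for w, u, v in set(best.values()): # merge along the chosen edges
                cu, cv = find(u), find(v)
                if cu != cv:
                    comp[cu] = cv
                    mst.append((u, v, w))
        return mst

    edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
    print(boruvka_mst(4, edges))  # [(0, 1, 1), (1, 2, 2), (2, 3, 4)], order may vary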

7. Conclusion. Many simple and fast serial algorithms are often hard to parallelize. In many cases there is enough work that can be performed in parallel, but the challenge is to ensure that the processors do not conflict when performing this work and that different processors do not replicate the same computation. The problem is harder when sparse structures are handled: if a compact data representation is used then the data layout is irregular and it is hard to distribute work efficiently among processors. If, on the other hand, a regular data structure is used then superfluous work is performed.

We have presented techniques for working with sparse, irregular structures, and demonstrated the use of these techniques to solve several important graph problems. We believe these techniques are generally applicable to other sparse problems. It is still unknown whether large, extremely sparse graph problems can be solved efficiently, i.e., when p ≪ n and e = O(n). Table 1 summarizes our results and shows how they compare with previous results.

References

[AHU] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, Mass., 1974.
[AH] M. J. Atallah and S. E. Hambrusch, Solving Tree Problems on a Mesh-Connected Processor Array, Proc. 26th Annual Symposium on Foundations of Computer Science, Oct. 1985, 222-231.
[AS] B. Awerbuch and Y. Shiloach, New Connectivity and MSF Algorithms for Shuffle-Exchange Network and PRAM, IEEE Trans. Comput., 36, 1987, 1258-1263.
[CT] D. Cheriton and R. E. Tarjan, Finding Minimum Spanning Trees, SIAM J. Comput., 1976, 724-742.
[CV1] R. Cole and U. Vishkin, Deterministic Coin Tossing and Accelerating Cascades: Micro and Macro Techniques for Designing Parallel Algorithms, Proc. 18th Annual ACM Symposium on Theory of Computing, 1986, 206-219.
[CV2] R. Cole and U. Vishkin, Approximate and Exact Parallel Scheduling with Applications to List, Tree and Graph Problems, Proc. 27th Annual Symposium on Foundations of Computer Science, 1986, 478-491.
[CV3] R. Cole and U. Vishkin, The Accelerated Centroid Decomposition Technique for Optimal Parallel Tree Evaluation in Logarithmic Time, submitted for publication.
[E] D. M. Eckstein, Parallel Processing Using Depth-First and Breadth-First Search, Ph.D. Thesis, University of Iowa, July 1977.
[K] D. E. Knuth, The Art of Computer Programming, Vol. 1, Addison-Wesley, Reading, Mass., 1973.
[KRS] C. P. Kruskal, L. Rudolph, and M. Snir, The Power of Parallel Prefix, IEEE Trans. Comput., 34, 1985, 965-968.
[KR] S. C. Kwan and W. L. Ruzzo, Adaptive Parallel Algorithms for Finding Minimum Spanning Trees, Proc. 1984 International Conference on Parallel Processing, Aug. 1984, 439-443.
[LF] R. Ladner and M. Fischer, Parallel Prefix Computation, J. Assoc. Comput. Mach., 1980, 831-838.
[MR] G. Miller and J. Reif, Parallel Tree Contraction and Its Application, Proc. 26th Annual Symposium on Foundations of Computer Science, 1985, 496-503.
[Sc] J. T. Schwartz, Ultracomputers, ACM Trans. Program. Lang. Systems, 1980, 484-521.
[Sn] M. Snir, On Parallel Searching, Proc. ACM Symposium on Distributed Computing, Aug. 1982, 242-253.
[St] Q. F. Stout, Tree-Based Graph Algorithms for Some Parallel Computers, Proc. 1985 International Conference on Parallel Processing, Aug. 1985, 727-731.
[TV] R. E. Tarjan and U. Vishkin, Finding Biconnected Components and Computing Tree Functions in Logarithmic Time, Proc. 25th Annual Symposium on Foundations of Computer Science, Oct. 1984, 12-20.
[V] U. Vishkin, Randomized Speedups in Parallel Computations, Proc. 16th Annual ACM Symposium on Theory of Computing, April 1984, 230-239.
[W] J. C. Wyllie, The Complexity of Parallel Computations, Ph.D. Dissertation, Department of Computer Science, Cornell University, Ithaca, N.Y., 1979.