
J. Parallel Distrib. Comput. 63 (2003) 465–480

A faster algorithm for solving linear algebraic equations on the star graph

Ramesh Chandra a and C. Siva Ram Murthy b,*,1

a Department of Computer Science, Stanford University, Stanford, CA 94305, USA
b Department of Computer Science & Engineering, Indian Institute of Technology, Madras 600036, India

Received 24 April 2000; revised 30 January 2003; accepted 7 February 2003

Abstract

The problem of solving a linear system of equations is widely encountered in many fields of science and engineering. In this paper, we present a parallel algorithm to solve the above problem on a star graph. The proposed solution (i) is based on a variant of the Gaussian elimination algorithm (GE) called the successive Gaussian elimination algorithm (SGE) (IEE Proc. Comput. Digit. Tech. 143 (4) (1996)) and (ii) supports partial pivoting to provide numerical stability. We present efficient matrix distribution techniques on the star graph. Our proposed parallel algorithm employs these techniques to reduce communication overhead during matrix operations on the star graph. We estimate the performance of our parallel algorithm and demonstrate its effectiveness by comparing it with a recent algorithm for the same problem on star graphs (IEEE Trans. Parallel Distrib. Systems 8 (8) (1997) 803).

© 2003 Elsevier Science (USA). All rights reserved.

Keywords: Linear system; Matrix decomposition; Gaussian elimination; Parallel algorithm; Star graph

1. Introduction

The problem of solving a system of N linear equations $Ax = b$ (where A is a known $N \times N$ matrix, and b and x are, respectively, the known and unknown $N \times 1$ vectors) is frequently encountered in many fields of science and engineering. Efficient numerical methods for solving this problem on uniprocessor systems have been developed, and reliable, high-quality codes are available for different cases of linear systems [21]. Recent advances in networking technology, coupled with increasing microprocessor speeds, have led to widespread interest in the use of multiprocessor systems for solving many practical problems.

Gaussian elimination (GE) is one of the most popular methods for solving the above-mentioned problem. The standard GE algorithm consists of two phases, namely, a triangularization phase and a back-substitution phase. In the triangularization phase, the linear system $Ax = b$ is transformed into $Ux = c$, where U is an upper triangular matrix. The solution set is then obtained by performing the back-substitution phase. Since the GE algorithm on a uniprocessor requires $O(N^3)$ computational steps to solve a system of N linear equations, a lot of research has been directed toward developing parallel GE implementations on different multiprocessor systems. A survey of several of the important algorithms in parallel numerical algebra can be found in [9,11,13,17].

There is growing interest in the star graph as a desirable alternative for massively parallel computing. It was proposed in [1] as an attractive alternative to the hypercube for interconnecting processors in parallel computers. The star graph is superior to the hypercube in three key properties: it has a lower degree, a smaller diameter, and a smaller average diameter than a hypercube with a similar number of processors [1,8]. So, the star graph has fewer communication links and smaller communication delays compared to the hypercube. The other desirable properties of the star graph include its regularity, vertex and edge symmetry, hierarchical structure, fault tolerance, and strong resilience. The topological properties of the star have been analyzed in [8]. Algorithms developed for the star include: broadcasting [14,24], sorting [19,22], embedding

* Corresponding author. Fax: +91-44-22578352.
E-mail addresses: [email protected] (R. Chandra), [email protected] (C.S.R. Murthy).
1 This work was supported by the Department of Science and Technology, New Delhi, India.
0743-7315/03/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved.
doi:10.1016/S0743-7315(03)00031-5


[4,7,15,16,20,23], fault tolerance [5,12], routing [8,22], computing FFT [10], and star graph variants [18]. The major disadvantage of the star is its poor scalability, since the number of processors increases as $n!$, where n is the dimension of the star. To circumvent this difficulty, a variant of the star called the incomplete star has been proposed in [18]. The incomplete star allows incremental scalability, which considerably improves the scalability of the network.

Recently, a star-graph-based parallel algorithm has been proposed in [3] to solve a linear system of equations. In this paper, we present an efficient parallel algorithm to solve the same problem on the star graph. Our algorithm is based on the successive Gaussian elimination algorithm (SGE) [6]. We compare our algorithm with the one presented in [3] and show that our algorithm is superior in performance.

The rest of the paper is organized as follows. In Section 2, we first briefly describe the star graph along with its properties and then give a brief overview of the existing solution to the above-mentioned problem on the star graph. In Section 3, we first briefly describe the SGE algorithm and present two efficient matrix distribution techniques for the star graph. Then we present our parallel algorithm for the above-mentioned problem. In Section 4, we compare our algorithm with the one presented in [3]. Finally, in Section 5, we present our conclusions.

2. Existing star-graph-based solution—AD

In this section, we give a brief overview of the existing algorithm to solve a linear system of equations on the star graph. This algorithm was presented in [3] by Al-Ayyoub and Day, and we refer to it as the AD algorithm in the remainder of this paper. Before we discuss the AD algorithm, we briefly describe the star graph and summarize some of its important properties.

2.1. The star graph

The star graph of dimension n, denoted by $S_n$, has a set of $n!$ processors corresponding to all the $n!$ permutations of n distinct symbols $\langle n \rangle = \{1, 2, \ldots, n\}$, and a set of $n-1$ generators $g_2, g_3, \ldots, g_n$, where $g_i$ is the transposition of the symbol in the ith position with the symbol in the first position. The processor u corresponding to the permutation $p_1 p_2 \ldots p_n$, where $p_i \in \langle n \rangle$, $1 \le i \le n$, is denoted by $u = (p_1 p_2 \ldots p_n)$. A processor $u = (p_1 p_2 \ldots p_n)$ is connected to the processor $u^{(i)} = (p_i p_2 \ldots p_{i-1} p_1 p_{i+1} \ldots p_n)$, for $2 \le i \le n$ (i.e., the processor corresponding to the permutation $p_1 p_2 \ldots p_n$ is connected to all those processors whose corresponding permutations result from interchanging the first symbol in $p_1 p_2 \ldots p_n$ with any of the remaining $n-1$ symbols). Processor $u^{(i)}$ is obtained from the processor u by applying the generator $g_i$ on u, and hence the link connecting u and $u^{(i)}$ is said to be of type $g_i$. Consequently, we see that the degree of each processor in $S_n$ is $n-1$. $S_n$ is regular and vertex and edge symmetric. It has a diameter of $\lfloor 3(n-1)/2 \rfloor$ and an average path length of $n + \ln n + O(1)$ [2]. The minimum distance between a pair of processors u and v in $S_n$ is denoted by $d(u, v)$. Fig. 1 shows $S_3$ and $S_4$, the star graphs with dimensions 3 and 4, respectively.
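As a concrete illustration of this adjacency rule, the following minimal Python sketch (our illustration, not from the paper; the function name `neighbors` is hypothetical) enumerates the $n-1$ neighbors of a node by applying each generator $g_i$:

```python
def neighbors(u):
    """Neighbors of node u = (p1, ..., pn) in the star graph S_n.
    Generator g_i swaps the symbol in position i with the symbol in
    position 1, so each node has degree n - 1."""
    n = len(u)
    result = []
    for i in range(1, n):          # 0-based positions 1..n-1, i.e., g_2..g_n
        v = list(u)
        v[0], v[i] = v[i], v[0]    # apply generator g_{i+1}
        result.append(tuple(v))
    return result

# Example: in S_3 every node has degree 2.
assert sorted(neighbors((1, 2, 3))) == [(2, 1, 3), (3, 2, 1)]
```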

Because of its rich symmetry, $S_n$ is easily extensible and can be partitioned in a number of ways. We describe below two partitioning methods used in the remainder of the paper. In the first partitioning scheme, the set of processors in $S_n$ is decomposed into n disjoint subsets $I_1, I_2, \ldots, I_n$, where $I_k$ is the subset of processors that end with symbol k [1]. Each of the subsets $I_k$ is isomorphic to the $(n-1)$-dimensional star $S_{n-1}$, as shown for the case $n = 4$ in Fig. 1. In the second scheme, the set of processors in $S_n$ is partitioned into n disjoint subsets $X_1, X_2, \ldots, X_n$, where $X_k$ is the subset of processors that start with the symbol k. The subsets $I_1, I_2, \ldots, I_n$ and $X_1, X_2, \ldots, X_n$ have $(n-1)!$ processors each.

Fig. 1. (a) The 3-dimensional star graph, $S_3$. (b) The 4-dimensional star graph, $S_4$. Note that $S_4$ consists of four $S_3$'s connected appropriately.

A number of processor ranking schemes have been proposed for the star graph. We describe below one of them, which we use in the remaining sections [19].

Definition 1. Let $G_n$ be a one-to-one mapping from the set of permutations $\{(p_1 p_2 \ldots p_n) \mid p_i \in \langle n \rangle,\ p_i \ne p_j \text{ for any } j \ne i\}$ onto the set of integers $\{1, 2, \ldots, n!\}$. For any permutation $p_1 p_2 \ldots p_n$, a unique integer can be generated using the following recursive function:

$$G_n(p_1 p_2 \ldots p_n) = \begin{cases} 1 & \text{if } n = 1, \\ (p_n - 1)(n-1)! + G_{n-1}(q_1 q_2 \ldots q_{n-1}) & \text{otherwise,} \end{cases}$$

where $q_1 q_2 \ldots q_{n-1}$ is obtained from $p_1 p_2 \ldots p_n$ after dropping $p_n$ and renumbering the remaining symbols from 1 to $n-1$. For any permutation $u = (p_1 p_2 \ldots p_n)$, we use $G_{n-k}(u)$ to denote $G_{n-k}(q_1 q_2 \ldots q_{n-k})$ for $1 \le k \le n-1$, where $q_1 q_2 \ldots q_{n-k}$ is obtained after dropping the last k symbols from u and renumbering the remaining symbols from 1 to $n-k$.
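The recursion of Definition 1 translates directly into code. The sketch below is our Python rendering (the name `rank` is a hypothetical stand-in for $G_n$); the assertion checks that the map is one-to-one onto $\{1, \ldots, 3!\}$:

```python
from itertools import permutations
from math import factorial

def rank(perm):
    """G_n of Definition 1: maps a permutation of 1..n to 1..n!."""
    n = len(perm)
    if n == 1:
        return 1
    last = perm[-1]
    # Drop p_n and renumber the remaining symbols from 1 to n-1.
    rest = [p - 1 if p > last else p for p in perm[:-1]]
    return (last - 1) * factorial(n - 1) + rank(rest)

# G_3 is one-to-one onto {1, ..., 3!}.
assert sorted(rank(p) for p in permutations((1, 2, 3))) == list(range(1, 7))
```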

We state below the propositions that we employ in the later sections. We borrow these propositions from [3], where they are also proved.

Proposition 1. Any processor u in $X_k$ is connected to exactly one processor v in $I_k$ by a link of type $g_n$ in $S_n$.

Proposition 2. For any two processors $u = (p_1 p_2 \ldots p_{n-1} k) \in I_k$ and $v = (q_1 q_2 \ldots q_{n-1} (k+1)) \in I_{k+1}$ such that $G_{n-1}(u) = G_{n-1}(v)$, we have $d(u, v) \le 3$.

Proposition 3. For any two processors $u = (k\, p_2 p_3 \ldots p_n) \in X_k$ and $v = ((k+1)\, q_2 q_3 \ldots q_n) \in X_{k+1}$ such that $G_{n-1}(u^{(n)}) = G_{n-1}(v^{(n)})$, we have $d(u, v) = 1$.

Proposition 4. For any two processors $u = (k\, p_2 p_3 \ldots p_n) \in X_k$ and $v = (k\, q_2 q_3 \ldots q_n) \in X_k$, we have $d(u, v) \ge 3$.

2.2. The AD algorithm

The existing solution to the problem of solving a linear system of equations on $S_n$, which we call the AD algorithm, is presented in [3] by Al-Ayyoub and Day.

The AD algorithm consists of two phases—the matrix decomposition (or triangularization) phase and the back-substitution phase. Only the matrix decomposition phase is considered in [3]. In this section, we briefly discuss the matrix decomposition phase of [3] and then present an efficient back-substitution phase to go along with it, so that the AD algorithm provides a complete solution to a linear system of equations.

Two cyclic matrix distribution techniques, namely, the star cyclic matrix distribution (SCMD) and the linear array cyclic matrix distribution (LCMD), have been proposed in [3]. These matrix distribution techniques distribute the matrix over $S_n$ in a cyclic fashion. The cyclic matrix distribution techniques offer better load balancing at the expense of increased communication cost. Since all the elements of any row of the matrix reside in a single $(n-1)$-dimensional substar in SCMD, SCMD is used for row communication. Since all the elements of any column of the matrix are connected by a linear array in LCMD, LCMD is used for column communication. A matrix in SCMD can be converted to LCMD, and vice versa, in a single communication step.

We now briefly describe the two phases of the AD algorithm.

2.2.1. The matrix decomposition phase

The matrix decomposition phase of the AD algorithm, presented in [3], requires $N-1$ steps to decompose a matrix of order N to an upper triangular form. Initially, the known vector b is appended to the $N \times N$ matrix A to obtain the $N \times (N+1)$ augmented matrix $A|b$. The matrix $A|b$ is then distributed on $S_n$ using the cyclic matrix distribution techniques. At the kth step, the task sequence performed by the processors can be informally given as follows.

1. All the processors containing the pivot column elements perform partial pivoting. At the end of this step, all the pivot column processors contain the value of the pivot element.

2. The processors containing the pivot row elements concurrently broadcast the pivot row element values along the respective columns.

3. The processors containing the pivot column elements concurrently compute the multiplier values for each row and then concurrently broadcast these multiplier values along the respective rows.

4. All the processors which do not contain either the pivot row elements or the pivot column elements use the multiplier values and the pivot row element values to update the values of the elements present in them.

The above sequence of tasks is repeated $N-1$ times, at the end of which the linear system $Ax = b$ is transformed into $Ux = c$, where U is an upper triangular matrix. The solution vector x is obtained by back substitution of vector c in the matrix U. A more formal discussion of the above matrix decomposition phase can be found in [3].
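To make the four tasks concrete, the following serial Python sketch (our illustration under standard GE conventions, not the authors' listing) performs one factorization step on the augmented matrix; the broadcasts of tasks 2 and 3, which on $S_n$ run along processor columns and rows, appear here only as comments:

```python
import numpy as np

def ad_step(E, k):
    """One factorization step on the augmented matrix E = A|b,
    written serially; comments mark tasks 1-4 of the list above."""
    # Task 1: partial pivoting on the pivot column.
    imax = k + int(np.argmax(np.abs(E[k:, k])))
    E[[k, imax]] = E[[imax, k]]
    # Task 2: the pivot row is (logically) broadcast along columns.
    pivot_row = E[k, k:].copy()
    # Task 3: multipliers, (logically) broadcast along rows.
    mult = E[k + 1:, k] / E[k, k]
    # Task 4: rank-1 update of the remaining submatrix.
    E[k + 1:, k:] -= np.outer(mult, pivot_row)
```

Repeating `ad_step` for $k = 0, \ldots, N-2$ leaves E in the upper triangular form $U|c$.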

2.2.2. The back-substitution phase

The algorithm presented in [3] gives only the matrix decomposition phase of the AD algorithm, which transforms the linear system $Ax = b$ into $Ux = c$, where U is an upper triangular matrix. Our algorithm, which we describe in a later section, provides a complete solution to a linear system of equations. Since we evaluate the performance of our algorithm against the AD algorithm, the AD algorithm should also provide a complete solution. This necessitates that a back substitution of the vector c in the matrix U follow the matrix decomposition phase in the AD algorithm. In this section, we develop an efficient star-based back-substitution phase to go along with the matrix decomposition phase of the AD algorithm. The tasks involved in the back-substitution phase of the AD algorithm are outlined below.

In the above task listing, we assume that the reduced upper triangular matrix is $U = \{u_{i,j} \mid 1 \le i, j \le N\}$, the reduced known vector is $c = \{c_i \mid 1 \le i \le N\}$, and the unknown vector is $x = \{x_i \mid 1 \le i \le N\}$. We also assume that the results after the matrix decomposition phase are still in place in the respective processors. In the above task listing, after a new $x_k \in x$ is calculated, its value is used to update the c vector so that the next variable $x_{k-1}$ can be calculated.

After identifying the various tasks in the back-substitution phase, these tasks have to be distributed among the different processors of $S_n$. We give below the algorithm executed by the processor $u = (p_1 p_2 \ldots p_n)$ of $S_n$ during the back-substitution phase. The matrix is initially distributed using the SCMD distribution. The description of the notation used in the algorithm below is as follows.

* The processor u in SCMD is denoted by $P_{R,C}$, where $P_{R,C}$ is the processor at row R and column C in the $n \times (n-1)!$ processor grid representation of $S_n$ used in SCMD.
* The processor u in LCMD is denoted by $P_{\bar R,\bar C}$, where $P_{\bar R,\bar C}$ is the processor at row $\bar R$ and column $\bar C$ in the $n \times (n-1)!$ processor grid representation of $S_n$ used in LCMD.
* $\lambda_R$ is the maximum integer such that $R + \lambda_R n \le N$.
* The procedure broadcast_linear_array performs broadcasting in a linear array in LCMD.
* The procedure exchange_submatrices is used to exchange submatrices between $P_{R,C}$ and $P_{\bar R,\bar C}$ (i.e., between SCMD and LCMD) and vice versa.

A detailed explanation of the above notation can be found in [3]. It is to be noted that the above notational description is valid for this section only. We redefine the above notation in later sections for use in our algorithm.


The above back-substitution algorithm consists of N steps. At the kth step, first the processors containing the $(k+1)$th column of matrix U send in parallel the updated vector c to the processors containing the kth column of U. Then the processor containing $u_{k,k}$ calculates $x_k$ and broadcasts it to all the processors containing the kth column of U. These processors use the value of $x_k$ to update the elements of vector c in parallel. At the end of the N steps of the back-substitution phase, the solution to the linear system is obtained.
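A serial Python sketch of this column-oriented schedule may help (our illustration; the per-column update in the loop body is exactly the work the processors holding column k perform in parallel):

```python
import numpy as np

def back_substitute(U, c):
    """Solve Ux = c for upper triangular U, folding each new x_k
    back into c exactly as in the step order described above."""
    N = len(c)
    x = np.empty(N)
    c = c.astype(float).copy()
    for k in range(N - 1, -1, -1):
        x[k] = c[k] / U[k, k]       # processor holding u_kk computes x_k
        c[:k] -= U[:k, k] * x[k]    # column-k processors update c in parallel
    return x
```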

With the addition of the back-substitution phase, the AD algorithm is completely in place. In the next section, we present our algorithm for the problem of solving a linear system of equations on $S_n$. We then demonstrate the superior performance of our algorithm by comparing it with the AD algorithm.

3. Our algorithm for the star graph

In this section, we present our star-graph-based parallel algorithm to solve the above-mentioned problem. Our algorithm is based on the SGE algorithm. We first briefly describe the SGE algorithm as given in [6]. Then we present efficient non-cyclic matrix distribution techniques for distributing a matrix on $S_n$. We then present our algorithm, which makes use of these matrix distribution techniques to reduce communication overhead in matrix operations, thus providing a faster solution to the problem considered here.

3.1. The SGE algorithm

It is well known that, in $Ax = b$, the value of $x_i$ ($i = 1, 2, \ldots, N$) depends on the value of $x_j$ ($j = 1, 2, \ldots, N$ and $i \ne j$), indicating the $(N-1)$th level dependency. Hence, the influence of the $N-1$ other unknowns has to be unraveled to find one solution. In the GE algorithm, the value of $x_i$ ($i = 1, 2, \ldots, N$) is found by eliminating its dependency on $x_k$ ($k < i$) in the matrix decomposition (or triangularization) phase and on $x_k$ ($k > i$) in the back-substitution phase.

Two linear systems are said to be equivalent provided that their solution sets are the same. Accordingly, the given set of N linear equations can be replaced by two sets of equivalent linearly independent equations with half the number of unknowns by eliminating $\frac{N}{2}$ variables each in the forward (left to right) and backward (right to left) directions. This process of replacing the given set of linearly independent equations by two sets of equivalent equations with half the number of variables is used successively in the SGE algorithm. The algorithm involves the following steps to find the solution vector x in the equation $Ax = b$ [6].

1. First, we form an augmented matrix $A|b$ (i.e., the b-vector is appended as the $(N+1)$th column) of order $N \times (N+1)$. This matrix is duplicated to form $A_0|b_0$ and $b_1|A_1$ (where the b-vector is appended to the left of the coefficient matrix A), with $A_0$ and $A_1$ being the same as the coefficient matrix A and, similarly, $b_0$ and $b_1$ being the same as the b-vector. Note that the b-vector is appended to the left or right of the coefficient matrix only for programming convenience.

2. Using the GE method, we triangularize $A_0|b_0$ in the forward direction to eliminate the subdiagonal elements in columns $1, 2, \ldots, \frac{N}{2}$, reducing its order to $\frac{N}{2} \times (\frac{N}{2} + 1)$ (ignoring the eliminated columns and the corresponding rows). Concurrently, we triangularize $b_1|A_1$ in the backward direction to eliminate the superdiagonal elements in columns $N, N-1, \ldots, \frac{N}{2}+1$, reducing the order of $b_1|A_1$ also to $\frac{N}{2} \times (\frac{N}{2} + 1)$ (again ignoring the eliminated columns and the corresponding rows). With this, $A_0|b_0$ may be treated as a new augmented matrix with columns and rows $\frac{N}{2}+1, \frac{N}{2}+2, \ldots, N$ and the modified b-vector appended to its right, and similarly $b_1|A_1$ as a new augmented matrix with columns and rows $1, 2, \ldots, \frac{N}{2}$ and the modified b-vector appended to its left.

3. We duplicate the reduced augmented matrices $A_0|b_0$ to form $A_{00}|b_{00}$ and $b_{01}|A_{01}$, and $b_1|A_1$ to form $A_{10}|b_{10}$ and $b_{11}|A_{11}$ (each of these duplicated matrices will be of the same order, $\frac{N}{2} \times (\frac{N}{2} + 1)$). We now triangularize $A_{00}|b_{00}$ and $A_{10}|b_{10}$ in the forward direction and $b_{01}|A_{01}$ and $b_{11}|A_{11}$ in the backward direction through $\frac{N}{4}$ columns using the GE method, thus reducing the order of each of these matrices to half of their original size, i.e., $\frac{N}{4} \times (\frac{N}{4} + 1)$. Note that the above four augmented matrices are reduced in parallel.

4. We continue this process of halving the size of the submatrices using GE and doubling the number of submatrices $\log N$ times (throughout this paper, $\log$ is the logarithm to the base 2 and $\ln$ is the natural logarithm), so that we end up with N submatrices, each of order $1 \times 2$. The modified b-vector part, when divided by the modified A-matrix part in parallel, gives the complete solution vector x.

The solution of $Ax = b$ using the SGE algorithm is shown in Fig. 2 in the form of a binary tree for $N = 8$.

Fig. 2. The SGE algorithm.
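The halving scheme is easy to simulate serially. The sketch below (our Python illustration, assuming N is a power of two; it mimics the data flow of Fig. 2 but none of the star-graph communication) carries a list of independent augmented systems and splits each of them at every stage:

```python
import numpy as np

def sge_solve(A, b):
    """Serial simulation of the SGE halving scheme; assumes N is a
    power of two.  Each entry of `systems` pairs an augmented matrix
    with the global indices of the unknowns it still holds."""
    N = len(b)
    systems = [(np.hstack([A.astype(float),
                           b.reshape(-1, 1).astype(float)]),
                list(range(N)))]
    while len(systems) < N:
        next_systems = []
        for M, idx in systems:
            m = len(idx)
            half = m // 2
            F, B = M.copy(), M.copy()                # duplicate the system
            # Forward-eliminate the first m/2 columns of one copy.
            for k in range(half):
                piv = k + int(np.argmax(np.abs(F[k:, k])))  # partial pivoting
                F[[k, piv]] = F[[piv, k]]
                for i in range(k + 1, m):
                    F[i, k:] -= (F[i, k] / F[k, k]) * F[k, k:]
            # Backward-eliminate the last m/2 columns of the other copy.
            for k in range(m - 1, half - 1, -1):
                piv = int(np.argmax(np.abs(B[:k + 1, k])))
                B[[k, piv]] = B[[piv, k]]
                for i in range(k):
                    B[i, :] -= (B[i, k] / B[k, k]) * B[k, :]
            # Keep the two closed half-size systems (the two children in Fig. 2).
            next_systems.append(
                (np.hstack([B[:half, :half], B[:half, m:]]), idx[:half]))
            next_systems.append(
                (np.hstack([F[half:, half:m], F[half:, m:]]), idx[half:]))
        systems = next_systems
    x = np.empty(N)
    for M, idx in systems:                           # each system is now 1 x 2
        x[idx[0]] = M[0, 1] / M[0, 0]
    return x
```

For example, with A = [[2, 1], [1, 3]] and b = [3, 4], `sge_solve` returns the solution [1, 1].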

3.2. Non-cyclic matrix distribution on the star graph

The distribution of matrix elements among the processors of a processor configuration is a very important factor for effective parallel matrix computation. The matrix distribution techniques should allow effective communication during the various matrix operations. In particular, parallel broadcasting across rows and columns should be effectively supported by these techniques. Such techniques are easily achievable for the mesh and hypercube architectures. An $N \times N$ matrix can be distributed on an $n \times n$ mesh ($n < N$) by dividing the N rows (columns) of the matrix into n row groups (column groups) such that each row group (column group) contains $\frac{N}{n}$ consecutive rows (columns). The submatrix formed by the intersection of the ith row group and the jth column group can be allotted to the processor at location $(i, j)$ of the mesh. In a hypercube of dimension n, matrix distribution can be done by partitioning the n-bit binary addresses into $2^{n/2}$ disjoint subcubes, each of dimension $\frac{n}{2}$. Thus, matrix elements can be distributed over these disjoint subcubes such that elements of the same row (column) reside in the same subcube.

Matrix distribution on the star is not as obvious as on the mesh and the hypercube. Since it is apparently unachievable to define a single distribution that allows us to use star-based broadcasting in both row and column directions, we define two non-cyclic matrix distribution techniques, one which allows efficient row broadcasting and another which allows efficient column broadcasting. Similar cyclic matrix distribution techniques were used in [3]. The proposed non-cyclic matrix distribution techniques have lower communication overhead than the cyclic distribution techniques during matrix operations and hence are employed in our algorithm.

Let $A = \{a_{i,j} \mid 1 \le i \le N,\ 1 \le j \le M\}$ be the set of elements of a general $N \times M$ matrix to be distributed on $S_n$, and V the set of processors in $S_n$. Let $m_r = N - \lfloor N/n \rfloor\, n$ and $m_c = M - \lfloor M/(n-1)! \rfloor\, (n-1)!$. First we define the following useful functions:

$$f(i) = \begin{cases} \left\lceil \dfrac{i}{\lceil N/n \rceil} \right\rceil & \text{if } i \le m_r \lceil N/n \rceil, \\[2mm] m_r + \left\lceil \dfrac{i - m_r \lceil N/n \rceil}{\lfloor N/n \rfloor} \right\rceil & \text{otherwise} \end{cases}$$

and

$$g(j) = \begin{cases} \left\lceil \dfrac{j}{\lceil M/(n-1)! \rceil} \right\rceil & \text{if } j \le m_c \lceil M/(n-1)! \rceil, \\[2mm] m_c + \left\lceil \dfrac{j - m_c \lceil M/(n-1)! \rceil}{\lfloor M/(n-1)! \rfloor} \right\rceil & \text{otherwise.} \end{cases}$$

Definition 2. The star matrix distribution is a function $SMD : A \to V$ given by $SMD(a_{i,j}) = v$, such that $v \in I_R$ and $G_{n-1}(v) = C$, where $R = f(i)$ and $C = g(j)$. Such a processor is denoted by $P_{R,C}$.

The function SMD distributes the matrix A over the $n \times (n-1)!$ processor grid formed by the set of n substars $I_1, I_2, \ldots, I_n$, each containing $(n-1)!$ processors. Using SMD, a processor $P_{R,C}$ is assigned $\lambda_R = \lceil N/n \rceil$ consecutive subrows if it belongs to any of the substars $I_1, I_2, \ldots, I_{m_r}$ (i.e., $R \le m_r$), and it is assigned $\lambda_R = \lfloor N/n \rfloor$ consecutive subrows if it belongs to any of the remaining substars $I_{m_r+1}, \ldots, I_n$ (i.e., $R > m_r$). Similarly, $P_{R,C}$ is assigned $\mu_C = \lceil M/(n-1)! \rceil$ consecutive subcolumns if it belongs to any of the first $m_c$ processor columns (i.e., $C \le m_c$), and it is assigned $\mu_C = \lfloor M/(n-1)! \rfloor$ consecutive subcolumns if it belongs to any of the remaining processor columns (i.e., $C > m_c$). So, $P_{R,C}$ is assigned a submatrix formed by $\lambda_R$ consecutive subrows and $\mu_C$ consecutive subcolumns. Fig. 3 shows an example of distributing a $6 \times 6$ matrix on an $S_4$ using the SMD function.
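A direct transcription of f and g (our Python sketch; `block_maps` is a hypothetical helper name) shows how matrix rows and columns map to processor grid coordinates, reproducing the uneven split just described:

```python
from math import ceil, factorial

def block_maps(N, M, n):
    """The maps f and g: matrix row i (1..N) -> processor row 1..n,
    matrix column j (1..M) -> processor column 1..(n-1)!."""
    cols = factorial(n - 1)
    m_r = N - (N // n) * n          # rows left over after an even floor split
    m_c = M - (M // cols) * cols    # columns likewise

    def f(i):
        big = ceil(N / n)           # first m_r processor rows get ceil(N/n) rows
        if i <= m_r * big:
            return ceil(i / big)
        return m_r + ceil((i - m_r * big) / (N // n))

    def g(j):
        big = ceil(M / cols)        # first m_c processor columns get ceil rows
        if j <= m_c * big:
            return ceil(j / big)
        return m_c + ceil((j - m_c * big) / (M // cols))

    return f, g

# For a 6 x 6 matrix on S_4: processor rows 1-2 hold two subrows each,
# rows 3-4 one each; each of the 6 processor columns holds one column.
f, g = block_maps(6, 6, 4)
assert [f(i) for i in range(1, 7)] == [1, 1, 2, 2, 3, 4]
assert [g(j) for j in range(1, 7)] == [1, 2, 3, 4, 5, 6]
```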

Definition 3. The linear array matrix distribution is a function $LMD : A \to V$ given by $LMD(a_{i,j}) = v$ such that $v \in X_{\bar R}$ and $G_{n-1}(v^{(n)}) = \bar C$, where $\bar R = f(i)$ and $\bar C = g(j)$. Such a processor is denoted by $P_{\bar R,\bar C}$.

The function LMD distributes the matrix A over the $n \times (n-1)!$ processor grid formed by the sets $X_1, X_2, \ldots, X_n$, with each $X_k$ containing $(n-1)!$ processors. Using LMD, a processor $P_{\bar R,\bar C}$ is assigned $\lambda_{\bar R} = \lceil N/n \rceil$ consecutive subrows if it belongs to any of $X_1, X_2, \ldots, X_{m_r}$ (i.e., $\bar R \le m_r$), and it is assigned $\lambda_{\bar R} = \lfloor N/n \rfloor$ consecutive subrows if it belongs to any of $X_{m_r+1}, \ldots, X_n$ (i.e., $\bar R > m_r$). Similarly, $P_{\bar R,\bar C}$ is assigned $\mu_{\bar C} = \lceil M/(n-1)! \rceil$ consecutive subcolumns if it belongs to any of the first $m_c$ processor columns (i.e., $\bar C \le m_c$), and it is assigned $\mu_{\bar C} = \lfloor M/(n-1)! \rfloor$ consecutive subcolumns if it belongs to any of the remaining processor columns (i.e., $\bar C > m_c$). So, $P_{\bar R,\bar C}$ is assigned a submatrix formed by $\lambda_{\bar R}$ consecutive subrows and $\mu_{\bar C}$ consecutive subcolumns. Fig. 4 shows an example of distributing a $6 \times 6$ matrix on an $S_4$ using the LMD function. It can be observed that in LMD, each matrix column resides in a set of n processors connected in a linear array.

Fig. 3. Matrix distribution using the SMD on $S_4$.

Fig. 4. Matrix distribution using the LMD on $S_4$.

The following observations about SMD and LMD can be easily verified:

1. Using SMD, elements of the same row are stored in the same $S_{n-1}$. Therefore, the farthest distance between any pair of processors holding the elements of the same row in SMD is $\lfloor 3(n-2)/2 \rfloor$.

2. Applying Proposition 3 to LMD, we observe that the column elements in LMD are connected in a linear array. So, the farthest distance between any two processors holding the elements of the same column is $n-1$.

3. Using Proposition 1, we see that any processor $u = (p_1 p_2 \ldots p_n)$, denoted by $P_{R,C}$ in SMD, is connected to the processor $u^{(n)}$, denoted by $P_{\bar R,\bar C}$ in LMD, through a link of type $g_n$, where $R = p_n$, $C = G_{n-1}(u)$, $\bar R = p_1$, and $\bar C = G_{n-1}(u^{(n)})$.

4. In LMD, a processor u denoted by $P_{\bar R,\bar C}$ is directly connected to a processor v denoted by $P_{\bar R+1,\bar C}$ with a link of type $g_{\alpha(\bar R+1)}$ in $S_n$, where $\alpha(i)$ is the position occupied by the symbol i in u. Similarly, a processor $P_{\bar R,\bar C}$ is directly connected to a processor $P_{\bar R-1,\bar C}$ with a link of type $g_{\alpha(\bar R-1)}$ in $S_n$.

Since the function SMD distributes matrix rows over n disjoint $(n-1)$-dimensional substars, simultaneous subcolumn broadcasts are possible in SMD. In LMD, the matrix columns are distributed over $(n-1)!$ disjoint sets of processors, where each set of processors forms a linear array. Therefore, subrow broadcasts can be performed simultaneously in LMD. So we see that LMD allows for efficient row broadcasts and SMD allows for efficient column broadcasts. Also, switching between SMD and LMD requires a single submatrix exchange step. A group of consecutive rows or consecutive columns resides in the same processor in SMD and LMD. This reduces the communication overhead involved in performing matrix operations on consecutive rows or columns, and hence we employ SMD and LMD in our algorithm.

3.3. Our algorithm

In this section, we present our algorithm for solving a linear system of equations on the star graph. Our algorithm is based on the SGE algorithm and provides greater concurrency than the AD algorithm. Furthermore, our algorithm employs the communication-efficient SMD and LMD distribution techniques. These two features of our algorithm greatly enhance its performance and make it an attractive method for the parallel solution of linear equations on the star graph.

Initially, the $N \times 1$ known vector b of the linear system $Ax = b$ is appended to the $N \times N$ coefficient matrix A to obtain the $N \times (N+1)$ augmented matrix $E = A|b = \{a_{i,j} \mid 1 \le i \le N,\ 1 \le j \le N+1\}$. As a first step toward developing our parallel algorithm on $S_n$, we need to identify the various tasks performed on E during the course of our algorithm. We give below the steps identifying the tasks performed during our algorithm.


As shown above, our algorithm requires $\log N$ stages to solve a system of N linear equations. The current stage of the algorithm is denoted by s, $1 \le s \le \log N$. At stage s, there are $2^{s-1}$ augmented submatrices, each of size $\frac{N}{2^{s-1}} \times (\frac{N}{2^{s-1}} + 1)$, which have to be reduced. In the above task listing, the submatrix currently being reduced in the forward direction is denoted by $E^s_{num}$ and the submatrix currently being reduced in the backward direction is denoted by $E'^s_{num}$, where $E^s_{num} = \{a_{i,j} \mid b < i < e,\ b < j \le e\}$ and $E'^s_{num} = \{a'_{i,j} \mid b < i < e,\ b < j \le e\}$. The variables b and e denote the beginning and ending matrix rows (and columns) of $E^s_{num}$ (and $E'^s_{num}$), respectively. The reduction of $E^s_{num}$ and $E'^s_{num}$ during stage s of our algorithm requires $N/2^s$ steps. The forward tasks, backward tasks, duplication tasks, and division tasks are represented by F, B, D, and P, respectively.

After identifying the tasks as above, we need to distribute them among the processors of $S_n$. The augmented matrix E is distributed on $S_n$ using the non-cyclic matrix distribution techniques presented above. Our algorithm supports partial pivoting to ensure numerical stability. In partial pivoting, the set of processors holding the pivot subcolumns finds the maximum element in the subcolumn and exchanges subrows such that the maximum element becomes the pivot element.

At any step in stage s, a processor reducing $E^s_{num}$ and $E'^s_{num}$ can be in any one of the following states:

* broadcasting a pivot subrow or multiplier subcolumn,
* eliminating a submatrix,
* involved in determining the new pivot row,
* exchanging submatrices,
* waiting for a pivot subrow and/or a multiplier subcolumn, or
* idle, holding final matrix elements.

Since similar steps are executed for both forward and backward elimination, the states given above apply to both the backward and forward elimination steps; i.e., corresponding to each state mentioned above, we can have one forward elimination state and one backward elimination state.

Now, we specify the tasks executed by each processor of $S_n$ during the course of our algorithm. The algorithm presented below outlines the sequence of operations performed by the processor $u = (p_1 p_2 \ldots p_n)$ of $S_n$. The matrix E is initially distributed using the LMD. In stage s, we denote the $\frac{N}{2^{s-1}} \times (\frac{N}{2^{s-1}} + 1)$ augmented submatrix which is used for forward elimination, and which contains the $\lambda_R \times \mu_C$ submatrix allotted to u in SMD, by $E^s_{num}$, where $0 \le num < 2^{s-1}$. Similarly, we denote the forward elimination $\frac{N}{2^{s-1}} \times (\frac{N}{2^{s-1}} + 1)$ augmented submatrix which contains the $\lambda_{\bar R} \times \mu_{\bar C}$ submatrix allotted to u in LMD, by $\bar E^s_{num}$, where $0 \le num < 2^{s-1}$. The copies of $E^s_{num}$ and $\bar E^s_{num}$ which are used for backward elimination are denoted by $E'^s_{num}$ and $\bar E'^s_{num}$, respectively. For the processor u, $\lambda_R$ ($\mu_C$) and $\lambda_{\bar R}$ ($\mu_{\bar C}$) denote the number of rows (columns) in SMD and LMD, respectively. Furthermore, $b_R$ ($b_C$) and $b_{\bar R}$ ($b_{\bar C}$), respectively, denote the starting row (column) in SMD and LMD of the submatrix assigned to the processor u. The procedures broadcast_linear_array and broadcast_star perform broadcasting in the linear array and in the star graph, respectively. The procedure exchange_submatrices is used to exchange submatrices between $P_{R,C}$ and $P_{\bar R,\bar C}$ (i.e., between SMD and LMD) and vice versa.


The Algorithm


It is to be noted that the number of stages in the above algorithm executed by each processor of $S_n$ is $\lceil 1 + \log (n-1)! \rceil$, and not $\log N$ as given in the task listing. The reason for this is as follows: in both the non-cyclic matrix distribution techniques, $S_n$ is viewed as a grid with n processor rows and $(n-1)!$ processor columns. So, at the end of $\lceil 1 + \log (n-1)! \rceil$ stages of our algorithm, the submatrices become small enough that all computations on each of them can be performed sequentially on a single processor. Since this sequential execution would not exploit the inherent concurrency of our algorithm, we employ the sequential GE algorithm on these small submatrices during the remaining stages and obtain the final solution.

In the task listing, the forward partial pivoting task was denoted by $\langle F^s_{k,k}(num) \rangle$ and the backward partial pivoting task by $\langle B^s_{k,k}(num) \rangle$. Since both of these are executed concurrently, they are together denoted by a single compound task $\langle F^s_{k,k}(num) + B^s_{k,k}(num) \rangle$ in the above algorithm listed for the processor u. The task $\langle F^s_{k,k}(num) + B^s_{k,k}(num) \rangle$, which performs partial pivoting in our algorithm, is outlined below.


The set of processors holding the elements of the forward elimination pivot subcolumn performs the forward partial pivoting. These processors are present in a linear array (because of the LMD distribution) and find the row, $i_{max1}$, of the maximum element of the subcolumn. The set of processors holding the pivot subrow and those holding the subrow $i_{max1}$ then exchange the relevant subrows. In a similar fashion, the set of processors holding the elements of the backward elimination pivot subcolumn concurrently performs backward partial pivoting. The procedures swap_subrows and interchange_subrows are used to swap rows within the same processor and between two different processors in a linear array, respectively.
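A serial sketch of the compound task follows (our Python illustration; the choice of the mirrored column $m-1-k$ for the backward pivot is our reading of the symmetric forward/backward scheme, and on $S_n$ the max-reduction and subrow exchange run over the linear arrays holding the two pivot subcolumns):

```python
import numpy as np

def compound_pivot(F, B, k):
    """<F_kk(num) + B_kk(num)>: forward pivoting on copy F at column k
    and backward pivoting on copy B at the mirrored column, which the
    parallel algorithm performs concurrently on disjoint processors."""
    m = F.shape[0]
    imax1 = k + int(np.argmax(np.abs(F[k:, k])))    # forward pivot row
    F[[k, imax1]] = F[[imax1, k]]                   # interchange_subrows
    kb = m - 1 - k                                  # backward pivot column
    imax2 = int(np.argmax(np.abs(B[:kb + 1, kb])))
    B[[kb, imax2]] = B[[imax2, kb]]
```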

4. Performance analysis

In this section, we evaluate the performance of our star-based algorithm presented above. We compare it with the AD algorithm using the time estimation model employed in [3]. In the AD algorithm, the matrix is distributed using the cyclic SCMD and LCMD distribution techniques [3], while in our algorithm, the matrix is distributed using the non-cyclic SMD and LMD distribution techniques. The matrix size is $N \times (N+1)$. In all of the above techniques, the P processors in $S_n$ are arranged as an $R \times C$ grid, where $P = n!$, $R = n$, and $C = (n-1)!$.

In any GE-based algorithm using a broadcasting approach, overlapping of communication and computation can be done at two levels: intrastep and interstep. In both the AD algorithm and our algorithm, overlapping within each step does not reduce the overall execution time. However, interstep overlapping reduces the overall execution time. The effect of interstep overlapping in each algorithm is described in the analysis of the respective algorithm.

4.1. Analysis of the AD algorithm

As mentioned before, the AD algorithm consists of two phases, namely, the matrix-decomposition phase and the back-substitution phase. The estimated execution time of the AD algorithm is the sum of the estimated execution times of the individual phases. The performance analysis of both phases is given below.

4.1.1. Matrix-decomposition phase

The matrix-decomposition phase of the AD algorithm requires $N-1$ factorization steps to decompose a matrix A. In the kth step, the following tasks are performed [3].

* $\langle T_{k,k} \rangle$.
* C simultaneous broadcasts in the $\omega_c$ sets ($1 \le c \le C$) with message length equal to $\frac{N+1}{C}$, where $\omega_c$ denotes the set of processors in the $R \times C$ grid along the cth processor column.
* $\frac{N}{R}$ sequential $\langle T_{i,k} \rangle$.
* R simultaneous broadcasts in the $\rho_r$ sets ($1 \le r \le R$) with message length equal to $\frac{N}{R}$, where $\rho_r$ denotes the set of processors in the $R \times C$ grid along the rth processor row.
* $\frac{N(N+1)}{P}$ sequential $\langle T_{i,j} \rangle$.

R. Chandra, C.S.R. Murthy / J. Parallel Distrib. Comput. 63 (2003) 465–480 477

Page 14: A faster algorithm for solving linear algebraic equations on the star graph

Thus, the maximum execution time of a single factorization step, $t_m$, is given by

$$t_m = \phi(T_{k,k}) + \phi(\omega_c) + \frac{N}{R}\,\phi(T_{i,k}) + \phi(\rho_r) + \frac{N(N+1)}{P}\,\phi(T_{i,j}),$$

where $\phi(T)$ indicates the time taken to execute the task T.

Since locating the kth pivot row and computing the kth multipliers column can start as soon as the $(k-1)$th multipliers column is received, a new step can start every

$$t_b = \phi(T_{k,k}) + \phi(\omega_c) + \frac{N}{R}\,\phi(T_{i,k}) + \phi(\beta_r) + \frac{N(N+1)}{P}\,\phi(T_{i,j}),$$

where $\phi(\beta_r)$ is the time passed until the set of processors holding the next multipliers column receives the previous multipliers column, $0 \le \phi(\beta_r) \le \phi(\rho_r)$. So, the estimated execution time $t_{decomp}$ for the triangularization phase is

$$t_{decomp} = (N-2)\,t_b + t_m.$$

4.1.2. Back-substitution phase

The back-substitution phase of the AD algorithm comprises N steps. In the kth step, the following tasks are performed:

* R simultaneous sends of c-vector elements from the processors containing the $(k+1)$th matrix column elements to the processors containing the kth matrix column elements, with message length equal to $\frac{N}{R}$. We denote this task by $\rho'$.
* $\langle T'_{k,k} \rangle$.
* A single broadcast of $x_k$ along the processor set containing the kth matrix column elements, with message length equal to 1. We denote this task by $\omega'$.
* $\frac{N}{R}$ sequential $\langle T'_{i,k} \rangle$.

So, the maximum execution time of a single step of the back-substitution phase, $t'_m$, is given by

$$t'_m = \phi(\rho') + \phi(T'_{k,k}) + \phi(\omega') + \frac{N}{R}\,\phi(T'_{i,k}).$$

Since computing the next variable $x_{N-k-1}$ can be started as soon as the present variable $x_{N-k}$'s value has updated $c_{N-k-1}$, a new step can start every

$$t'_b = \phi(\rho') + \phi(T'_{k,k}) + \phi(\beta') + \frac{N}{R}\,\phi(T'_{i,k}),$$

where $\phi(\beta')$ is the time taken for the value of $x_{N-k}$ to reach the processor containing $c_{N-k-1}$, $0 \le \phi(\beta') \le \phi(\omega')$. Hence, the estimated time $t_{back}$ of the back-substitution phase is

$$t_{back} = (N-1)\,t'_b + t'_m.$$

Therefore, the total estimated execution time of the AD algorithm is

$$t_{AD} = t_{decomp} + t_{back}.$$

4.2. Analysis of our algorithm

As mentioned before, our algorithm consists of $\lceil 1 + \log (n-1)! \rceil$ stages. In stage s, there are $N/2^s$ factorization steps. At step k of stage s, the following tasks are performed.

* $\langle F^s_{k,k}(num) + B^s_{k,k}(num) \rangle$.
* Two C simultaneous broadcasts, one in the forward direction in the $\omega_c$ sets ($1 \le c \le C$) and the other in the backward direction in the $\omega'_c$ sets ($1 \le c \le C$), with message length equal to $\frac{N+1}{C}$ in each, where $\omega_c$ and $\omega'_c$ denote the set of processors along the cth column in the $R \times C$ processor grid.
* Concurrent execution of $\frac{N}{R}$ sequential $\langle F^s_{i,k}(num) \rangle$ and $\frac{N}{R}$ sequential $\langle B^s_{i,k}(num) \rangle$.
* Two R simultaneous broadcasts, one in the forward direction in the $\rho_r$ sets ($1 \le r \le R$) and the other in the backward direction in the $\rho'_r$ sets ($1 \le r \le R$), with message length equal to $\frac{N}{R}$, where $\rho_r$ and $\rho'_r$ denote the set of processors along the rth row in the $R \times C$ processor grid.
* Sequential execution of $\frac{N(N+1)}{P}$ sequential $\langle F^s_{i,j}(num) \rangle$ and $\frac{N(N+1)}{P}$ sequential $\langle B^s_{i,j}(num) \rangle$.

Hence, the maximum execution time for one step of stage s in our algorithm is

$$t^s_m = \phi(F^s_{k,k}(num) + B^s_{k,k}(num)) + \phi(\omega_c + \omega'_c) + \frac{N}{R}\,\phi(F^s_{i,k}(num)) + \phi(\rho_r + \rho'_r) + \frac{2N(N+1)}{P}\,\phi(F^s_{i,j}(num)).$$

Since locating the kth pivot row and computing the kth multipliers column can be started as soon as the $(k-1)$th multipliers are received, a new step can start every

$$t^s_b = \phi(F^s_{k,k}(num) + B^s_{k,k}(num)) + \phi(\omega_c + \omega'_c) + \frac{N}{R}\,\phi(F^s_{i,k}(num)) + \phi_k(\beta_r + \beta'_r) + \frac{2N(N+1)}{P}\,\phi(F^s_{i,j}(num)),$$

where $\phi_k(\beta_r + \beta'_r)$ is the time passed until the set of processors holding the next multiplier column receives the previous multiplier column values. Since in SMD and LMD consecutive columns can reside in the same processor, $\phi_k(\beta_r + \beta'_r)$ is zero in such cases. It is non-zero only when consecutive elements reside in different processors. Consequently, interstep overlap is much higher in our algorithm than in the AD algorithm. This reduces the communication overhead in our algorithm and hence improves its performance.

Also, at the end of the $\lceil 1 + \log (n-1)! \rceil$ stages, sequential GE is performed on the reduced $\frac{N}{C} \times \frac{N}{C}$ submatrices. Let the time taken by the sequential GE be $t_{seq}$. Hence, the total estimated time taken by our algorithm is

$$t_{OUR} = \sum_{s=1}^{\lceil 1 + \log (n-1)! \rceil} \left( t^s_m + \sum_{k=2}^{N/2^s} t^s_b \right) + t_{seq}.$$

4.3. Comparison of the two algorithms

In estimating the execution time of our algorithm, we use the same communication model employed in [3]. This facilitates a ready and direct comparison of our algorithm with that of [3]. In this model, the time taken for communication of a message of length M is $t_{comm}(M) = \beta + \alpha M$, where $\beta$ is the message latency and $\alpha$ is the unit transmission cost. Also, the time taken for broadcasting a message of length M in a graph of diameter d is $d(\beta + \alpha M)$. The time taken for two simultaneous broadcasts, each with a message of length M, is taken to be $2d(\beta + \alpha M)$. The time taken for each computation, $t_{comp}$, is taken to be 0.25 ns, and the parameters $\alpha$ and $\beta$ are set to 8 ns and 30 μs, respectively.
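The model is simple enough to transcribe directly (our Python sketch; parameter values as stated above, with the latency read as 30 μs):

```python
ALPHA = 8e-9      # unit transmission cost per word (8 ns)
BETA = 30e-6      # message latency (30 microseconds, as read above)
T_COMP = 0.25e-9  # time per computational step (0.25 ns)

def t_comm(M):
    """Time (seconds) to communicate one message of length M."""
    return BETA + ALPHA * M

def t_broadcast(M, d, k=1):
    """k simultaneous broadcasts of length-M messages in a graph of
    diameter d; the model charges k times the single-broadcast cost."""
    return k * d * t_comm(M)
```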

Using the above model, we plot the estimated execution times for the AD algorithm and our algorithm on $S_n$ in Figs. 5–8 using different values of n, the dimension of the star. The x-axis in the plots gives the problem size, i.e., the number of equations in the linear system, and the y-axis gives the estimated execution time in nanoseconds. The figures suggest that our algorithm performs better than the AD algorithm for a range of problem sizes. This improvement in performance is due to the higher concurrency of our algorithm as compared to the AD algorithm and the lower communication overheads of our non-cyclic matrix distribution methods, the SMD and the LMD, as compared to the cyclic matrix distribution techniques, the SCMD and the LCMD.

Fig. 5. Estimated execution times for n = 5.

Fig. 6. Estimated execution times for n = 6.

Fig. 7. Estimated execution times for n = 7.

Fig. 8. Estimated execution times for n = 8.


5. Conclusions

In this paper, we presented a parallel algorithm for solving a system of linear equations on $S_n$. To this end, we developed SMD and LMD, non-cyclic matrix distribution techniques on the star graph. We evaluated our algorithm against the AD algorithm presented in [3] and demonstrated the superior performance of our algorithm. The proposed algorithm performs better for two main reasons, namely, the increased concurrency of our algorithm and the lower communication overhead of the non-cyclic matrix distributions. However, the algorithm has a few shortcomings. It takes twice the memory of the AD algorithm, and it supports only partial pivoting. Since complete pivoting is rarely used, the latter shortcoming is not a problem in practice.

References

[1] S.B. Akers, D. Harel, B. Krishnamurthy, The star graph: an attractive alternative to the n-cube, Proceedings of the International Conference on Parallel Processing, 1987, pp. 393–400.
[2] S.B. Akers, B. Krishnamurthy, A group-theoretic model for symmetric interconnection networks, IEEE Trans. Comput. 38 (4) (1989) 555–566.
[3] A. Al-Ayyoub, K. Day, Matrix decomposition on the star graph, IEEE Trans. Parallel Distrib. Systems 8 (8) (1997) 803–812.
[4] N. Bagherzadeh, M. Dowd, N. Nassif, Embedding an arbitrary binary tree into the star graph, IEEE Trans. Comput. 45 (4) (1996) 475–481.
[5] N. Bagherzadeh, N. Nassif, S. Latifi, A routing and broadcasting scheme in faulty star graphs, IEEE Trans. Comput. 42 (11) (1993) 1398–1403.
[6] K.N.B. Balasubramanya Murthy, C. Siva Ram Murthy, Gaussian elimination based algorithm for solving linear equations on mesh connected processors, IEE Proc. Comput. Digit. Tech. 143 (4) (1996) 407–412.
[7] S. Bettayeb, B. Cong, M. Girou, I.H. Sudborough, Embedding star networks into hypercubes, IEEE Trans. Comput. 45 (2) (1996) 186–194.
[8] K. Day, A. Tripathi, A comparative study of topological properties of hypercubes and star graphs, IEEE Trans. Parallel Distrib. Systems 5 (1) (1994) 31–38.
[9] J.J. Dongarra, F.G. Gustavson, A. Karp, Implementing linear algebra algorithms for dense matrices on a vector pipeline machine, SIAM Rev. 26 (1) (1984) 91–112.
[10] P. Fragopoulou, S.G. Akl, A parallel algorithm for computing Fourier transforms on the star graph, IEEE Trans. Parallel Distrib. Systems 5 (5) (1994) 525–531.
[11] K. Gallivan, R.J. Plemmons, A.H. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Rev. 32 (1) (1990) 54–135.
[12] L. Gargano, U. Vaccaro, A. Vozella, Fault tolerant routing in the star and pancake interconnection networks, Inform. Process. Lett. 45 (6) (1993) 315–320.
[13] D. Heller, A survey of parallel algorithms in numerical linear algebra, SIAM Rev. 20 (4) (1978) 740–777.
[14] S. Jang-Ping, L. Wen-Hwa, C. Tzung-Shi, A broadcasting algorithm in the star graph interconnection networks, Inform. Process. Lett. 48 (5) (1993) 237–241.
[15] I.L. Jung, J.H. Chang, Embedding complete binary trees in star graphs, J. Korea Inform. Sci. Soc. 21 (2) (1994) 407–415.
[16] J.S. Jwo, S. Lakshmivarahan, S.K. Dhall, Embedding of cycles and grids in star graphs, J. Circuits Systems Comput. 1 (1) (1991) 43–74.
[17] S. Lakshmivarahan, S.K. Dhall, Analysis and Design of Parallel Algorithms—Arithmetic and Matrix Problems, McGraw-Hill, New York, 1990.
[18] S. Latifi, N. Bagherzadeh, Incomplete star: an incrementally scalable network based on the star graph, IEEE Trans. Parallel Distrib. Systems 5 (1) (1994) 97–102.
[19] A. Menn, A.K. Somani, An efficient sorting algorithm for the star graph interconnection network, Proceedings of the International Conference on Parallel Processing 1990, Urbana-Champaign, IL, USA, August 1990, Vol. 3, pp. 1–8.
[20] S.T. Obenaus, T.H. Szymanski, Embedding of star graphs into optical meshes without bends, J. Parallel Distrib. Comput. 44 (2) (1997) 97–107.
[21] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C—The Art of Scientific Computing, Cambridge University Press, Cambridge, 1992.
[22] S. Rajasekaran, D.S.L. Wei, Selection, routing and sorting on the star graph, J. Parallel Distrib. Comput. 41 (1) (1997) 225–234.
[23] S. Ranka, J.C. Wang, N. Yeh, Embedding meshes on the star graph, J. Parallel Distrib. Comput. 19 (2) (1993) 131–135.
[24] Y.C. Tseng, J.P. Sheu, Toward optimal broadcast in a star graph using multiple spanning trees, IEEE Trans. Comput. 46 (5) (1997) 593–599.
