CIRCUITS SYSTEMS SIGNAL PROCESSING VOL. 15, NO. 4, 1996, PP. 481-504

A PARALLEL ALGORITHM FOR EVALUATING GENERAL LINEAR RECURRENCE EQUATIONS*

Mi Lu,1 Xiangzhen Qiao,2 and Guanrong Chen3

Abstract. A block parallel partitioning method for evaluating general linear mth order recurrence equations is presented and applied to solve for the eigenvalues of symmetric tridiagonal matrices. The algorithm is based on partitioning, in a way that ensures load balance during computation. This method is applicable to both shared memory and distributed memory MIMD systems. The algorithm achieves a speedup of O(p) on a parallel computer with p-fold parallelism, which is linear and is greater than the existing results, and the data communication between processors is less than that required for other methods. For solving symmetric tridiagonal eigenvalue problems, our method is faster than the best previous results. The results were tested and evaluated on an MIMD machine, and were within 79% to 96% of the predicted performance for the second order linear recurrence problem.

1. Introduction

The general linear recurrence evaluation is a fundamental numerical computation that can be applied in many important scientific applications: partial differential equations, tridiagonal linear systems, polynomial evaluations, eigenvalue (eigenvector) problems, and digital signal processing. Usually these problems are computationally intensive and require high-speed computations. Advances in high-speed parallel computers have allowed the solution of very large scientific problems that were not possible a few years ago. This trend will continue as more sophisticated scientific models and more efficient numerical methods are developed.

For the sake of simplicity, we use the following definitions and assumptions throughout this paper:

* Received May 14, 1991; revised September 14, 1994. Supported by the Texas Advanced Technology Program under Grant No. 999903-165 and the National Science Foundation under Grant No. MIP8809328.

1 Department of Electrical Engineering, Texas A & M University, College Station, Texas 77843. 2 The Institute of Computing Technology, Chinese Academy of Science, Beijing 100080 China. 3 Department of Electrical Engineering, University of Houston, Houston, Texas 77204.


p              number of processors
n              number of items in the recurrence equation
t_p            execution time (in terms of arithmetic operations) using p processors
S_p = t_1/t_p  speedup of a parallel algorithm using p processors, over a uniprocessor algorithm
E_p = S_p/p    efficiency of a parallel computation using p processors.

The general linear first order recurrence equation can be expressed as the determination of the sequence of x_i:

x_1 = b_1,
x_i = a_{i-1} * x_{i-1} + b_i,    i = 2, 3, ..., n;    (1)

where a_i and b_i (i = 1, 2, ..., n) are known. The parallel algorithm for solving equation (1) has been well investigated by many researchers [7], [9]. At present there are mainly three methods for solving equation (1) on vector and parallel computers: the first is recursive doubling [9], the second is cyclic reduction [3], [8], and the third is segmentation [6]. The recursive doubling method splits the computation of equation (1) into two equally complex subfunctions whose evaluations can then be performed simultaneously on two separate processors. Successive splitting of each of these subfunctions spreads the computation over more processors. For a given problem of size n, its speedup is O(n / log n) on a parallel computer with n-fold parallelism. The cyclic reduction method is efficient when used on vector computers; however, it has the following disadvantages when used on an MIMD machine: the parallelism changes during computation, and overhead is produced because of heavy data transformation. The segmentation method [6] is a divide-and-conquer method, which allows high parallelism to be achieved. However, the disadvantage of segmentation is the high overhead of data transformation. Other related work can be found in [5], [10]-[12], and [14].
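As a point of reference for the methods discussed below, equation (1) can be evaluated sequentially in O(n) operations; a minimal sketch (our own 0-based indexing and naming, not code from the paper):

```python
def first_order_recurrence(a, b):
    """Sequentially evaluate x_1 = b_1, x_i = a_{i-1} * x_{i-1} + b_i.

    0-based lists: b[0..n-1] holds b_1..b_n, and a[i-1] is the coefficient
    that multiplies the previous term when forming x[i].
    """
    x = [b[0]]
    for i in range(1, len(b)):
        x.append(a[i - 1] * x[i - 1] + b[i])
    return x
```

Each x_i depends on x_{i-1}, which is why the loop cannot be parallelized naively; the methods surveyed above all restructure this dependence chain.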

Motivated by the versatility and popularity of the divide-and-conquer technique, this paper develops a more powerful algorithm that combines the advantages of the segmentation and cyclic reduction methods for solving general linear mth order recurrence equations. For any n and p, the speedup is a linear function of p. For a general linear first order recurrence evaluation S_p = 0.4*p, and for the general linear second order recurrence problem S_p = (2/7)*p. The principle exploited in our research was applied and tested on the Sequent Balance parallel processing system. The test results are within 62% to 93% of the predicted performance for the first order recurrence equation and within 79% to 96% for the second order recurrence equation.

This new method is used to solve symmetric tridiagonal eigenvalue problems. There are some parallel methods for solving symmetric tridiagonal eigenvalue problems. Barlow and Evans [1] proposed a parallel bisection method, where


the speedup is dependent upon the number of nonempty intervals available at each stage and bounded by the maximum number of eigenvalues. Combining the advantages of the multisection method [2] and the parallel bisection method, Evans [4] developed a parallel multisection method. If p processors and m domains are given, l = p/m processors are allocated per domain. The speedup of his method lies between m and m * log_2(l + 1). Evans [4] has also adapted the recursive doubling method [9] to the parallel computation of the Sturm sequence. The speedup is about p/(4 * log_2 n). Here we apply the new parallel method (with linear speedup) to solve the symmetric tridiagonal eigenvalue problem. The speedup of our new algorithm is 0.4*p. The test results are within 62% to 98% of the predicted performance.

The rest of this paper is organized as follows. In Section 2 the formal development of the new parallel algorithm is given. In Sections 2.2 and 2.3, the new method is applied, respectively, to the first and second order recurrence problems. A complexity analysis of the algorithm is given in Section 2.4. Its application to symmetric tridiagonal eigenvalue problems is addressed in Section 3. Experimental results are presented in Section 4, and the paper concludes with Section 5.

2. The new parallel algorithm

We are given a general linear mth order recurrence equation:

x_1 = b_1,    x_0 = b_0, ..., x_{2-m} = b_{2-m},

x_i = a_{i-1,1} * x_{i-1} + a_{i-1,2} * x_{i-2} + ... + a_{i-1,m} * x_{i-m} + b_i,    i = 2, 3, ..., n.    (2)

Let

X_i = [x_i, x_{i-1}, ..., x_{i-m+1}]^T,

          [ a_{i-1,1}  a_{i-1,2}  ...  a_{i-1,m-1}  a_{i-1,m} ]
          [     1          0      ...       0           0     ]
O_{i-1} = [     0          1      ...       0           0     ]
          [    ...        ...     ...      ...         ...    ]
          [     0          0      ...       1           0     ]

B_i = [b_i, 0, ..., 0]^T,    i = 2, ..., n,


and

B_1 = [b_1, b_0, ..., b_{2-m}]^T.

By using the vector-matrix form, we have

X_1 = B_1,    (3)
X_i = O_{i-1} * X_{i-1} + B_i;    i = 2, ..., n.

Our new parallel method for solving equation (1) is a combination of the cyclic reduction and segmentation methods. The basic idea of the cyclic reduction method is to combine adjacent terms of the recurrence so as to obtain a relation between every other term in the sequence; that is, to relate x_j, say, to x_{j-2}. This relation is also a linear first order recurrence, although the coefficients are different and the relation is between alternate terms. Consequently, the process can be repeated to obtain a recurrence relating every fourth term, every eighth term, and so on. When a recurrence is obtained that relates every nth term (i.e., after log_2 n levels of reduction), the value at each point in the sequence is related only to values outside the range that are known or equal to zero; hence, the solution has been found.
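One level of this combining step can be sketched for the first order case; here we use our own 0-based indexing, with a[i] multiplying x[i-1] (an illustration of the idea, not the authors' code):

```python
def reduce_once(a, b):
    """One cyclic-reduction level for x[0] = b[0], x[i] = a[i]*x[i-1] + b[i].

    Returns (a2, b2) such that x[i] = a2[i]*x[i-2] + b2[i] for i >= 2 and
    x[1] = b2[1]; every entry is independent of the others, so one level
    can be formed in parallel.
    """
    n = len(b)
    a2, b2 = [0.0] * n, list(b)
    if n > 1:
        b2[1] = a[1] * b[0] + b[1]          # x[1] becomes fully known
    for i in range(2, n):
        a2[i] = a[i] * a[i - 1]             # combine two multipliers
        b2[i] = a[i] * b[i - 1] + b[i]      # fold in the skipped b-term
    return a2, b2
```

Repeating this log_2 n times eliminates the dependence chain entirely.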

Next we describe the parallel algorithm for solving the general linear mth order recurrence problem.

2.1. New parallel method

For solving equations (2) on an MIMD machine with parallelism p, we divide {X_i}, i = 1, 2, ..., n, into p groups, each with k = n/p vectors. For the sake of simplicity and without loss of generality, we assume that k is an integer. This is the segmentation approach. Now equation (2) is divided into p groups, each with k equations. The lth group consists of the following equations:

X_{l*k+j+1} = O_{l*k+j} * X_{l*k+j} + B_{l*k+j+1},

where 0 ≤ l ≤ p - 1 and 1 ≤ j ≤ k. Instead of combining adjacent terms, here we try to relate every term to the first in the same group,

and

X_{l*k+2} = O_{l*k+1} * X_{l*k+1} + B_{l*k+2},

X_{l*k+3} = O_{l*k+2} * X_{l*k+2} + B_{l*k+3}
          = O_{l*k+2} * (O_{l*k+1} * X_{l*k+1} + B_{l*k+2}) + B_{l*k+3}
          = O_{l*k+2} * O_{l*k+1} * X_{l*k+1} + (O_{l*k+2} * B_{l*k+2} + B_{l*k+3})
          = O^{(1)}_{l*k+1} * X_{l*k+1} + B^{(1)}_{l*k+3},


where

O^{(1)}_{l*k+1} = O_{l*k+2} * O_{l*k+1},
B^{(1)}_{l*k+3} = O_{l*k+2} * B_{l*k+2} + B_{l*k+3}.

Hence,

X_{l*k+4} = O_{l*k+3} * X_{l*k+3} + B_{l*k+4}
          = O_{l*k+3} * (O^{(1)}_{l*k+1} * X_{l*k+1} + B^{(1)}_{l*k+3}) + B_{l*k+4}
          = O_{l*k+3} * O^{(1)}_{l*k+1} * X_{l*k+1} + (O_{l*k+3} * B^{(1)}_{l*k+3} + B_{l*k+4})
          = O^{(2)}_{l*k+1} * X_{l*k+1} + B^{(2)}_{l*k+4},

where

O^{(2)}_{l*k+1} = O_{l*k+3} * O^{(1)}_{l*k+1},
B^{(2)}_{l*k+4} = O_{l*k+3} * B^{(1)}_{l*k+3} + B_{l*k+4}.

Using the method of induction, we have

X_{l*k+j} = O_{l*k+j-1} * X_{l*k+j-1} + B_{l*k+j}
          = O_{l*k+j-1} * (O^{(j-3)}_{l*k+1} * X_{l*k+1} + B^{(j-3)}_{l*k+j-1}) + B_{l*k+j}
          = O_{l*k+j-1} * O^{(j-3)}_{l*k+1} * X_{l*k+1} + (O_{l*k+j-1} * B^{(j-3)}_{l*k+j-1} + B_{l*k+j})
          = O^{(j-2)}_{l*k+1} * X_{l*k+1} + B^{(j-2)}_{l*k+j},    j = 2, 3, ..., k + 1.

Assume O^{(0)}_{l*k+1} = O_{l*k+1}, B^{(0)}_{l*k+j} = B_{l*k+j}, O^{(-1)}_{l*k+1} = I, and B^{(-1)}_{l*k+j} = 0. We can obtain the following relations:

X_{l*k+j} = O^{(j-2)}_{l*k+1} * X_{l*k+1} + B^{(j-2)}_{l*k+j},    (4)

where

O^{(j-2)}_{l*k+1} = O_{l*k+j-1} * O^{(j-3)}_{l*k+1},
B^{(j-2)}_{l*k+j} = O_{l*k+j-1} * B^{(j-3)}_{l*k+j-1} + B_{l*k+j},    (5)

0 ≤ l ≤ p - 1,    2 ≤ j ≤ k + 1,

and equation (3) becomes a smaller problem that needs fewer recurrent computations. Let j = k + 1; from equation (4) we can get the following equations:

X_{(l+1)*k+1} = O^{(k-1)}_{l*k+1} * X_{l*k+1} + B^{(k-1)}_{(l+1)*k+1},    0 ≤ l ≤ p - 2,    (6)

where the coefficients are obtained from equation (5). This is also a linear recurrence problem, and the computation is performed on different processors. It can be calculated by using the traditional cyclic reduction method. When the X_{l*k+1}'s are calculated, the remaining (n - p) X's can be obtained from equation (4) (or from equation (2)). Thus, the computational process consists of three phases:

(i) Calculate O^{(j-1)}_{l*k+1} and B^{(j-1)}_{l*k+j} from (5). This computation is parallel for l (0 ≤ l ≤ p - 1), but sequential for j (2 ≤ j ≤ k).


(ii) Calculate X_{(l+1)*k+1} (0 ≤ l ≤ p - 1) in parallel by using the cyclic reduction method:

X_{(l+1)*k+1} = O^{(k-1)}_{l*k+1} * X_{l*k+1} + B^{(k-1)}_{(l+1)*k+1}    (X_1 is known).

When these vectors are calculated, the results should be transferred between processors in a distributed memory MIMD system. The data communication needed is only a vector; i.e., m - 1 data items for an mth order linear recurrence problem.

(iii) Calculate X_{l*k+j}, 0 ≤ l ≤ p - 1, 2 ≤ j ≤ k, using equation (4) or (2). This can be done in parallel for l on p processors, but sequentially for j.

2.2. General linear first order recurrence

The general linear first order recurrence equation can be expressed as the determination of the sequence of x_i:

x_1 = b_1,    x_i = a_{i-1} * x_{i-1} + b_i;    i = 2, 3, ..., n,    (7)

where a_i and b_i (i = 1, 2, ..., n) are known. The above parallel method can be used to solve equation (7) on a parallel computer. It is not difficult, from equations (4)-(6), to get the following three phases to evaluate equation (7).

(i) Calculate a^{(j-1)}_{l*k+1} and b^{(j-1)}_{l*k+j}, where

a^{(j-1)}_{l*k+1} = a_{l*k+j} * a^{(j-2)}_{l*k+1},

and

b^{(j-1)}_{l*k+j} = a_{l*k+j} * b^{(j-2)}_{l*k+j-1} + b_{l*k+j}.

This computation is parallel for l (0 ≤ l ≤ p - 1), but sequential for j (2 ≤ j ≤ k).

(ii) Calculate x_{(l+1)*k+1} (0 ≤ l ≤ p - 1) in parallel by using the cyclic reduction method,

x_{(l+1)*k+1} = a^{(k-1)}_{l*k+1} * x_{l*k+1} + b^{(k-1)}_{(l+1)*k+1}    (x_1 is known).

When these values are calculated, the results should be transferred between processors in a distributed memory MIMD system.

(iii) Calculate x_{l*k+j}, where

x_{l*k+j} = a^{(j-2)}_{l*k+1} * x_{l*k+1} + b^{(j-2)}_{l*k+j},

or

x_{l*k+j} = a_{l*k+j-1} * x_{l*k+j-1} + b_{l*k+j}.

This can be done in parallel for l (0 ≤ l ≤ p - 1) on p processors, but sequentially for j (2 ≤ j ≤ k).


From the above analysis, the parallel procedure for evaluating the linear first order recurrence problem can be given in the following algorithm.

Algorithm 1

Input: Constants a_i, b_i, where 1 ≤ i ≤ n.

Output: Sequence x_1, x_2, x_3, ..., x_n.

Algorithm:

(i) Divide {a_1, a_2, a_3, ..., a_n} and {b_1, b_2, b_3, ..., b_n} into p parts and assign the lth part to processor l (0 ≤ l ≤ p - 1).

(ii) for l = 0 to p - 1 parallel do

a^{(j-1)}_{l*k+1} = a_{l*k+j} * a^{(j-2)}_{l*k+1},
b^{(j-1)}_{l*k+j} = a_{l*k+j} * b^{(j-2)}_{l*k+j-1} + b_{l*k+j},
j = 2, 3, ..., k,

where b^{(0)}_{l*k+1} = b_{l*k+1} and a^{(0)}_{l*k+1} = a_{l*k+1}.

(iii) for l = 0 to p - 2 do (can be parallelized by using the cyclic reduction method)

x_{(l+1)*k+1} = a^{(k-1)}_{l*k+1} * x_{l*k+1} + b^{(k-1)}_{(l+1)*k+1},

where x_1 = b_1.

(iv) for l = 0 to p - 1 parallel do

x_{l*k+j} = a_{l*k+j-1} * x_{l*k+j-1} + b_{l*k+j}

or

x_{l*k+j} = a^{(j-2)}_{l*k+1} * x_{l*k+1} + b^{(j-2)}_{l*k+j},

j = 2, 3, ..., k.
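A serial rendering of Algorithm 1 may help fix the indexing; the parallel-do loops are written as ordinary Python loops, arrays are 0-based (so x[i] = a[i]*x[i-1] + b[i]), and k = n/p is assumed to be an integer. The function and variable names are ours:

```python
def block_first_order(a, b, p):
    """Three-phase block evaluation of x[0] = b[0], x[i] = a[i]*x[i-1] + b[i].

    A 0-based sketch of Algorithm 1; a[0] is unused.
    """
    n = len(b)
    k = n // p
    # Phase (i): per group l, coefficients A, B with
    # x[l*k + j] = A[l][j] * x[l*k] + B[l][j]; independent across groups.
    A = [[1.0] * (k + 1) for _ in range(p)]
    B = [[0.0] * (k + 1) for _ in range(p)]
    for l in range(p):                # "parallel do" over groups
        s = l * k
        for j in range(1, k + 1):     # sequential in j
            if s + j >= n:
                break
            A[l][j] = a[s + j] * A[l][j - 1]
            B[l][j] = a[s + j] * B[l][j - 1] + b[s + j]
    # Phase (ii): short recurrence over the p group leaders only.
    x = [0.0] * n
    x[0] = b[0]
    for l in range(1, p):
        x[l * k] = A[l - 1][k] * x[(l - 1) * k] + B[l - 1][k]
    # Phase (iii): fill in the rest of each group; independent across groups.
    for l in range(p):                # "parallel do" over groups
        s = l * k
        for j in range(1, k):
            x[s + j] = A[l][j] * x[s] + B[l][j]
    return x
```

Phase (ii) is written serially here; on p processors it is the length-p recurrence that Algorithm 1 solves by cyclic reduction.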

2.3. General linear second order recurrence

The general linear second order recurrence equation can be expressed as the determination of the sequence of x_i:

x_1 = b_1,    x_0 = b_0,
x_i = a_{i-1,1} * x_{i-1} + a_{i-1,2} * x_{i-2} + b_i;    i = 2, 3, ..., n,    (8)


where a_{i,1}, a_{i,2}, and b_i (i = 1, 2, ..., n) are known. Using the same notation as in 2.1, we have the following relations. Let

e_{i-1} = a_{i-1,1},    d_{i-1} = a_{i-1,2},

X_i = [x_i, x_{i-1}]^T,    O_{i-1} = [ e_{i-1}  d_{i-1} ]    B_i = [b_i, 0]^T,    i = 2, 3, ..., n,
                                     [    1       0    ],

and

B_1 = [b_1, b_0]^T.

By using the vector-matrix form, we have

X_1 = B_1,
X_i = O_{i-1} * X_{i-1} + B_i,    i = 2, ..., n.    (9)

Define

e^{(-1)}_{l*k+1} = 1,    d^{(-1)}_{l*k+1} = 0,    b^{(-1)}_{l*k+j} = 0,
e^{(0)}_{l*k+1} = e_{l*k+1},    d^{(0)}_{l*k+1} = d_{l*k+1},    b^{(0)}_{l*k+j} = b_{l*k+j}.

The parallel method mentioned in Section 2.1 can be used to solve equation (8). It is not difficult, from equations (4)-(6), to get the following four phases for solving the second order recurrence problem.

(i) Calculate e^{(j-1)}_{l*k+1}, d^{(j-1)}_{l*k+1}, and b^{(j-1)}_{l*k+j+1}:

e^{(j-1)}_{l*k+1} = e_{l*k+j} * e^{(j-2)}_{l*k+1} + d_{l*k+j} * e^{(j-3)}_{l*k+1},
d^{(j-1)}_{l*k+1} = e_{l*k+j} * d^{(j-2)}_{l*k+1} + d_{l*k+j} * d^{(j-3)}_{l*k+1},
b^{(j-1)}_{l*k+j+1} = e_{l*k+j} * b^{(j-2)}_{l*k+j} + d_{l*k+j} * b^{(j-3)}_{l*k+j-1} + b_{l*k+j+1}.

This computation is parallel for l (0 ≤ l ≤ p - 1), but sequential for j (2 ≤ j ≤ k).

(ii) Calculate x_{(l+1)*k+1} and x_{(l+1)*k} (0 ≤ l ≤ p - 1) in parallel by using the cyclic reduction method,

x_{(l+1)*k+1} = e^{(k-1)}_{l*k+1} * x_{l*k+1} + d^{(k-1)}_{l*k+1} * x_{l*k} + b^{(k-1)}_{(l+1)*k+1},
x_{(l+1)*k} = e^{(k-2)}_{l*k+1} * x_{l*k+1} + d^{(k-2)}_{l*k+1} * x_{l*k} + b^{(k-2)}_{(l+1)*k}

(x_0 and x_1 are known).

When the values are calculated, the results should be transferred between processors in a distributed memory MIMD system. Only one data item needs to be transferred between processors.


(iii) Calculate x_{l*k+j},

x_{l*k+j} = e^{(j-2)}_{l*k+1} * x_{l*k+1} + d^{(j-2)}_{l*k+1} * x_{l*k} + b^{(j-2)}_{l*k+j},

or

x_{l*k+j} = e_{l*k+j-1} * x_{l*k+j-1} + d_{l*k+j-1} * x_{l*k+j-2} + b_{l*k+j},

0 ≤ l ≤ p - 1,    2 ≤ j ≤ k - 1;

this can be done in parallel for l on p processors, but sequentially for j.

(iv) x_n = e_{n-1} * x_{n-1} + d_{n-1} * x_{n-2} + b_n.

From the above analysis, the parallel procedure for evaluating the linear second order recurrence equation can be given in the following algorithm.

Algorithm 2

Input: Constants b_0, e_i, d_i, and b_i (1 ≤ i ≤ n).

Output: Sequence x_0, x_1, ..., x_n.

Algorithm:

(i) Divide {e_1, e_2, e_3, ..., e_n}, {d_1, d_2, d_3, ..., d_n}, and {b_1, b_2, b_3, ..., b_n} into p parts and assign the lth part to processor l (0 ≤ l ≤ p - 1).

(ii) for l = 0 to p - 1 parallel do

e^{(j-1)}_{l*k+1} = e_{l*k+j} * e^{(j-2)}_{l*k+1} + d_{l*k+j} * e^{(j-3)}_{l*k+1},
d^{(j-1)}_{l*k+1} = e_{l*k+j} * d^{(j-2)}_{l*k+1} + d_{l*k+j} * d^{(j-3)}_{l*k+1},
b^{(j-1)}_{l*k+j+1} = e_{l*k+j} * b^{(j-2)}_{l*k+j} + d_{l*k+j} * b^{(j-3)}_{l*k+j-1} + b_{l*k+j+1},
j = 2, 3, ..., k,

where e^{(-1)}_{l*k+1} = 1, d^{(-1)}_{l*k+1} = 0, b^{(-1)}_{l*k+1} = 0, and

e^{(0)}_{l*k+1} = e_{l*k+1},    d^{(0)}_{l*k+1} = d_{l*k+1},    b^{(0)}_{l*k+j} = b_{l*k+j}.

(iii) for l = 0 to p - 2 do (can be parallelized by using the cyclic reduction method)

x_{(l+1)*k+1} = e^{(k-1)}_{l*k+1} * x_{l*k+1} + d^{(k-1)}_{l*k+1} * x_{l*k} + b^{(k-1)}_{(l+1)*k+1},
x_{(l+1)*k} = e^{(k-2)}_{l*k+1} * x_{l*k+1} + d^{(k-2)}_{l*k+1} * x_{l*k} + b^{(k-2)}_{(l+1)*k},

where x_0 = b_0, x_1 = b_1.

(iv) for l = 0 to p - 1 parallel do

x_{l*k+j} = e_{l*k+j-1} * x_{l*k+j-1} + d_{l*k+j-1} * x_{l*k+j-2} + b_{l*k+j}

or

x_{l*k+j} = e^{(j-2)}_{l*k+1} * x_{l*k+1} + d^{(j-2)}_{l*k+1} * x_{l*k} + b^{(j-2)}_{l*k+j},

j = 2, 3, ..., k - 1.

(v) x_n = e_{n-1} * x_{n-1} + d_{n-1} * x_{n-2} + b_n.
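The same scheme can be rendered serially for the second order case by composing the 2x2 affine maps of equation (9) blockwise. This is an illustrative sketch of Algorithm 2 (our names and 0-based indexing; the number of steps n - 2 is assumed divisible by p), not the authors' code:

```python
def _mm(A, B):  # product of two 2x2 matrices
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def _mv(A, v):  # 2x2 matrix times 2-vector
    return [A[0][0]*v[0] + A[0][1]*v[1], A[1][0]*v[0] + A[1][1]*v[1]]

def block_second_order(e, d, b, p):
    """x[0] = b[0], x[1] = b[1], x[i] = e[i]*x[i-1] + d[i]*x[i-2] + b[i].

    Step i is the affine map X_i = O_{i-1} X_{i-1} + B_i of equation (9);
    blocks of steps are composed independently (phase i), boundary pairs
    are propagated (phase ii), then interiors are filled in (phase iii).
    """
    n = len(b)
    k = (n - 2) // p
    # Phase (i): compose each block's k maps; independent across blocks.
    maps = []
    for l in range(p):                                   # "parallel do"
        M, c = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
        for i in range(2 + l * k, 2 + (l + 1) * k):
            S = [[e[i], d[i]], [1.0, 0.0]]
            c = [y + z for y, z in zip(_mv(S, c), [b[i], 0.0])]
            M = _mm(S, M)
        maps.append((M, c))
    # Phase (ii): short recurrence over the block-boundary pairs.
    x = [0.0] * n
    x[0], x[1] = b[0], b[1]
    v = [x[1], x[0]]
    bounds = [v]
    for M, c in maps:
        v = [y + z for y, z in zip(_mv(M, v), c)]
        bounds.append(v)
    # Phase (iii): refill each block from its starting pair; independent.
    for l in range(p):                                   # "parallel do"
        x1, x0 = bounds[l]
        for i in range(2 + l * k, 2 + (l + 1) * k):
            x[i] = e[i] * x1 + d[i] * x0 + b[i]
            x0, x1 = x1, x[i]
    return x
```

With e = d = 1 and b = 0 beyond the two seed values, this reproduces the Fibonacci numbers.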


2.4. Complexity analysis

Now we analyze the time complexity of Algorithm 1 and Algorithm 2.

2.4.1. Linear first order recurrence. The time needed for evaluating a linear first order recurrence equation with n elements on one processor is

T_1 = (T_* + T_+) * n,

where T_+ and T_* are the times needed for one addition and one multiplication, respectively. The time needed for Algorithm 1 on a parallel machine with parallelism p should be:

T_p = (k - 1)*(2*T_* + T_+) + 2 log p * (T_* + T_+) + (k - 2)*(T_* + T_+).

Assume T_* = T_+, and notice that usually k > 2; thus, we have

T_p ≈ k*(3*T_* + 2*T_+) + 2 log p * (T_* + T_+)
    ≈ 5*k*T_+ + 4*log p * T_+,

and

S_p = T_1/T_p ≈ 2*n*T_+ / (5*k*T_+ + 4*log p * T_+) ≈ 0.4*p / (1 + 0.8*p*log p / n).

The term 0.8*p*log p / n represents the sequential fraction, i.e., the nonparallel part of the computation. When n >> p, S_p ≈ 0.4*p and E_p ≈ 40%. In this case this term can be ignored, and the speedup is a linear function of the number of processors.
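The prediction above is easy to evaluate numerically; a small sketch (names ours) using the equivalent closed forms S_p = 2n/(5k + 4 log_2 p) = 0.4p/(1 + 0.8 p log_2 p / n):

```python
import math

def predicted_speedup(n, p):
    # Predicted S_p for Algorithm 1, valid in the regime k = n/p > 2.
    k = n / p
    return 2 * n / (5 * k + 4 * math.log2(p))
```

For n = 10**7 and p = 8 this evaluates to just under 3.2 = 0.4 * p, showing how quickly the sequential fraction becomes negligible as n grows.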

When k = 2, n = 2p,

T_p ≈ 3*T_+ + 4*log p * T_+,

hence

S_p ≈ 4*p / (3 + 4*log p).

When k = 1, n = p,

T_p ≈ 4*log p * T_+,

hence

S_p ≈ p / (2*log p).

2.4.2. Linear second order recurrence. Using the same analysis as in the last section we can obtain the following results. The time needed for evaluating a linear second order recurrence equation with n elements on one processor is

T_1 = (2*T_* + 2*T_+) * n.

The time needed for Algorithm 2 on a parallel machine with parallelism p should be:

T_p = (k - 1)*(6*T_* + 4*T_+) + 2 log p * (2*T_* + 2*T_+) + 2*(k - 2)*(T_* + T_+).


Using the same assumptions, we have:

T_p ≈ k*(8*T_* + 6*T_+) + 2 log p * (2*T_* + 2*T_+)
    ≈ 14*k*T_+ + 8*log p * T_+,

and

S_p = T_1/T_p ≈ 4*n*T_+ / (14*k*T_+ + 8*log p * T_+) ≈ 2*p / (7 + 4*p*log p / n) = (2/7)*p / (1 + (4/7)*p*log p / n).

The term (4/7)*p*log p / n represents the sequential fraction. When n >> p, S_p ≈ (2/7)*p and E_p ≈ 29%; the speedup is a linear function of the number of processors.

When k = 2, n = 2p,

T_p ≈ 10*T_+ + 8*log p * T_+,

hence

S_p ≈ 8*p / (10 + 8*log p) = 4*p / (5 + 4*log p).

When k = 1, n = p,

T_p ≈ 8*log p * T_+,

hence

S_p ≈ 4*p / (8*log p) = p / (2*log p).

One can easily find that for a general linear mth order recurrence problem, the efficiency and speedup decrease with the order m. For the widely used first and second order linear recurrence equations they are satisfactory. In the next section we apply this method to symmetric tridiagonal eigenvalue problems.

3. The symmetric tridiagonal eigenvalue problem

Given a real symmetric tridiagonal matrix

        [ c_1  b_2                    ]
        [ b_2  c_2  b_3               ]
    A = [      b_3   .     .          ]    (10)
        [            .     .     b_n  ]
        [                  b_n   c_n  ]

our purpose is to solve the eigenvalue problem

A * X = λ * X,    (11)


where λ is an eigenvalue of A, and X is its corresponding eigenvector. When only a few eigenvalues and/or eigenvectors are desired, the bisection method based on a Sturm sequence [15] is appropriate.

Let A_r denote the leading r-by-r principal submatrix of A, and define the polynomials f_0, f_1, ..., f_n by

f_0 = 1,
f_r(λ) = det(A_r - λI),

for r = 1, 2, ..., n. A simple determinantal expansion can be used to show that:

f_1(λ) = c_1 - λ,    (12)
f_r(λ) = (c_r - λ) * f_{r-1}(λ) - b_r^2 * f_{r-2}(λ),    r = 2, ..., n.

Because f_n(λ) can be evaluated in O(n) flops, it is feasible to find its roots using the bisection technique. When the kth largest eigenvalue of A (k an integer) is desired, the bisection idea and the Sturm sequence property can be adopted. It is pointed out that if the tridiagonal matrix A in (10) is unreduced, then the eigenvalues of A_{r-1} strictly separate the eigenvalues of A_r:

λ_r(A_r) < λ_{r-1}(A_{r-1}) < λ_{r-1}(A_r) < ... < λ_2(A_r) < λ_1(A_{r-1}) < λ_1(A_r).

Moreover, if q(λ) denotes the number of sign changes in the sequence

{f_0(λ), f_1(λ), ..., f_n(λ)},

then q(λ) equals the number of A's eigenvalues that are less than λ. (Convention: f_r(λ) has the opposite sign from f_{r-1}(λ) if f_r(λ) = 0.)

Let

g_0 = 1,
g_1 = c_1 - λ,
g_i = (c_i - λ) * g_{i-1} - b_i^2 * g_{i-2},    i = 2, 3, ..., n,    (13)

where g_i = f_i(λ), i = 0, 1, 2, ..., n. By using the Sturm sequence method, the solution involves repeated evaluation of the recurrence equation (13). For different values of λ, the number of negative g_i's, i = 0, 1, 2, ..., n, indicates the number of eigenvalues below the sample point λ. Repeated application of this procedure will separate the eigenvalue spectrum into small subintervals of size ε (predefined) that contain one or more eigenvalues λ.
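The count q(λ) can be sketched directly from (13); the indexing and names are ours, and no scaling is applied, so for large n the f_r may overflow (a real implementation would rescale):

```python
def sturm_count(c, b, lam):
    """Number of eigenvalues below lam of the symmetric tridiagonal matrix
    with diagonal c[0..n-1] and off-diagonal b[1..n-1] (b[0] is unused),
    via sign changes in the Sturm sequence of equation (13).
    """
    f = [1.0, c[0] - lam]                       # f_0, f_1
    for r in range(2, len(c) + 1):
        f.append((c[r - 1] - lam) * f[r - 1] - b[r - 1] ** 2 * f[r - 2])
    changes, prev = 0, f[0]
    for v in f[1:]:
        if v == 0.0:
            v = -prev        # convention: opposite sign from the predecessor
        if (v < 0) != (prev < 0):
            changes += 1
        prev = v
    return changes
```

For the 2-by-2 matrix [[2, -1], [-1, 2]], with eigenvalues 1 and 3, the count is 0 below 1, then 1 between the eigenvalues, then 2 above 3.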

Let

O_{i-1} = [ e_{i-1}  d_{i-1} ]
          [    1        0    ],

where

e_{i-1} = c_i - λ,    d_{i-1} = -b_i^2,


then equation (13) can be written as

[g_i, g_{i-1}]^T = O_{i-1} * [g_{i-1}, g_{i-2}]^T,    i = 2, ..., n.

Let

X_i = [g_i, g_{i-1}]^T,    (14)

then

X_i = O_{i-1} * X_{i-1},    i = 2, ..., n.

Obviously,

X_1 = [g_1, g_0]^T,

and this is a general linear recurrence problem. It is a special case of equation (3), where B_i = 0 (i ≥ 2) and m = 2, so the new parallel method of the last section can be used to solve the symmetric tridiagonal eigenvalue problem. From equations (4)-(6) and (14), if we divide the vectors X_i = [g_i, g_{i-1}]^T into p groups, each with k vectors, and define:

e^{(-1)}_{l*k+1} = 1,    d^{(-1)}_{l*k+1} = 0,
e^{(0)}_{l*k+j} = e_{l*k+j},    d^{(0)}_{l*k+j} = d_{l*k+j},

we have

e^{(j-2)}_{l*k+1} = e_{l*k+j-1} * e^{(j-3)}_{l*k+1} + d_{l*k+j-1} * e^{(j-4)}_{l*k+1},
d^{(j-2)}_{l*k+1} = e_{l*k+j-1} * d^{(j-3)}_{l*k+1} + d_{l*k+j-1} * d^{(j-4)}_{l*k+1},

O^{(j-2)}_{l*k+1} = [ e^{(j-2)}_{l*k+1}  d^{(j-2)}_{l*k+1} ]
                    [ e^{(j-3)}_{l*k+1}  d^{(j-3)}_{l*k+1} ],

l = 0, 1, ..., p - 1, and j = 2, 3, ..., k + 1,

and B_{l*k+j} = 0.

Substituting O^{(k-1)}_{l*k+1} into equation (6), we have

X_{(l+1)*k+1} = [g_{(l+1)*k+1}, g_{(l+1)*k}]^T = O^{(k-1)}_{l*k+1} * [g_{l*k+1}, g_{l*k}]^T,    l = 0, 1, ..., p - 2.    (15)

As g_1 and g_0 are known, we can calculate all the g_{(l+1)*k+1}'s and g_{(l+1)*k}'s from equation (15). Subsequently, all the other g_i's can be calculated using the following equations:

X_{l*k+j} = O_{l*k+j-1} * X_{l*k+j-1}    (16)

or

X_{l*k+j} = O^{(j-2)}_{l*k+1} * X_{l*k+1},    (17)


j = 2, 3, ..., k.

In fact this can be simply done by using the following formulas:

g_{l*k+j} = e_{l*k+j-1} * g_{l*k+j-1} + d_{l*k+j-1} * g_{l*k+j-2}    (18)

or

g_{l*k+j} = e^{(j-2)}_{l*k+1} * g_{l*k+1} + d^{(j-2)}_{l*k+1} * g_{l*k}.    (19)

This calculation can be done in parallel by the traditional cyclic reduction method. From the above analysis, the parallel procedure for determining the Sturm sequence at point λ can be given in the following algorithm.

Algorithm 3

Input:

(i) Sample point λ.
(ii) A symmetric tridiagonal matrix A, as in equation (10), and its order n. Suppose n = p * k.

Output: Sequence {f_0(λ), f_1(λ), ..., f_n(λ)}, where the f_i(λ)'s (0 ≤ i ≤ n) are defined as above.

Algorithm:

(i) Divide {c_1, c_2, c_3, ..., c_n} and {1, d_2, d_3, ..., d_n} into p parts and assign the lth part to processor l (0 ≤ l ≤ p - 1), where d_i = -b_i^2 (2 ≤ i ≤ n).

(ii) for l = 0 to p - 1 parallel do

e_{l*k+j} = c_{l*k+j+1} - λ,    0 ≤ j ≤ k - 1.

(iii) for l = 0 to p - 1 parallel do

e^{(j-1)}_{l*k+1} = e_{l*k+j} * e^{(j-2)}_{l*k+1} + d_{l*k+j} * e^{(j-3)}_{l*k+1},
d^{(j-1)}_{l*k+1} = e_{l*k+j} * d^{(j-2)}_{l*k+1} + d_{l*k+j} * d^{(j-3)}_{l*k+1},
j = 2, 3, ..., k,

where d^{(-1)}_{l*k+1} = 0 and e^{(-1)}_{l*k+1} = 1.


(iv) for l = 0 to p - 1 do (parallel by using the cyclic reduction method)

g_{(l+1)*k+1} = e^{(k-1)}_{l*k+1} * g_{l*k+1} + d^{(k-1)}_{l*k+1} * g_{l*k},
g_{(l+1)*k} = e^{(k-2)}_{l*k+1} * g_{l*k+1} + d^{(k-2)}_{l*k+1} * g_{l*k}.

(v) for l = 0 to p - 1 parallel do

g_{l*k+j} = e_{l*k+j-1} * g_{l*k+j-1} + d_{l*k+j-1} * g_{l*k+j-2}

or

g_{l*k+j} = e^{(j-2)}_{l*k+1} * g_{l*k+1} + d^{(j-2)}_{l*k+1} * g_{l*k},

j = 2, 3, ..., k - 1.

(vi) f_{l*k+j}(λ) = g_{l*k+j}.

Using Algorithm 3, we can obtain the parallel algorithm for determining an eigenvalue of a symmetric tridiagonal matrix. For the sake of simplicity and without loss of generality, we only calculate the largest eigenvalue of a symmetric tridiagonal matrix.

Algorithm 4

Input:

(i) A symmetric tridiagonal matrix A, as in Algorithm 3, and its order n.
(ii) An interval [λ_1, λ_2] containing the largest eigenvalue λ, and a small positive number ε.

Output: The largest eigenvalue λ satisfying the accuracy ε.

Algorithm:

(i) Calculate the sequence {f_0(λ), f_1(λ), ..., f_n(λ)}, λ = (λ_1 + λ_2)/2, using Algorithm 3.
(ii) Calculate the number of sign changes in the above sequence, denoted by q(λ). If q(λ) = n, let λ_2 = λ; if q(λ) = n - 1, let λ_1 = λ.
(iii) If |λ_1 - λ_2| ≥ ε, go to (i).
(iv) λ = (λ_1 + λ_2)/2 is the largest eigenvalue.
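A serial sketch of Algorithm 4, with the Sturm count inlined in place of the parallel Algorithm 3 (names and the zero-sign handling are ours; exact hits on an eigenvalue during bisection are harmless in practice):

```python
def largest_eigenvalue(c, b, lo, hi, eps=1e-10):
    """Bisection for the largest eigenvalue of the symmetric tridiagonal
    matrix with diagonal c and off-diagonal b (b[0] unused); [lo, hi] must
    bracket it.  q(lam) = sign changes in the Sturm sequence (13).
    """
    def q(lam):
        f = [1.0, c[0] - lam]
        for r in range(2, len(c) + 1):
            f.append((c[r - 1] - lam) * f[r - 1] - b[r - 1] ** 2 * f[r - 2])
        changes, prev = 0, f[0]
        for v in f[1:]:
            if v == 0.0:
                v = -prev                  # sign convention from Section 3
            if (v < 0) != (prev < 0):
                changes += 1
            prev = v
        return changes

    n = len(c)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if q(mid) == n:     # all n eigenvalues lie below mid
            hi = mid
        else:               # the largest eigenvalue is at or above mid
            lo = mid
    return (lo + hi) / 2
```

For the 3-by-3 matrix with diagonal 2 and off-diagonals -1, whose largest eigenvalue is 2 + sqrt(2), the bisection converges to that value within eps.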

Next we analyze the time complexity of Algorithm 3 and its efficiency. The time needed for evaluating a Sturm sequence on one processor is

T_1 = (2*T_* + 2*T_+) * n.

The time needed for Algorithm 3 on a parallel machine with parallelism p is:

T_p = k*T_+ + (k - 1)*(4*T_* + 2*T_+) + 2 log p * (2*T_* + T_+) + (k - 2)*(2*T_* + T_+)
    ≈ k*(6*T_* + 4*T_+) + 2 log p * (2*T_* + T_+)
    ≈ 10*k*T_+ + 6*log p * T_+,


and T1 ≈ 4n*T_+. Thus the speedup is

Sp = T1/Tp ≈ 4n*T_+ / (10k*T_+ + 6 log p * T_+) = 0.4p / (1 + 0.6*p*log p / n).

When n >> p, Sp ≈ 0.4p and Ep ≈ 40%; the speedup is a linear function of the number of processors. When k = 2, n = 2p,

Tp = 8T_+ + 6 log p * T_+,   Sp = 4p/(4 + 3 log p).

When k = 1, n = p,

Tp = (1 + 6 log p)*T_+,   Sp = 4p/(1 + 6 log p) ≈ (2/3)(p/log p).
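For reference, the predicted speedup formula above can be evaluated directly. The small helper below assumes, as in the derivation, that T_* = T_+, and takes the logarithm base 2 (an assumption on our part; the paper does not state the base):

```python
import math

def predicted_speedup(n, p):
    """Sp = 0.4*p / (1 + 0.6*p*log2(p)/n) for the k = n/p block
    partitioning of the Sturm sequence evaluation (T_* = T_+ assumed)."""
    return 0.4 * p / (1.0 + 0.6 * p * math.log2(p) / n)
```

For n >> p this tends to 0.4*p, reproducing the theoretical figures of 1.6 for p = 4 and 3.2 for p = 8 used in Section 4.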

4. Experimental results

We have run test programs on the Sequent Balance 8000 parallel processing system. This is a shared memory MIMD machine [13]. The system has 10 processors; one is dedicated to the operating system. This is a practical multiprocessor system. It incorporates multiple identical processors (CPUs) and a single common memory. The Sequent CPUs are general-purpose, 32-bit microprocessors. They are tightly coupled. All processors share a single pool of memory to enhance resource sharing and communication among different processors.

Sequent systems support two basic kinds of parallel programming: multiprogramming and multitasking. Multiprogramming is an operating system feature that allows a computer to execute unrelated programs concurrently. Multitasking is a programming technique that allows a single application to consist of multiple processes executing concurrently, which improves the execution speed of the individual program. The Sequent system supports multitasking by allowing a single application to consist of multiple, closely cooperating processes. The Sequent language software includes multitasking extensions to C, Pascal, and FORTRAN, supported by the Parallel Programming Library. Sequent FORTRAN includes a set of special directives for parallel loops. With these directives, we mark the loops to be executed in parallel and classify loop variables so that data is passed correctly between loop iterations. The FORTRAN compiler interprets the directives and restructures the source code for data partitioning. Apart from these directives and the Parallel Programming Library, a parallel program is distinguished from a sequential one only by the compiler option -mp.

In adapting an application for multitasking, we often have two goals: run as much of the program in parallel as possible, and balance the computational load as evenly as possible among the parallel processes. For efficient multitasking we have to choose the right programming method. Most applications naturally lend themselves to one of two multitasking methods: data partitioning and function partitioning. Data partitioning involves creating multiple identical processes and assigning a portion of the data to each process. Function partitioning, on the other hand, involves creating multiple unique processes and having them simultaneously perform different operations on a shared data set. Data partitioning is the appropriate choice for our new parallel algorithm: it is realized by executing the outer loop iterations in parallel in steps (ii) and (iv) of Algorithms 1 and 2, and in steps (ii), (iii), and (v) of Algorithm 3.

Table 1. Test results for first order recurrence

  n     p   Tp (s)   Sp    Ep (%)   η (%)
  960   1   0.034   1.00   100.0   100.0
        4   0.027   1.26   31.48    78.70
        5   0.022   1.55   30.91    77.27
        6   0.020   1.70   28.33    70.83
        8   0.017   2.00   25.00    62.50
 1920   1   0.068   1.00   100.0   100.0
        4   0.050   1.36   34.00    85.00
        5   0.042   1.62   32.38    80.95
        6   0.038   1.79   29.82    74.56
        8   0.033   2.06   25.76    64.39
 3840   1   0.137   1.00   100.0   100.0
        4   0.098   1.40   34.95    87.37
        5   0.079   1.73   34.68    86.71
        6   0.070   1.96   32.62    81.55
        8   0.063   2.17   27.18    67.96
 7680   1   0.271   1.00   100.0   100.0
        4   0.190   1.43   35.66    89.14
        5   0.152   1.78   35.66    89.14
        6   0.138   1.96   32.73    81.81
        8   0.115   2.36   29.46    73.64
11520   1   0.410   1.00   100.0   100.0
        4   0.280   1.46   36.61    91.52
        5   0.228   1.80   35.96    89.71
        6   0.200   2.05   34.17    85.42
        8   0.160   2.56   32.03    80.08
15360   1   0.551   1.00   100.0   100.0
        4   0.370   1.49   37.23    93.07
        5   0.300   1.84   36.73    91.83
        6   0.260   2.12   35.32    88.30
        8   0.200   2.76   34.44    86.09
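As a concrete illustration of data partitioning for a first order recurrence x_i = a_i*x_{i-1} + b_i, the sketch below proceeds in three phases: each block is reduced to one composed affine map, a p-step scan combines the maps across block boundaries, and each block then fills in its interior values. The two block loops are the parts the algorithm executes in parallel; this is our own illustrative code, not the paper's:

```python
def recurrence_blocked(a, b, x0, p):
    """Evaluate x_i = a[i]*x_{i-1} + b[i] (i = 0..n-1) by block partitioning.
    The loops over the p blocks correspond to the steps executed in
    parallel; only the p-step boundary scan is inherently sequential."""
    n = len(a)
    k = n // p                      # assume p divides n, as in the paper
    # Phase 1 (parallel over blocks): reduce each block to one affine map
    maps = []
    for l in range(p):
        A, B = 1.0, 0.0             # composed map: x_end = A*x_start + B
        for i in range(l * k, (l + 1) * k):
            A, B = a[i] * A, a[i] * B + b[i]
        maps.append((A, B))
    # Phase 2 (sequential scan of length p): block boundary values
    bounds = [x0]
    for A, B in maps:
        bounds.append(A * bounds[-1] + B)
    # Phase 3 (parallel over blocks): fill in the interior values
    x = [0.0] * n
    for l in range(p):
        prev = bounds[l]
        for i in range(l * k, (l + 1) * k):
            prev = a[i] * prev + b[i]
            x[i] = prev
    return x
```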

For different sizes of n, we evaluate linear first and second order recurrence equations by using the parallel Algorithms 1 and 2 as well as their corresponding sequential evaluations. The parallelism is achieved by using the Parallel Programming Library function m_fork on the Sequent B8000. Different numbers of processors are used, and their CPU times are recorded. The measured time, the speedup, and the efficiency for different n (problem size) and p (number of processors) are given in Tables 1 and 2. In each case, 5 runs were made, and the measured time is the average of the 3 runs remaining after discarding the largest and the smallest figures. It can be seen that, for the linear first order recurrence problem, the achieved results for the speedup and efficiency are within 62% and 93% of the predicted results (see η in Table 1). For a fixed n, the efficiency decreases as the number of processors increases; this can be attributed to the sequential work and system overhead. For a fixed p, the efficiency increases with n. The speedup versus the number of processors is plotted in Figure 1(a), and that versus the problem size in Figure 1(b). For p = 4, the theoretical speedup should be 1.6. The practical results are between 1.26 and 1.49; they are within 78.70% and 93.07% of the predicted results. For p = 8, the theoretical speedup should be 3.2. The practical results are between 2.00 and 2.76; they are within 62.50% and 86.09% of the predicted results. This is because the influence of sequential computation and system overhead increases with p.

Table 2. Test results for second order recurrence

  n     p   Tp (s)   Sp    Ep (%)   η (%)
 1920   1   0.110   1.00   100.0   100.0
        6   0.075   1.47   24.44    84.29
        8   0.060   1.83   22.92    79.02
 3840   1   0.220   1.00   100.0   100.0
        6   0.145   1.52   25.29    87.20
        8   0.115   1.91   23.91    82.46
 7680   1   0.440   1.00   100.0   100.0
        6   0.270   1.63   27.16    93.66
        8   0.215   2.05   25.58    88.21
11520   1   0.685   1.00   100.0   100.0
        6   0.415   1.65   27.51    94.86
        8   0.320   2.14   26.76    92.27
15360   1   0.890   1.00   100.0   100.0
        6   0.530   1.68   27.99    96.51
        8   0.410   2.17   27.13    93.57

For the linear second order recurrence problem, the achieved results for the speedup and efficiency are within 70% and 96% of the predicted results (see η in Table 2). For a fixed n, the efficiency decreases as the number of processors increases. For a fixed p, the efficiency increases with n. The speedup versus the number of processors is plotted in Figure 2(a), and that versus the problem size in Figure 2(b). For p = 6, the theoretical speedup should be 1.7. The practical results are between 1.47 and 1.68. They are within 84.29% and 96.51% of the predicted


Figure 1. (a) Speedup versus the number of processors p for the first order recurrence (curves for n = 7680, 11520, 15360). (b) Speedup versus the problem size n for the first order recurrence (curves for p = 4 and p = 8).


Figure 2. (a) Speedup versus the number of processors p for the second order recurrence (curves for n = 7680, 11520, 15360). (b) Speedup versus the problem size n for the second order recurrence (curves for p = 6 and p = 8).


Table 3. Test results for the eigenvalue problem (n: size of matrix; p: number of processors)

Table 3(a): EXECUTION TIMES (in seconds)

 p \ n    480     960    1920    3840    7680   11520   15360
 1       7.05   14.11   28.22   56.42  112.88  169.88  225.94
 4       4.79    9.28   18.30   36.29   72.33  108.38  143.30
 5       4.24    8.20   16.13   32.00   63.64   95.34  126.69
 6       3.91    7.50   14.75   29.14   58.00   86.90  115.50
 8       3.51    7.00   14.00   27.53   53.47   78.54  104.09

Table 3(b): SPEEDUPS

 p \ n    480     960    1920    3840    7680   11520   15360
 4       1.47    1.52    1.54    1.55    1.56    1.57    1.58
 5       1.66    1.72    1.75    1.76    1.77    1.78    1.78
 6       1.80    1.88    1.91    1.94    1.95    1.95    1.96
 8       2.01    2.02    2.02    2.05    2.11    2.16    2.17

Table 3(c): EFFICIENCY (%)

 p \ n    480     960    1920    3840    7680   11520   15360
 4      36.80   38.01   38.55   38.87   39.02   39.19   39.41
 5      33.25   34.41   34.99   35.26   35.47   35.64   35.67
 6      30.05   31.35   31.89   32.27   32.44   32.58   32.60
 8      25.14   25.20   25.20   25.62   26.39   27.04   27.13

results. For p = 8, the theoretical speedup should be 2.3. The practical results are between 1.83 and 2.17. They are within 79.02% and 93.57% of the predicted results. These results are consistent with the theoretical analysis in Section 2.

For a given symmetric tridiagonal matrix and different sizes n, we evaluate its largest eigenvalue by the bisection method using Algorithms 3 and 4, and compare the results with the corresponding sequential algorithm. The parallelism is achieved by using the Parallel Programming Library function m_fork on the Sequent B8000. The execution time, speedup, and efficiency for different n (matrix size) and p are given in Table 3. It can be seen that the achieved results for the speedup and efficiency are within 62% and 98% of the predicted results (see Figure 3(a)). For a


Figure 3. (a) Speedup versus the number of processors p for the tridiagonal eigenvalue problem (curves for n = 11520, 15360). (b) Speedup versus the matrix size n for the tridiagonal eigenvalue problem (curves for p = 4 and p = 8).


fixed n, the efficiency decreases as the number of processors increases. For a fixed p, the efficiency increases with n. Two cases (p = 4 and p = 8) are plotted in Figure 3(b). For p = 4, the theoretical speedup should be 1.6. The practical results are between 1.47 and 1.58; they are within 91.99% and 98.54% of the predicted results. For p = 8, the theoretical speedup should be 3.2. The practical results are between 2.01 and 2.17; they are within 62.77% and 67.83% of the predicted results. This is because the influence of the sequential computation in the new algorithm and of the system overhead increases with p. These results are consistent with the theoretical analysis in the previous section. The sequential part of these algorithms is by no means restricted to execution on one processor. For example, the calculation of x_{(l+1)*k+1} and x_{(l+1)*k} can be done in parallel. For large values of p, this calculation can be parallelized recursively.
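Such a recursive parallelization of the boundary scan amounts to a parallel prefix over the composed block maps (cf. Ladner and Fischer [12]). Below is a recursive-doubling sketch over affine maps x -> A*x + B; the inner loop of each sweep is independent across j and is what would run in parallel. The representation and names are our own illustrations:

```python
def compose(m1, m2):
    """Compose affine maps (A, B): apply m1 first, then m2."""
    (A1, B1), (A2, B2) = m1, m2
    return (A2 * A1, A2 * B1 + B2)

def prefix_maps(maps):
    """Inclusive prefix composition in ceil(log2(p)) doubling sweeps."""
    res = list(maps)
    d = 1
    while d < len(res):
        prev = list(res)            # all updates read the previous sweep
        for j in range(d, len(res)):
            res[j] = compose(prev[j - d], prev[j])
        d *= 2
    return res
```

Applying the j-th prefix map to the initial value x_0 yields the block boundary value x_{(j+1)*k}, so the p-step sequential scan is replaced by O(log p) parallel sweeps.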

5. Conclusions

A new parallel method for evaluating the general linear mth order recurrence problem has been proposed and analyzed in terms of its performance in an MIMD environment, and its application to the symmetric tridiagonal eigenvalue problem has been presented. The experimental results achieve the expected performance and testify to the efficiency of the parallel algorithm. The results also show that the overhead and the influence of the sequential fraction of a parallel algorithm in an MIMD computation become less significant as the ratio between the problem size and the number of processors p increases.

In conclusion, we note that the segmentation (or partitioning) method can be used to parallelize many numerical computation problems. One open problem is how to choose the partitioning strategy so as to minimize the overhead; much research work remains to be done.

References

[1] R. H. Barlow and D. J. Evans, A parallel organization of the bisection algorithm, Comp. J., 22, pp. 267-269, Aug. 1979.

[2] R. H. Barlow, D. J. Evans, and J. Shanehchi, Parallel multisection applied to the eigenvalue problem, Comp. J., 26, pp. 6-9, Feb. 1983.

[3] B. L. Buzbee, G. H. Golub, and C. W. Nielson, On direct methods for solving Poisson's equations, SIAM J. Numer. Anal., 7, pp. 627-656, Dec. 1970.

[4] D. J. Evans, Design of numerical algorithms for supercomputers, in Scientific Computing on Supercomputers, J. T. Devreese et al., ed., 1989, pp. 101-131.

[5] D. Gajski, An algorithm for solving linear recurrence systems on parallel and pipelined machines, IEEE Trans. on Computers, C-30, March 1981.

[6] Q. S. Gao, X. Zhang, and J. M. Wang, Another efficient parallel algorithm for recurrence problem, Computer Appl. and Applied Math., China, 4, pp. 11-15, 1978.

[7] D. Heller, A survey of parallel algorithms in numerical linear algebra, SIAM Review, 20, pp. 740-777, Oct. 1978.

Page 24: A parallel algorithm for evaluating general linear recurrence equationscesg.tamu.edu/wp-content/uploads/2012/02/90.-PARALLEL... · 2020. 3. 20. · 1. Introduction The general linear

504 Lu, QIAO, AND CHEN

[8] R. W. Hockney, A fast direct solution of Poisson's equation using Fourier analysis, J. Assoc. Comput. Mach., 12, pp. 95-113, Jan. 1965.

[9] R. W. Hockney and C. R. Jesshope, Parallel Computers: Architecture, Programming and Algorithms, Adam Hilger, Bristol, England, 1981.

[10] L. Hyafil and H. T. Kung, The complexity of parallel evaluation of linear recurrences, J. ACM, 24, pp. 513-521, July 1977.

[11] P. M. Kogge and H. S. Stone, A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Trans. on Computers, C-22, pp. 786-793, Aug. 1973.

[12] R. Ladner and M. Fischer, Parallel prefix computation, J. ACM, 27, pp. 831-838, Oct. 1980.

[13] Sequent Computer Systems, Inc., Sequent Guide to Parallel Programming, 1986.

[14] H. Wang and A. Nicolau, Speedup of band linear recurrences in the presence of resource constraints, Proc. 6th ACM International Conference on Supercomputing, pp. 466-477, 1992.

[15] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, New York, 1988.