
Proceedings of the Euromicro Workshop on Parallel and Distributed Processing, San Remo, Italy, 25-27 Jan. 1995 (IEEE Computer Society Press).



Parallel QR Algorithm for the Complete Eigensystem of Symmetric Matrices

Rui M. S. Ralha
Departamento de Matemática

Universidade do Minho Campus de Gualtar

4700 Braga, Portugal E-mail: [email protected]

Abstract

In this paper we propose a parallel organization of the QR algorithm for computing the complete eigensystem of symmetric matrices. We developed Occam versions of standard sequential implementations of the QR algorithm: the procedure qr1, which computes only eigenvalues, and qr2, for the computation of all eigenvalues and eigenvectors. The Occam procedure parqr2 is a parallel implementation of qr2 and was tested on a pipeline of 16 transputers. Although parqr2 could be used to compute the eigenvalues and eigenvectors of a symmetric tridiagonal matrix, it is best suited to be used in conjunction with a parallel algorithm for the reduction of a dense symmetric matrix to tridiagonal form in which the orthogonal transformations are accumulated explicitly. We have proposed a parallel algorithm for this purpose in [15]. In the practical tests parqr2 has proved to be efficient, and we have carried out a simple analysis that appears to indicate that it is possible to use efficiently a number p of processors of the same order of magnitude as the size n of the matrix (p ≤ n/6). This is an interesting result from the point of view of the scalability of our parallel algorithm.

Keywords: eigenvalues, eigenvectors, QR algorithm, Occam, transputer.

1 Introduction

The QR algorithm, first proposed in [1], is very efficient for calculating the eigenvalues of symmetric tridiagonal matrices. The procedures tql1, for computing eigenvalues only, and tql2, for computing the complete eigensystem (i.e., eigenvalues and eigenvectors), given in [3], are ALGOL implementations of the QL algorithm, which is theoretically equivalent to the QR algorithm.

Although the QR algorithm for symmetric tridiagonal matrices is sequential in nature, the authors of [6] presented a parallel QR algorithm; it is, however, not well suited to multiprocessors with distributed memory (only the arithmetic costs of that algorithm were considered and no attention was given to the movement of data).

In the present work no attempt will be made to develop a parallel implementation of tql1 (after all, the arithmetic complexity of this algorithm is only O(n²)), but the parallel computation of the complete eigensystem, which involves a number of arithmetic operations proportional to n³, will be considered.

2 The QR algorithm with Wilkinson’s shift

The QR algorithm is based on the fact that if the following decomposition holds:

    A = Q R                                            (1)

with Q an orthogonal matrix and R upper triangular, then the matrix

    R Q = Q^T A Q                                      (2)

is similar to A (has the same eigenvalues). Representing by A_0 the initial matrix A, the iterative algorithm in its simplest form is given by:

    Q_k^T A_k = R_k,   A_{k+1} := R_k Q_k,   k = 0, 1, ...   (3)

In general, when A has n distinct eigenvalues, A_k tends to the diagonal matrix of the eigenvalues. Although the QR algorithm is quite expensive for general matrices (O(n³) flops per iteration), it assumes a much more economical form in the case of symmetric tridiagonal matrices (O(n) flops per iteration). Since a full symmetric matrix can be reduced to the tridiagonal form by a sequence of n − 2 steps, the iterations expressed in (3) are always carried out with the tridiagonal form.
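The basic iteration (3) can be sketched in a few lines. The following Python fragment is an illustration only (not the paper's Occam code); it uses numpy's QR factorization and an arbitrary 4x4 symmetric tridiagonal test matrix:

```python
import numpy as np

def qr_iteration(A, steps):
    """Repeat A_{k+1} = R_k Q_k; for symmetric A with distinct eigenvalues
    the iterates tend to the diagonal matrix of the eigenvalues."""
    Ak = A.copy()
    for _ in range(steps):
        Q, R = np.linalg.qr(Ak)   # the decomposition (1)
        Ak = R @ Q                # the iteration (3)
    return Ak

# An illustrative 4x4 symmetric tridiagonal matrix.
A = np.diag([4.0, 3.0, 2.0, 1.0]) + np.diag([1.0, 1.0, 1.0], 1) \
    + np.diag([1.0, 1.0, 1.0], -1)
A200 = qr_iteration(A, 200)
off = np.abs(A200 - np.diag(np.diag(A200))).max()
# off is now negligible and the diagonal of A200 holds the eigenvalues.
```

Without shifts many iterations are needed, which is precisely the motivation for the shifting strategy discussed next.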

Let A_k be the following matrix, where the suffix k is omitted in the elements, for simplicity:

    | d_1  e_2                           |
    | e_2  d_2  e_3                      |
    |      e_3  ...  ...                 |
    |           ...  d_{n-1}  e_n        |     (4)
    |                e_n      d_n        |

1066-6192/95 $4.00 © 1995 IEEE


It is known [4] that each e_i tends to zero like

    κ_i |λ_i / λ_{i-1}|^k                              (5)

where κ_i is a constant and the eigenvalues are in decreasing order of their magnitude, i.e.,

    |λ_{i+1}| ≤ |λ_i|,   i = 1, ..., n − 1.

The speed of convergence of the algorithm in this form is generally inadequate; it is very much improved by working with

    A_k − σ_k I                                        (6)

where σ_k is a suitable constant (the origin shift) chosen at each stage; now the contribution of the kth step to the convergence of the off-diagonal elements e_i is determined by

    |(λ_i − σ_k) / (λ_{i-1} − σ_k)|                    (7)

and if σ_k is chosen to be close to an eigenvalue of A_k then this eigenvalue tends to emerge rapidly at the bottom of the resulting matrix.

There are several possible choices for the shift σ_k ([7], p. 162 ff); the following one has proved to be very effective in practice: if at the current stage we are working with a matrix whose block of size 2 at the bottom is

    | d_{n-1}  e_n |
    | e_n      d_n |                                   (8)

then σ_k is taken to be the eigenvalue of this block which is closer to d_n. Wilkinson [4] has shown that this shifting strategy guarantees convergence and usually gives cubic convergence (in practice the average number of iterations per eigenvalue is often less than 2). The kth iteration of the QR algorithm is therefore given by

    A_{k+1} = Q_k^T (A_k − σ_k I) Q_k                  (9)

where Q_k is the product of plane rotations. If r is the size of A_k, i.e., assuming that n − r eigenvalues have been found in the first k iterations, we have:

    Q_k = P_1 P_2 ... P_{r-1}                          (10)
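The shift described above can be computed directly from the trailing 2x2 block. The Python sketch below (an illustration, not the Occam implementation) uses the standard numerically stable closed form for the root closer to d_n; the sample values are made up:

```python
import math

def wilkinson_shift(d_prev, d_last, e_last):
    """Eigenvalue of the 2x2 block [[d_prev, e_last], [e_last, d_last]]
    that is closer to d_last, in a numerically stable form."""
    delta = (d_prev - d_last) / 2.0
    if delta == 0.0 and e_last == 0.0:
        return d_last
    sign = 1.0 if delta >= 0.0 else -1.0
    return d_last - (e_last * e_last) / (delta + sign * math.hypot(delta, e_last))

# Illustrative values: the block [[2, 0.5], [0.5, 1]] has eigenvalues
# (3 +/- sqrt(2))/2; the one closer to d_last = 1 is (3 - sqrt(2))/2.
shift = wilkinson_shift(2.0, 1.0, 0.5)
```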

where P_i (we also omit the suffix k in these matrices) has the following form, with c_i² + s_i² = 1: P_i coincides with the identity matrix except in rows and columns i and i+1, where it contains the 2×2 rotation

    |  c_i  s_i |
    | -s_i  c_i |                                      (11)

If also the eigenvectors are to be computed, then the successive transformations must be accumulated, i.e., corresponding to the iteration (9) we carry out the transformation:

    V_{k+1} := V_k Q_k                                 (12)

where V_0 is:

• the identity matrix, if the eigenvectors of the tridiagonal matrix A_0 are to be found;

• a full orthogonal matrix, if the tridiagonal matrix A_0 is the result of a previous reduction of some matrix S to this form: A_0 := V_0^T S V_0. This will deliver the eigenvectors of the original matrix S in the array V (this is the matrix where the successive V_k, k = 0, 1, ... are overwritten).
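The accumulation of the transformations can be sketched as follows (Python with numpy, for illustration; unshifted iterations are used for simplicity, whereas the actual algorithm uses the shifted iteration (9)):

```python
import numpy as np

def qr_with_vectors(A, steps):
    """Unshifted QR iterations, accumulating V := V Q_k at every step."""
    Ak = A.copy()
    V = np.eye(A.shape[0])
    for _ in range(steps):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q
        V = V @ Q                 # the accumulation (12), with V_0 = I
    return Ak, V

A0 = np.diag([4.0, 3.0, 2.0, 1.0]) + np.diag([1.0, 1.0, 1.0], 1) \
     + np.diag([1.0, 1.0, 1.0], -1)
Ak, V = qr_with_vectors(A0, 200)
# A0 V ~= V diag(...): the columns of V approximate the eigenvectors of A0.
```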

3 A sequential Occam implementation of the QR algorithm

Two Occam procedures have been developed: qr1, which finds all the eigenvalues of a symmetric tridiagonal matrix, and qr2, for the computation of the complete eigensystem. These codes have been written in a strictly sequential style.

When producing the code for qr1 and qr2, it was our purpose to write an Occam "translation" of the ALGOL programs tql1 and tql2 given in [3]: this is not straightforward due to the use of GO TO instructions in these ALGOL codes.

If at any stage an intermediate e_i becomes negligible, i.e., smaller than a prescribed tolerance, the matrix splits into the direct sum of two submatrices, and the eigenvalues of each submatrix may be found without reference to the others (when this happens, the economy in computation can be considerable). In the ALGOL code of [3] this situation was handled by using several GO TO instructions.

In qr1 the following strategy has been used for handling the possibility of the decomposition of the matrix as a direct sum of two or more submatrices: each time some value e_i is found to be negligible and this e_i is not the current last e_i, one submatrix is "put in a stack", i.e., the following information about this submatrix is stored: the index i of the top element d_i, the size of the submatrix, and the total shift applied up to this point. To store this information the procedure qr1 requires 3 small arrays (in our code we have used a size of 5 and this has been enough even for matrices of order n = 1000).
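The splitting test itself is simple; the Python sketch below (the function name and tolerance are illustrative, and the stack bookkeeping of qr1 is omitted) locates the independent blocks of a tridiagonal matrix:

```python
def split_blocks(d, e, tol=1e-12):
    """d: diagonal (length n); e: off-diagonal (length n-1).
    Returns (start, end) index ranges of the independent submatrices."""
    blocks, start = [], 0
    for i, ei in enumerate(e):
        if abs(ei) <= tol:        # negligible off-diagonal: the matrix splits
            blocks.append((start, i + 1))
            start = i + 1
    blocks.append((start, len(d)))
    return blocks

# With the middle off-diagonal negligible, the 4x4 matrix splits in two.
blocks = split_blocks([1.0, 2.0, 3.0, 4.0], [0.5, 0.0, 0.5])
# blocks == [(0, 2), (2, 4)]
```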

We think it is not adequate to present here the full details of the implementation nor the Occam code itself. The procedure qr1 is more verbose than the ALGOL procedure tql1 but it is also much simpler to understand. Similar techniques, manipulating one or more stacks in an explicit manner, must be used whenever some form of recursion is required, because recursion is not directly supported by the Occam language.

The Occam procedure qr2 is simply obtained from qr1 by including in each QR iteration (9) the loop that produces the transformation (12).



4 Parallel computation of the eigenvectors


We said before that the QR algorithm with origin shifts has a sequential nature. It must be noted at this point that several QR iterations (9) can overlap in time (in a wavefront style) if each iteration is started before some of the previous ones are finished; however, this is not compatible with the use of the optimal shifting strategies, in which the shift produced at the end of one iteration is used in the next iteration.

However, if we want the complete eigensystem (all eigenvalues and corresponding eigenvectors) rather than only eigenvalues, there is a lot of parallelism to be exploited in the algorithm. An efficient implementation of (9) takes about 20r flops and r square roots for a matrix A_k of size r ≤ n, whereas the transformation (12) involves a total of 6n(r − 1) flops, since each post-multiplication by P_i replaces the ith and (i+1)th columns of V, v_i and v_{i+1}, with the corresponding linear combinations:

    v_i := c_i v_i − s_i v_{i+1},   v_{i+1} := s_i v_i + c_i v_{i+1}     (13)

(the right-hand sides use the old values of v_i and v_{i+1}).
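Each column update can be sketched as follows (Python, for illustration; the sign convention follows (13) and the values are made up):

```python
import numpy as np

def apply_rotation(V, i, c, s):
    """Post-multiply V by P_i: combine columns v_i and v_{i+1} as in (13)."""
    vi = V[:, i].copy()
    vj = V[:, i + 1].copy()
    V[:, i] = c * vi - s * vj
    V[:, i + 1] = s * vi + c * vj

V = np.eye(3)
apply_rotation(V, 0, 0.6, 0.8)    # c^2 + s^2 = 1
# V remains orthogonal; only columns 0 and 1 changed.
```

Each row of V is updated independently, which is what makes the row-wise distribution of V over processors possible.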

     n  |  τ_qr1  |  τ_qr2  |  (τ_qr2/τ_qr1) − 1
    45  |   0.09  |   0.87  |     8.7
    60  |   0.16  |   1.92  |    11
    75  |   0.24  |   3.59  |    14

Therefore, while the QR algorithm for calculating only the eigenvalues is an O(n²) process, the accumulation of the orthogonal transformations is an O(n³) process; consequently, for large n, it is more urgent to develop an efficient parallel algorithm to compute the eigenvectors.

The transformation (12) depends on the iteration (9), since this produces the numbers {c_i, s_i} that are used in the linear combinations (13); thus, the algorithm can be decomposed into two concurrent processes, the master and the vec, say: the master process, after finishing each iteration (9), sends the generated values {c_i, s_i} to the vec process, which uses them to update the corresponding columns of the matrix V.

Thus, we are pipelining computations: as soon as the master finishes the kth iteration it will proceed with the next iteration, and when vec terminates the transformation that corresponds to the kth iteration there will be a new set of parameters at its disposal (those produced in the (k+1)th iteration).

There is of course much more parallelism than that expressed with only two concurrent processes, since the transformation (12) with vectors of full size n can be decomposed into p identical concurrent processes, say vec(1), vec(2), ..., vec(p), where vec(i) deals with a segment of size n/p (we assume that p is an exact divisor of n, for simplicity).

In this case the role of the master is exactly the same as with one single vec, and each processor in the chain (figure 1) repeatedly receives a set {c_i, s_i} from the left, passes it along to the next vec processor and updates the local data according to expressions (13).
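The message flow just described can be simulated in Python, with threads and queues standing in for the transputers and Occam channels (the rotation stream is made up here; in the real algorithm the master produces it from the iterations (9)):

```python
import numpy as np
import queue
import threading

n, p = 8, 4                        # matrix size and number of vec processes
rng = np.random.default_rng(0)

# A made-up stream of rotation sets {c_i, s_i}: one list per "iteration",
# shrinking in length as it would when eigenvalues are deflated.
def rotation_set(r):
    out = []
    for i in range(r - 1):
        theta = rng.uniform(0.0, np.pi)
        out.append((i, np.cos(theta), np.sin(theta)))
    return out

streams = [rotation_set(n - k) for k in range(3)]

segments = np.vsplit(np.eye(n), p)             # vec(k) owns n/p rows of V
chans = [queue.Queue() for _ in range(p + 1)]  # Occam-channel stand-ins

def vec(idx):
    while True:
        msg = chans[idx].get()
        chans[idx + 1].put(msg)                # pass along to the next vec
        if msg is None:                        # termination message
            break
        for i, c, s in msg:                    # local column update, as in (13)
            vi = segments[idx][:, i].copy()
            vj = segments[idx][:, i + 1].copy()
            segments[idx][:, i] = c * vi - s * vj
            segments[idx][:, i + 1] = s * vi + c * vj

threads = [threading.Thread(target=vec, args=(k,)) for k in range(p)]
for t in threads:
    t.start()
for msg in streams:                            # the "master" side
    chans[0].put(msg)
chans[0].put(None)
for t in threads:
    t.join()
V_par = np.vstack(segments)
```

Because each rotation combines two columns row by row, the row segments can be updated completely independently, and the result equals a sequential application of the same rotations to the full matrix.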

Each message broadcast from the master containing the set {c_i, s_i} also includes the index of the first column of V to be updated, since it can be different from 1 when the matrix A_k splits into the direct sum of submatrices.

Figure 1: A chain of processors for the parallel QR algorithm.

5 Efficiency analysis

The aim of the analysis carried out in this section is to provide an upper bound for the number p of processors that can be used efficiently in the computation of the eigenvectors and also to give an estimate of the communication overhead.

5.1 Load balancing

There is an initial delay in filling up the vec pipeline (this delay is typical in pipelined computations), since the vec processors must wait idle for the first set {c_i, s_i} to be produced by the master. After this, we want every processor to be active most of the time; more exactly, we aim to avoid, as much as possible, a situation where the vec processors need to wait for the messages flowing through the pipeline.

Representing by 7:’ and 7:’ the time used, in the kth iteration, by the master and by each vec proces- sor, respectively, the optimal load-balance is achieved when T:+’) = T?’. This is hard to satisfy in practice and can be replaced by the suitable inequality:

since in this case the load-balance still is very satisfac- tory because only the master will eventually wait idle for the other processors. Although the master could run well ahead of the others it would need to buffer the sets { c i , s ; } produced in each iteration: this re- quires extra memory space, eventually not available. In our implementation the master starts the (k+l)th iteration only after vec(1) has accepted the data cor- responding to the kth iteration.

Table 1: execution time, in seconds, for qr1 (eigenvalues only) and qr2 (complete eigensystem).



The condition (14) imposes a bound on the number p of vec processors as a function of the size n of the matrix (of course, p grows with n). To get a realistic estimate for this bound, we measured the times, τ_qr1 and τ_qr2, required by the procedures qr1 and qr2 on a single transputer for different values of n. The results (in seconds) are presented in table 1; the time spent with the transformations (12) is, of course, τ_qr2 − τ_qr1, and we must have:

    τ_qr1 ≤ (τ_qr2 − τ_qr1) / p                        (15)

which is equivalent to the following:

    p ≤ τ_qr2/τ_qr1 − 1                                (16)

Since τ_qr2 grows cubically with n whereas τ_qr1 grows only quadratically, we expect the bound for p to grow linearly with n; from the values in table 1 we conclude:

    τ_qr2/τ_qr1 − 1 ≈ n/5                              (17)

and we get the following heuristic bound for p:

    p ≤ n/6                                            (18)

i.e., if p satisfies the above condition then the inequality (14), crucial for load balancing, will hold in practice for almost every value of k. It can only fail if the size, say m_k, of the matrix handled by the master in the kth iteration is smaller than the size m_{k+1} of the matrix in the following iteration. However, in general we will have m_{k+1} = m_k (when no deflation occurs) or m_{k+1} = m_k − 1 (when one eigenvalue is found in the kth iteration and deflation occurs).
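The heuristic bound can be checked against the measurements in table 1; a small Python sketch (the times are those reported above):

```python
# Measured times from table 1 (seconds): n -> (tau_qr1, tau_qr2).
table1 = {45: (0.09, 0.87), 60: (0.16, 1.92), 75: (0.24, 3.59)}

# p can usefully be at most about tau_qr2/tau_qr1 - 1: the ratio of
# eigenvector work to eigenvalue work on one processor.
bounds = {n: t2 / t1 - 1.0 for n, (t1, t2) in table1.items()}
# The bound grows roughly linearly with n, and n/6 lies safely below it.
```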

The analysis carried out shows that there is enough parallelism in the algorithm for a number of processors of the order of magnitude of the size n of the matrix; furthermore, it must be emphasised that, as long as (18) is satisfied, load balancing is not disrupted by the deflation of the matrix, because as the size of A_k in (9) shrinks, the number of columns of V_k transformed in (12) also shrinks, but not the size n of each column, which is kept constant.

5.2 Communication costs

To quantify the communication overheads we use the model given in [14], p. 269-270: sending m floating point numbers between two processors requires

    comm(m) = α_d + β_d m                              (19)

seconds; α_d is the time required to initiate a message (the set-up time) and β_d is the rate at which a message can be transferred. In our algorithm, for each floating point number that is sent from one processor to another, a total of 3n/p flops are performed. Note that α_d is a fixed cost, independent of the size of the message, and is therefore relatively less important for longer messages. The length of the messages in parqr2, the parallel implementation of qr2, varies during the execution of the algorithm: this length starts with a maximum of 2(n − 1) (for the initial matrix A_0 we need n − 1 plane rotations in (9), with two values, c_i and s_i, for each rotation) but it decreases successively down to only 2 in the last iteration (for a matrix of size 2); the average length of the messages is therefore simply n and we have, with R the number of flops per second:

    (time spent computing) / (time spent communicating) ≈ (3n²/(pR)) / (2(α_d + β_d n))     (20)

This fraction quantifies the overhead of communication relative to the volume of computation. For n very large, the set-up time becomes negligible and, even with p as large as n/6, parqr2 will be quite efficient. Furthermore, we have significantly improved the performance of the parallel algorithm by overlapping communication with computation on each transputer; this is possible since:

• the communication links of the transputer can operate concurrently with the processor;

• each vec processor can receive the set {c_i, s_i} produced in a certain iteration (9) while still carrying out the update that corresponds to the previous iteration.
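Under the model (19), the balance between computation and communication can be evaluated numerically. In the Python sketch below the constants α_d, β_d and R are illustrative assumptions, not measured transputer figures:

```python
# comm(m): seconds to send m floating point numbers, as in (19).
def comm(m, alpha_d, beta_d):
    return alpha_d + beta_d * m

def compute_to_comm_ratio(n, p, alpha_d, beta_d, R):
    # Average message: n numbers, i.e. about n/2 rotations; each rotation
    # costs a vec processor 6(n/p) flops, i.e. about 3n/p flops per number
    # sent (the factor 2 counts receiving and passing the message along).
    compute = 3.0 * n * n / (p * R)
    communicate = 2.0 * comm(n, alpha_d, beta_d)
    return compute / communicate

# Illustrative, assumed constants (NOT measured transputer figures):
alpha_d, beta_d, R = 5e-6, 1e-6, 1e6
r_small = compute_to_comm_ratio(120, 20, alpha_d, beta_d, R)
r_large = compute_to_comm_ratio(6000, 1000, alpha_d, beta_d, R)
# With p = n/6 the ratio tends to 18/(2*beta_d*R) = 9 here as the set-up
# term alpha_d becomes negligible, so the pipeline stays efficient.
```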

The gain achieved with this optimization is described in the next section.

6 Results

We tested the Occam codes qr2 (on a single transputer) and parqr2 (with a pipeline of 16 transputers) for matrices of several sizes up to n = 300 (our main limitation here was the size of the memory available on each transputer). The numerical results obtained with the parallel algorithm parqr2 coincide with those produced by the sequential code qr2 and were found to be correct by computing the residuals ||Ax − λx|| for each eigenpair (λ, x).
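The residual test can be sketched as follows (Python with numpy, for illustration; the matrix is made up, and the reference eigenpairs are obtained here from numpy's eigh, whereas in the paper they come from qr2/parqr2):

```python
import numpy as np

# An illustrative symmetric tridiagonal matrix.
A = np.diag([3.0, 2.0, 1.0]) + np.diag([0.4, 0.4], 1) + np.diag([0.4, 0.4], -1)

# Reference eigenpairs for the check.
lams, X = np.linalg.eigh(A)

# Residual ||A x - lambda x|| for each eigenpair (lambda, x).
residuals = [np.linalg.norm(A @ X[:, j] - lams[j] * X[:, j]) for j in range(3)]
```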

The efficiency is computed in the usual way:

    E = T(1) / (16 · T(16))                            (21)

where T(1) is the time necessary to execute qr2 on a single processor (represented by τ_qr2 in table 1) and T(16) is the time required by parqr2 to execute on 16 processors.

The gain obtained by overlapping communication with computation can be appreciated by comparing the values given in tables 2 and 3 (more than 10% for the larger values of n). T(1) and T(16) are given in seconds. Note that for the smaller values of n, i.e., for n < 100, the condition (18) is not satisfied and, according to our study in section 5.1, the vec processors are not fully active.



Furthermore, since in most cases n is not a multiple of p = 16, a certain number of processors will handle one more row of V than the remaining ones; the negative effect of this on the load balancing is, of course, more evident for the smaller values of n.

In the overlapped version, communication has to be decoupled from the computation; this can be done as follows, using two packets of data, packet A and packet B:

    SEQ
      input packet A
      WHILE not.finished
        SEQ
          PAR
            use packet A
            input packet B
            output packet A
          PAR
            use packet B
            input packet A
            output packet B

Some care must be taken when writing the final code to ensure correct termination: if the message for termination (not.finished := FALSE) arrives in the first PAR then the second PAR must not be executed; without this care the program would deadlock.

It is very important to stress that in both cases (tables 2 and 3) T(16) includes the time spent by the master in the process of collecting the results (the matrix of eigenvectors) from the vec pipeline. This overhead is very significant and has not been taken into account in the analysis carried out before. For the larger values of n we have observed a gain of about 7% in efficiency when this overhead was removed.

Furthermore, the allocation of the initial data to each processor poses no problem if the parallel algorithm described in this paper is used in conjunction with the reduction to tridiagonal form given in [15], since the same decomposition of the data was used there, i.e., in the parallel algorithm of [15] each processor produces the n/p rows of V that it is going to update in parqr2.

Table 2: Efficiency of parqr2 (no comm./comp. overlap).

Table 3: Efficiency of parqr2 (comm./comp. overlap):

      n  |  T(1)  | T(16) |  E
     45  |   0.87 |  0.11 | 48%
     60  |   1.92 |  0.21 | 56%
     75  |   3.59 |  0.35 | 63%
    200  |  59.3  |  4.?  |
    250  | 113.8  |  8.0  | 89%
    300  | 193.2  | 13.3  | 91%

7 Conclusions

The sequential Occam procedure qr1 is very fast for calculating all the eigenvalues of a symmetric tridiagonal matrix (a single transputer took 3.4 seconds to find the eigenvalues, with 7 decimal digits, of a matrix of size n = 300). Compared to this, the computation of the eigenvectors by accumulation of the successive orthogonal transformations (procedure qr2) takes much longer, but it provides an excellent opportunity for parallelism that has been successfully exploited in parqr2. Although the time required by parqr2 can never be less than the time required by qr1, these times can be of the same order of magnitude if many processors are used for the accumulation of the orthogonal transformations.

The analysis we have carried out suggests that more processors can be efficiently used as the size n grows, i.e., our parallel algorithm is scalable. The gain obtained in our transputer implementation by overlapping computation and communication is very significant.

References

[1] J. G. Francis, The QR Transformation. A Unitary Analogue to the LR Transformation - Part I, Computer Journal, vol. 4 (1961-62), pp. 265-271.

[2] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, 1965.

[3] H. Bowdler et al., The QR and QL Algorithms for Symmetric Matrices, Numerische Mathematik, 11, 1968, pp. 293-306.

[4] J. H. Wilkinson, Global Convergence of Tridiagonal QR Algorithm with Origin Shifts, Linear Algebra and its Applications, 1, pp. 409-420 (1968).

[5] J. H. Wilkinson and C. Reinsch, Handbook for Automatic Computation, vol. 2 (Linear Algebra), Springer-Verlag, 1971.

[6] A. H. Sameh and D. J. Kuck, A Parallel QR Algorithm for Symmetric Tridiagonal Matrices, IEEE Transactions on Computers, vol. C-26, n. 2, 1977.

[7] B. N. Parlett, The Symmetric Eigenvalue Problem, Prentice-Hall, 1980.

[8] D. S. Watkins, Understanding the QR Algorithm, SIAM Review, vol. 24, n. 4, 1982, pp. 427-440.

[9] P. Atkin, Performance Maximisation, INMOS Technical Note 17, INMOS Limited, Bristol, 1987.

[10] D. J. Pritchard et al., Practical Parallelism Using Transputer Arrays, Research Journal 1988, Department of Electronics and Computer Science, University of Southampton.

[11] Transputer Development System, INMOS Limited, Prentice Hall, 1988.

[12] Occam 2 Reference Manual, INMOS Limited, Prentice Hall, 1988.

[13] Transputer Reference Manual, INMOS Limited, Prentice Hall, 1988.

[14] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd edition, The Johns Hopkins University Press, Baltimore and London, 1989.

[15] R. Ralha, Parallel One-sided Householder Transformations for Eigenvalues Computation, Proceedings of the Euromicro Workshop on Parallel and Distributed Processing (Gran Canaria, January 27-29), IEEE Computer Society Press, 1993.