
A New Algorithm For Singular Value Decompositions

R. M. S. Ralha, Departamento de Matemática e Informática

Universidade da Beira Interior, 6200 Covilhã, Portugal

Abstract

A new algorithm for computing the singular value decomposition of a matrix is proposed. Given a matrix A ∈ R^{m×n}, m > n, the algorithm consists of n−2 Householder transformations A_r ← A_{r-1} H_r such that T := A_{n-2}^T A_{n-2} is tridiagonal. At this point, we carry out the Choleski decomposition T = B^T B and use the standard Golub-Kahan SVD iterative step with the bidiagonal matrix B. Our bidiagonalization algorithm is more attractive for parallel processing than the classical methods and it is interesting to observe that it appears to be very competitive even for sequential processing.

keywords: parallel algorithms, singular values, Householder transformations.

1 Introduction

With the advent of parallel computers, the Jacobi family of methods for computing eigenvalues and singular values has received much attention. Parallel Jacobi algorithms have been used in real-time signal processing problems arising in statistical analysis, control theory and image processing (see, for instance, [4] and [5]). The current interest in these methods would be much less justified (at least for some of the parallel architectures currently in use) if methods based on the reduction to canonical forms achieved with Householder transformations could be efficiently implemented on parallel machines.

The best serial methods for computing the singular value decomposition (SVD) of a matrix use Householder Bidiagonalization or R-Bidiagonalization ([8], p. 237-239) followed by the Golub-Kahan SVD Step ([8], p. 433). However, these methods do not lend themselves very well to implementation on parallel computers.

In this paper a new technique for the bidiagonalization step is proposed which is better suited for parallel computation than the standard algorithms; furthermore, we found that for certain sizes of the matrix to be reduced to the bidiagonal form, the new algorithm requires less arithmetic than the usual methods, i.e., it can be very attractive even for sequential processing.

The central idea of this new method is to achieve the transformation of the given rectangular matrix A by dealing only with columns of A, i.e., unlike the classical methods (briefly described in the next section), our algorithm does not act, at least in an explicit manner, upon the rows of the matrix.

2 The standard bidiagonalization methods

The SVD of an m × n matrix A is:

    U^T A V = diag(σ_1, ..., σ_n)    (1)

where σ_i ≥ 0 are the singular values of A and U (m × m) and V (n × n) are orthogonal matrices (we will assume that m ≥ n).

A major step in the direction of the decomposition (1) is the bidiagonalization procedure:

    U_B^T A V_B = [ B ]
                  [ 0 ]    (2)

where B ∈ R^{n×n} has the following upper bidiagonal form:

        [ x  x  0  ...  0 ]
        [ 0  x  x  ...  0 ]
    B = [ .     .  .    . ]    (3)
        [ 0  ...    x  x  ]
        [ 0  ...  0    x  ]

The orthogonal matrices U_B and V_B are determined as a product of Householder matrices, U_B = U_1 ... U_n and V_B = V_1 ... V_{n-2}. In general, U_k introduces zeros into the k-th column, while V_k zeros the appropriate entries in row k; for instance, if m = 5 and n = 4 we would have the following sequence of transformations:

[Display: the sequence of 5×4 zero patterns produced by applying U_1, V_1, U_2, V_2, U_3 and U_4 in turn; each U_k annihilates the entries of column k below the diagonal, and each V_k annihilates the entries of row k to the right of the superdiagonal.]




A more comprehensive description of this method, which was first described by Golub and Kahan in [2], can be found in [8], 236ff. This bidiagonalization technique requires 4mn² − 4n³/3 flops and, if the matrices U_B and V_B are explicitly desired, they can be accumulated in 4m²n − 4n³/3 and 4n³ flops, respectively.
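To make the classical procedure concrete, the following is a minimal NumPy sketch of this two-sided Householder bidiagonalization (our illustration, not code from the paper; it assumes the reflected subvectors are nonzero and makes no attempt at accumulating U_B and V_B):

import numpy as np

def house(x):
    # Householder vector v and beta = 2/(v^T v) such that
    # (I - beta v v^T) x = -sign(x(1)) ||x||_2 e_1
    v = np.array(x, dtype=float)
    v[0] += (np.sign(x[0]) if x[0] != 0 else 1.0) * np.linalg.norm(x)
    return v, 2.0 / (v @ v)

def golub_kahan_bidiagonalize(A):
    # Left reflections U_k zero column k below the diagonal;
    # right reflections V_k zero row k to the right of the superdiagonal.
    A = np.array(A, dtype=float)
    m, n = A.shape
    for k in range(n):
        v, beta = house(A[k:, k])
        A[k:, k:] -= beta * np.outer(v, v @ A[k:, k:])
        if k < n - 2:
            v, beta = house(A[k, k + 1:])
            A[k:, k + 1:] -= beta * np.outer(A[k:, k + 1:] @ v, v)
    return A

B = golub_kahan_bidiagonalize(np.random.default_rng(0).standard_normal((5, 4)))
print(np.allclose(np.tril(B, -1), 0) and np.allclose(np.triu(B, 2), 0))   # True: B is upper bidiagonal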

When m >> n it is advantageous to upper triangularize A first, before using the bidiagonalization method. See [8], p. 238, for the details of this algorithm, which is referred to as R-Bidiagonalization. Since it involves 2mn² + 2n³ flops, R-Bidiagonalization is more economical whenever m ≥ 5n/3.
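The threshold can be checked in one line (this derivation is ours, not in the original):

    2mn² + 2n³ ≤ 4mn² − 4n³/3  ⟺  10n³/3 ≤ 2mn²  ⟺  m ≥ 5n/3.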

3 The new method

There is much interest at present in the Jacobi-Hestenes method (see for instance [6] and [7]) in which a matrix V is found as a product of plane rotations such that the columns of AV = Q are orthogonal. A normalization of the columns of Q gives the matrix U, i.e., the required decomposition results:

    AV = U diag(σ_1, ..., σ_n)    (4)

From (4) we get:

    V^T (A^T A) V = diag(σ_1², ..., σ_n²)    (5)

and this gives a clear relationship between the SVD of a matrix A and the Schur decomposition of the symmetric matrix A^T A.

The link between Hestenes' method (for the SVD) and Jacobi's (for the diagonalization of a symmetric matrix) is clear: the first one consists in applying on the right side of A the transformations that, if used on both sides of A^T A, would produce a diagonal matrix with the squares of the singular values of A on the diagonal. However, the product A^T A is not explicitly computed.
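As a quick numerical illustration of this relationship (our example, not from the paper), one can check in NumPy that the eigenvalues of A^T A are the squares of the singular values of A:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
sigma = np.linalg.svd(A, compute_uv=False)           # singular values of A, descending
lam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]     # eigenvalues of A^T A, descending
print(np.allclose(sigma**2, lam))                    # True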

Similarly, our algorithm consists in applying on the right side of A the Householder transformations that, if applied on both sides of A^T A, would produce a tridiagonal matrix T, i.e., we have:

    (H_1 ... H_{n-2})^T (A^T A) (H_1 ... H_{n-2}) = T    (6)

or, equivalently,

    T = (A H_1 ... H_{n-2})^T (A H_1 ... H_{n-2})    (7)

However, we do not compute the product A^T A and do not carry out the transformation expressed in (6); instead, we do only post-multiplications with the Householder matrices H_r, as expressed below, where A_0 represents the initial matrix A:

    A_r ← A_{r-1} H_r,   r = 1, ..., n−2    (8)

We now compute the tridiagonal form T := A_{n-2}^T A_{n-2}. If we applied the implicit-shift QR algorithm we could get the Schur decomposition of T:

    Q^T T Q = diag(σ_1², ..., σ_n²)    (9)

and the product A_{n-2} Q would have orthogonal columns, as required. However, working with T can lead to loss of information because the condition of the tridiagonal matrix T is the square of the condition of the initial matrix A. It is for this reason that the Golub-Kahan SVD algorithm ([8], 430ff) consists in the transformation

    U^T B Q = diag(σ_1, ..., σ_n)    (10)

where Q is the orthogonal matrix (a product of rotations) that would be used if the symmetric QR algorithm had been used to produce the transformation (9).

Our proposal is to carry out the Choleski decomposition

    T = B^T B    (11)

which completes our bidiagonalization procedure. The Golub-Kahan SVD Step can now be used in the usual way with this matrix B.
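The following sketch (ours; it assumes SciPy is available and that A has full column rank, so that T is positive definite and the Choleski factor exists) illustrates the two claims made here: the Choleski factor B of a tridiagonal T is upper bidiagonal, and its singular values are exactly those of A. Below, T is obtained by explicitly reducing A^T A, only as a stand-in for the tri-orthogonalization of Section 4, which produces the same T without ever forming A^T A.

import numpy as np
from scipy.linalg import hessenberg, cholesky

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))        # full column rank (generically), so A^T A is s.p.d.

# Stand-in for the tri-orthogonalization: T is orthogonally similar to A^T A and tridiagonal.
T = hessenberg(A.T @ A)

B = cholesky(T, lower=False)           # T = B^T B, the Choleski decomposition (11)
print(np.allclose(np.triu(B, 2), 0))   # True: B is upper bidiagonal
print(np.allclose(np.linalg.svd(B, compute_uv=False),
                  np.linalg.svd(A, compute_uv=False)))   # True: same singular values as A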

4 Householder tri-orthogonalization

As can be understood from what has been said in the previous section, the bulk of the computational work in our bidiagonalization technique lies in the "tri-orthogonalization" process given by (8). In this section we present the relevant details of this transformation.

Let us suppose that we had computed the symmetric matrix A^T A and start carrying out the tridiagonalization with Householder matrices, as expressed in (6). As is well known, these matrices have the form

    H = I − (2 / v^T v) v v^T    (12)

and they are usually used to zero selected components of a vector ([8], 195ff). In particular, given a vector x, if we choose the Householder vector v:

    v = x + sign(x(1)) ‖x‖₂ e_1    (13)

it results that Hx is a multiple of e_1, the first column of the identity matrix, i.e., Hx is zero in all but the first component.
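A minimal sketch of this construction (our illustration; it assumes x is not the zero vector):

import numpy as np

def house(x):
    # Householder vector v of (13) and beta = 2/(v^T v) of (12)
    x = np.asarray(x, dtype=float)
    v = x.copy()
    v[0] += (np.sign(x[0]) if x[0] != 0 else 1.0) * np.linalg.norm(x)
    return v, 2.0 / (v @ v)

x = np.array([3.0, 1.0, 5.0, 1.0])
v, beta = house(x)
print(x - beta * (v @ x) * v)      # ~ [-6, 0, 0, 0]: H x is a multiple of e_1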

With H_1 built as described by (12) and (13) for x_1, the vector of the (n−1) last components of the first column (or row) of A^T A, we would get:



    (A^T A) H_1 =

        [ x  x  0  0  ...  0 ]
        [ x  x  x  x  ...  x ]
        [ x  x  x  x  ...  x ]    (14)
        [ :  :  :  :       : ]
        [ x  x  x  x  ...  x ]

However, as has been said before, we do not compute A^T A, nor do we carry out the Householder pre-multiplication. We only need to compute the last (n−1) elements of the first row of A^T A to form H_1, which we use to produce A_1 = A H_1; the zeros in the first row of A_1^T A_1 in (14) imply that we have, for the columns a_i of A_1 = [a_1 a_2 a_3 ... a_n]:

    a_1^T a_i = 0  for i = 3, ..., n    (15)

In general, in the r-th step, the computation of the Householder matrix H_r involved in the transformation (8) requires the (n−r) last elements of the r-th column of A_{r-1}^T A_{r-1} and, since we do not have this product explicitly formed, we need to compute the (n−r) sdots with the appropriate columns of A_{r-1}. Let x_r be this (n−r) vector.

From x_r we compute v_r, in the way given by (13), and the Householder matrix (12). The transformation (8) can be written as

    A_r ← A_{r-1} − β_r (A_{r-1} v_r) v_r^T,   r = 1, ..., n−2    (16)

The columns a_i of A_{r-1} are such that

    a_i^T a_j = 0  for i = 1, ..., r−1,  j = i+2, ..., n    (17)

For each r, the transformation (16) will act upon the last n−r columns, i.e., the first r columns remain unchanged; on completion of (16), the property (17) will extend to i = r.

The complete transformation can be expressed in the following way:

For r = 1 to n−2, compute:
1) the vector x_r of the last (n−r) elements of the r-th row of A_{r-1}^T A_{r-1};
2) the Householder vector v_r;
3) the scalar β_r = 2 / (v_r^T v_r);
4) the matrix-vector product A_{r-1} v_r;
5) the scaled vector w_r := β_r (A_{r-1} v_r);
6) the update A_r = A_{r-1} − w_r v_r^T.

Steps 4 and 6 are responsible for the major part of the computational work involved in the transformation; for reasons that will become clear when we discuss the parallel organization of the algorithm, we will express both the matrix-vector multiplication A_{r-1} v_r and the update A_r = A_{r-1} − w_r v_r^T in terms of saxpy operations. If, for each r, a_j represents the j-th column of the current matrix A, v is the vector v_r, and x(j) and v(j) are the j-th components of the vectors x_r and v_r, respectively, we have:

Algorithm 1 (Householder tri-orthogonalization):

For r = 1 to n−2
    For j = r+1 to n
        compute (sdot) x(j) := a_r^T a_j
    end j
    compute the Householder vector v
    compute β := 2 / (v^T v)
    with initial w = 0:
    For j = r+1 to n
        compute (saxpy) w := w + v(j) a_j
    end j
    scale w := β w
    For j = r+1 to n
        compute (saxpy) a_j := a_j − v(j) w
    end j
end r
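A compact NumPy transcription of Algorithm 1 may help to fix the indexing (our sketch, using the Householder construction of (12)-(13); the dense matrix-vector products below stand for the sdot/saxpy loops of the algorithm):

import numpy as np

def tri_orthogonalize(A):
    # On return, T = A^T A is tridiagonal (property (17) holds for all i).
    A = np.array(A, dtype=float)
    m, n = A.shape
    for r in range(n - 2):                       # r = 1, ..., n-2 in the paper's numbering
        x = A[:, r + 1:].T @ A[:, r]             # step 1: sdots x(j) = a_r^T a_j
        if np.linalg.norm(x[1:]) < 1e-14 * max(np.linalg.norm(x), 1.0):
            continue                             # nothing to annihilate in this step
        v = x.copy()                             # step 2: Householder vector, as in (13)
        v[0] += (np.sign(x[0]) if x[0] != 0 else 1.0) * np.linalg.norm(x)
        beta = 2.0 / (v @ v)                     # step 3
        w = beta * (A[:, r + 1:] @ v)            # steps 4-5: w = beta * A_{r-1} v
        A[:, r + 1:] -= np.outer(w, v)           # step 6: a_j := a_j - v(j) w
    return A

A = np.random.default_rng(2).standard_normal((7, 5))
An = tri_orthogonalize(A)
T = An.T @ An
print(np.allclose(np.triu(T, 2), 0) and np.allclose(np.tril(T, -2), 0))   # True: T is tridiagonal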

5 Counting the flops

Since the Choleski decomposition (11) involves only 3n flops and n square roots, the arithmetic complexity of the new bidiagonalization algorithm is essentially dependent on the number of flops required by Algorithm 1; this requires (n+1)(n−2) saxpy and about half this number of sdot operations with vectors of length m, which gives (approximately) a total of 3mn² flops. The Golub-Kahan bidiagonalization and R-bidiagonalization take 4mn² − 4n³/3 and 2mn² + 2n³ flops, respectively, thus the second method is preferable whenever m ≥ 5n/3 ([8], p. 238-239); it is therefore interesting to observe that our algorithm involves less computation than any of the standard methods for m ∈ (4n/3, 2n); more precisely, it is more economical than the Golub-Kahan algorithm if m > 4n/3 and more economical than R-bidiagonalization when m < 2n.
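The endpoints of this interval follow directly from the three counts (the comparison is ours, not spelled out in the original):

    3mn² < 4mn² − 4n³/3  ⟺  m > 4n/3,        3mn² < 2mn² + 2n³  ⟺  m < 2n.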

6 Parallel Bidiagonalization

Although we have just concluded that the new method is very competitive with the standard algorithms, we are more interested in the advantages of our method for implementation on a multiprocessor with distributed memory.

Since Algorithm 1 is almost entirely expressed in terms of saxpy of full-length vectors (the columns of A), we suggest that a good decision concerning the partition of the matrix is to do it simply by blocks of rows, i.e., each one of p processors will store in its local memory n vectors of length m/p; the load balancing is good and no interprocessor communication is required as far as the computation of each saxpy is concerned.

It must be noted that as the transformation progresses, i.e., as r approaches n−2, the number of columns of A we are dealing with becomes smaller (in the r-th step we work with n−r columns). Since it is the number of columns that is reduced during the transformation and not the size m of each column, our partitioning strategy avoids the load imbalance that would result if a less adequate distribution of the matrix had been done.

¹ In [9] we proposed a different algorithm for eigenvalue computation that can also be expressed in terms of saxpy of the columns of two matrices, whereas in the present case we are dealing with one single matrix.

For the computation of the vector x_r in each step, things are not so perfect, but the overheads of the parallel algorithm are still quite small under very reasonable assumptions. In each of the n−2 steps, to compute the n−r sdots (in the r-th step) required, each processor computes its contribution with the segments of size m/p of the vectors it holds; following this, the processors must cooperate in such a way that the global sdots are produced and every processor gets a copy of these values.

We believe that one major advantage of our parallel algorithm is its simplicity and portability: since only the sdots require cooperation of the various processors, the parallel code is essentially the sequential code with a procedure to compute the sdots in parallel. This procedure, which we chose to designate GLOBAL.SDOT, must be inserted in the parallel code at the point where the sdots appear in the sequential code. This is expressed in Algorithm 2, which is the parallel version of the sequential Algorithm 1; in the parallel algorithm, a_j represents the local data (a vector of size m/p) of the j-th column of the current matrix A:

Algorithm 2 (PARALLEL tri-orthogonalization):

For PE = 1 to p (each processor does the same)
    For r = 1 to n−2
        For j = r+1 to n
            GLOBAL.SDOT (x(j) := a_r^T a_j)
        end j
        compute the Householder vector v
        compute β := 2 / (v^T v)
        with initial w = 0:
        For j = r+1 to n
            compute (saxpy) w := w + v(j) a_j
        end j
        scale w := β w
        For j = r+1 to n
            compute (saxpy) a_j := a_j − v(j) w
        end j
    end r
end PE

The procedure GLOBAL.SDOT encapsulates the only part of the code where interprocessor communication is required, and the relative cost of its implementation depends on the connectivity of the network. We point out that in order to run our parallel algorithm with different processor configurations we just need to include the right version of GLOBAL.SDOT, which reflects the particular configuration used.

7 Efficiency analysis

The optimal processor configuration to implement the procedure GLOBAL.SDOT is a tree: in a binary tree with p processors, each sdot can be carried out and the result sent to each processor in O(log₂ p) steps. However, we will show now that parallel Algorithm 2 will be efficient under reasonable conditions, even for networks with poor connectivity.

Let us suppose that we have nothing better than a simple chain of processors. The procedure GLOBAL.SDOT can be the following: the first processor in the chain sends its contribution, say s_1, to the next processor, which adds s_1 to s_2 (the local contribution for the sdot) and sends the result to the next processor, and so forth to the last processor, which produces the final value of the sdot and sends it back to the previous processor; every processor will get the required value and will send it to the previous processor.
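A possible realization of this chain procedure is sketched below using mpi4py (our sketch; the name global_sdot and the message tags are ours, and each caller is assumed to pass the dot product of its local segments, e.g. float(a_r_local @ a_j_local)):

from mpi4py import MPI

def global_sdot(local_contribution, comm=MPI.COMM_WORLD):
    # Chain version of GLOBAL.SDOT: partial sums travel forward along the chain,
    # the last processor forms the total, and the total travels back so that
    # every processor ends up with the same value.
    rank, size = comm.Get_rank(), comm.Get_size()
    total = local_contribution
    if rank > 0:                                   # receive the partial sum from the previous processor
        total += comm.recv(source=rank - 1, tag=0)
    if rank < size - 1:
        comm.send(total, dest=rank + 1, tag=0)     # pass the partial sum forward
        total = comm.recv(source=rank + 1, tag=1)  # receive the completed sum on the way back
    if rank > 0:
        comm.send(total, dest=rank - 1, tag=1)     # pass the completed sum backward
    return total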

How large must m be, compared to p, for our parallel algorithm to be efficient? Representing by t_flop the time taken by a floating-point operation (assumed constant in our model) and by t_com the time required to send a floating-point number from one processor to a neighbour (the previous one or the next one in the chain), we get for the cost (time) of the operation that starts with the first processor sending s_1 and ends with the same processor receiving the required value of the sdot:

    (p − 1)(t_flop + t_com) + (p − 1) t_com    (18)

Since we are repeating this operation for each one of the (n+1)(n−2)/2 sdots, we get the following estimate for the overhead due to the parallel computation of the sdots:

    p (n²/2) (t_flop + 2 t_com)    (19)

and this overhead will be small if the ratio m/p is reasonably large. However, there is another source of inefficiency since in Algorithm 2 each processor computes the same Householder vector v and repeats the scaling w := βw; from (13) we see that the vectors x and v only differ in the first component, and to compute v(1) it is necessary to find the norm

    ‖x‖₂ := ( [x(1)]² + ... + [x(n−r)]² )^{1/2}    (20)

and in the r-th step this involves about 2(n−r) flops, thus the overhead due to this redundancy in the computations is approximately equal to pn² t_flop. The computation of β takes a negligible amount of arithmetic since we have

    β = 2 / (v^T v) = 1 / [ ‖x‖₂² + |x(1)| ‖x‖₂ ]    (21)

and the norm inside the brackets has already been calculated in (20). The product βw, carried out by each processor in each step, contributes another pn² flops. The total overhead caused by this redundancy is approximately equal to 2pn² t_flop.

With this estimate and the one given by (19), we have for the efficiency E_f of our parallel algorithm the following expression:

    E_f = 3mn² t_flop / [ 3mn² t_flop + (2pn² + pn²/2) t_flop + pn² t_com ]



        = 1 / [ 1 + (p/m) ( 5/6 + t_com / (3 t_flop) ) ]    (22)

The number n of columns is irrelevant here: our algorithm will be efficient provided m/p is large enough, as we said before; for instance, if the fundamental, hardware-dependent ratio t_com/t_flop is as large as 10, we get, from (22), an efficiency estimate in excess of 90% when m/p > 40.
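Expression (22) is easy to evaluate; the small helper below (ours) reproduces the figure quoted in the text:

def efficiency_estimate(m_over_p, tcom_over_tflop):
    # E_f from (22): depends only on the ratios m/p and t_com/t_flop
    return 1.0 / (1.0 + (1.0 / m_over_p) * (5.0 / 6.0 + tcom_over_tflop / 3.0))

print(efficiency_estimate(40, 10))   # ~0.906, i.e. above 90% for m/p = 40 and t_com/t_flop = 10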

8 Conclusions

We have proposed a new algorithm for computing the singular value decomposition of a rectangular matrix A (m × n) and have shown that it is competitive with the best methods known. The new method is much better suited for parallel processing than the standard techniques since the major part of the computations can be expressed in terms of operations with full columns of A.

The parallel "tri-orthogonalization" algorithm we have proposed is simple and does not require a complex interprocessor connectivity. We carried out a detailed analysis showing that even for a simple chain of processors the parallel method can be very efficient provided the number m of rows is large enough compared with the number p of processors.

Based on this analysis, we believe that the new parallel method has the desirable property of scalability, i.e., the efficiency does not necessarily go down when we add more processors to the system, provided we maintain the ratio m/p large enough.

We are currently implementing the parallel algorithm on a multiprocessor machine (based on transputers) and results will be published elsewhere.

References

[1] M. R. Hestenes, Inversion of matrices by biorthogonalization and related results, J. Soc. Indust. Appl. Math., vol. 6, p. 51-90, 1958.

[2] G. H. Golub and W. Kahan, Calculating the Singular Values and Pseudo-Inverse of a Matrix, SIAM J. Num. Anal. 2 (1965), p. 205-224.

[3] J. C. Nash, A one-sided transformation method for the singular value decomposition and algebraic eigenproblem, The Computer Journal, vol. 18, n. 1 (1975), p. 74-76.

[4] Large Scale Eigenvalue Problems, Proceedings of the IBM Europe Institute Workshop on Large Scale Eigenvalue Problems held in Oberlech, Austria, July 8-12, 1985, edited by J. Cullum and R. A. Willoughby, North-Holland, 1986.

[5] SVD and Signal Processing, Algorithms, Applications and Architectures, E. F. Deprettere (editor), Elsevier Science Publishers B. V. (North-Holland), 1988.

[6] M. Annaratone, Singular Value Decomposition on Warp, in [5], p. 425-438.

[7] L. M. Ewerbring and F. T. Luk, Computing the Singular Value Decomposition on the Connection Machine, in [5], p. 407-424.

[8] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd edition, The Johns Hopkins University Press, Baltimore and London, 1989.

[9] R. Ralha, Parallel One-sided Householder Transformations for Eigenvalues Computation, IEEE Proc. Euromicro Workshop on Parallel and Distributed Processing, 1993.
