


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. 36, NO. 4, APRIL 1989, p. 553

Concurrent Forms of Signal Processing Algorithms

WILLIAM A. PORTER, SENIOR MEMBER, IEEE

Abstract - This study is concerned with forms of algorithms that facilitate rapid processing on affiliated systolic arrays. One thrust of the development identifies classes of linear maps A: E^n → E^n that can be computed on p × p arrays at speed O(p) where p = √n. The array architectures which provide the requisite computational support are identified. A second thrust considers the expansion of arbitrary linear maps in terms of the fast maps. The results include a definitive method for minimal expansions and for best approximations of an a priori order. The study closes with a detailed comparative example which illustrates the principles in question.

I. INTRODUCTION

THIS STUDY is motivated by the many signal extraction, parameter estimation, and automatic control algorithms which utilize matrix/vector operations. When dimensionality is large or iterations are numerous or the application requires real time processing, the speed of the basic matrix operations is a primary concern.

Several authors (see, for example, [1]-[3]) have pointed out that matrix operations can be performed using array structures. While the details of specific designs vary, the speed of matrix/vector operations is proportional to the size of the matrices involved. For n × n matrices a matrix/vector operation, using n multipliers in parallel, can be achieved in O(n) multiplication time units. For even moderately large n, this computational speed may not be adequate for real time applications.

It is of interest to take note of a special class of linear algorithms which permit computation in O(√n) time. This class of algorithms includes several important members such as the DFT and the circulant matrices. In an earlier study [4] it has been shown that every linear algorithm can be decomposed into a finite sum of such special class algorithms. This decomposition is referred to as a concurrent triple product, CTP, decomposition. Each CTP provides the basis for speed versus hardware comparisons and for comparisons with conventional array processing.

The principle on which the CTP decomposition is based is illustrated by the following example. Let x be an n-dimensional vector with n = pq. Let X be a p × q matrix formed by mapping the vector on a two dimensional array

Manuscript received August 18, 1987; revised March 4, 1988 and July 18, 1988. This work was supported in part by SDIO/IST-ARO under Contract D24962-MA SDI. This paper was recommended by Associate Editor R. Ansari.

The author is with the Department of Electrical Engineering, Louisiana State University, Baton Rouge, LA 70803.

IEEE Log Number 8826272.

(for example using segments of x as columns of X). Define the linear operations

$$y = Ax, \qquad Y = LXR.$$

Using conventional arrays the matrix-vector operation y = Ax takes O(pq) time. Using similar arrays the matrix-matrix operation Y = LXR can be computed in O(p + q) time [4]. The relative speed of the two types of linear operations (on the same data, x versus X formats) is thus apparent. Clearly the triple matrix product can be up to √n times faster than the matrix-vector multiplication.

It was noted above that there are well-known cases in which both linear operations, Ax and LXR, represent the same algorithm. One such case is the n-point DFT algorithm. The triple matrix form can be derived using Good's prime factor technique [5]. Our interest, however, is with arbitrary linear maps and a more powerful version of the CTP representation. In particular we establish minimal fast CTP forms for arbitrary linear maps.
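The speed comparison rests on a standard identity: under column-stacking, the triple product Y = LXR and the matrix-vector product y = Ax with A = Rᵀ ⊗ L (Kronecker product) represent the same linear map. The following minimal sketch, in Python with NumPy and randomly generated data (all names illustrative, not from the original paper), checks this equivalence numerically:

```python
import numpy as np

p, q = 4, 6
n = p * q
rng = np.random.default_rng(0)
L = rng.standard_normal((p, p))
R = rng.standard_normal((q, q))
x = rng.standard_normal(n)

# tau: consecutive length-p segments of x become the columns of X
X = x.reshape(q, p).T          # p x q; X[:, j] = x[j*p:(j+1)*p]
Y = L @ X @ R                  # the O(p + q) triple product on an array

# The same linear map as a single pq x pq matrix acting on x:
# vec(LXR) = (R^T kron L) vec(X), with vec = column stacking and vec(X) = x.
A = np.kron(R.T, L)
y = A @ x
assert np.allclose(y, Y.T.reshape(-1))   # vec(Y), column-major
```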

II. SOME RESULTS ON CTP DECOMPOSITIONS

Let E^n denote the usual Euclidean space (real or complex). The data segments to be processed will be represented as vectors x ∈ E^n where n corresponds to the number of points to be processed. A linear algorithm, A, operating on the data points will be representable by a matrix acting on the vector x. For simplicity we shall slur over the distinction between the algorithm A and its matrix. We also consider A to be square. Hence the algorithm is described as a matrix-vector operation; to summarize,

(1) y = A x , x , y E E".

Let n = pq and assume a fixed (but arbitrary) reformat of n × 1 vectors into p × q matrices, in short a congruence τ: E^n → E^{p×q}. We continue the notation X = τ[x] where convenient and let τ⁻¹[X] = x denote the obvious inversion of τ. The matrices L^a ∈ E^{p×p} and R^a ∈ E^{q×q}, a = 1,…,k, constitute a k term CTP decomposition of A ∈ E^{n×n} provided

$$Ax = \tau^{-1}\Big[\sum_{a=1}^{k} L^a\,\tau[x]\,R^a\Big], \qquad \text{all } x \in E^n. \tag{2}$$

We write A = xaLa 0 R" with the meaning of (2). The results of [4] include the following. Lemma 1: Every matrix A E EPqxPq admits a k term

CTP decomposition with k < min { p 2 , q 2 } .
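Lemma 1 has a transparent constructive reading: pairing each block of A with an elementary scalar matrix yields an exact p²-term CTP sum. A small NumPy sketch of this construction (illustrative only, and not the minimal expansion developed below):

```python
import numpy as np

p = q = 3
n = p * q
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))

# Partition A into a p x p grid of q x q blocks; pair block A_ij with the
# scalar matrix E_ij (1 in position (i, j), 0 elsewhere).  Then
# A = sum_ij kron(E_ij, A_ij) is a p^2-term CTP decomposition.
terms = []
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p)); E[i, j] = 1.0
        A_ij = A[i*q:(i+1)*q, j*q:(j+1)*q]
        terms.append((A_ij, E))          # (L^a, R^a) in the sense of (3)

A_rebuilt = sum(np.kron(E, Lq) for (Lq, E) in terms)
assert np.allclose(A, A_rebuilt)         # k = p^2 <= min{p^2, q^2} terms suffice
```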


For familiarity and to facilitate Section III we digress concerning the implications of Lemma 1. First we note that A = L ∘ R, that is a single term CTP, iff A has the block partitioned form

$$A = \begin{bmatrix} r_{11}L & \cdots & r_{1p}L \\ \vdots & & \vdots \\ r_{p1}L & \cdots & r_{pp}L \end{bmatrix} \tag{3}$$

where R = [r_{ij}]. Similarly A = Σ_a L^a ∘ R^a iff A is the summation of matrices all partitioned in this form.

A second issue considered in [4] concerns the best approximation of an arbitrary matrix with a single term CTP. To review this result let {A_{ij} ∈ E^{q×q}: i, j = 1,…,p} be arbitrary and a priori matrices of an arbitrary dimension. Let L ∈ E^{q×q} denote an arbitrary matrix and let the scalars {r_{ij}: i, j = 1,…,p} also be arbitrary. If A ∈ E^{pq×pq} is the matrix with block partitions {A_{ij}} and if R is the matrix with entries {r_{ij}} then the block partitions of A − L ∘ R are given by

$$A_{ij} - r_{ij}L, \qquad i, j = 1,\dots,p.$$

We shall need a dot product on E^{q×q} and for this several choices are appropriate. The Hilbert-Schmidt inner product [6, sect. XI] is one possibility. It is more intuitive, however, to use the congruence τ⁻¹: E^{q×q} → E^{q²} and access the natural (and equivalent) inner product on E^{q²}.

To illustrate these conventions, and set the stage for later developments, consider the matrix A with block partitions {A_{ij}}. The dot product between two blocks is given by

$$\langle A_{m,n}, A_{k,l}\rangle = \operatorname{tr} A_{m,n}^{*} A_{k,l} = \sum_{\alpha,\beta} \bar a_{\alpha\beta}^{\,m,n}\, a_{\alpha\beta}^{\,k,l}$$

where a_{αβ}^{i,j} is the α, β entry of the block matrix A_{i,j}. In a following lemma we introduce the Grammian of the matrix blocks {A_{i,j}} using this dot product notation.


Our attention turns then to the functional

$$J(L, R) = \sum_{ij} \|A_{ij} - r_{ij}L\|^2. \tag{5}$$

In the following paragraphs we summarize the minimization of J with respect to L, R. This also reveals the structure necessary to solve the best k term CTP approximation of an arbitrary A.

We note first that L ∘ R is magnitude homogeneous in both L and R. The constraint

$$\|R\|^2 = \sum_{ij} r_{ij}^2 = 1$$

can be imposed without loss of generality. This constraint resolves a companion ambiguity in the variational equations.

Lemma 2: The matrices L^Δ, R^Δ which minimize J(L, R) are given by

$$L^\Delta = \sum_{i,j} r_{ij}^\Delta A_{ij} \tag{6}$$

where τ⁻¹[R^Δ] is a unit eigenvector associated with a maximum eigenvalue of the Grammian matrix Γ = [⟨A_{m,n}, A_{k,l}⟩].

In forming Γ we use the ordering on the subscript pairs m, n and k, l induced by the congruence τ⁻¹ acting on R. Examples in a later section will demonstrate this point.

Corollary: The CTP realization Σ_a L^a ∘ R^a is minimal for J provided {τ⁻¹[R^a]: a = 1,…,k} is an orthonormal set of eigenvectors affiliated with the k largest eigenvalues of Γ and each L^a is computed from R^a via (6).

The proof of this corollary follows from the Hermitian nature of Γ. In short, the eigenvectors of Γ are pairwise orthogonal. The matrices R^α, R^β are constructed from the αth and βth eigenvectors of Γ, respectively, and retain this orthogonality in the trace dot product. Thus the expansion Σ_a L^a ∘ R^a has the convergence and best approximation properties one anticipates in an orthonormal expansion. (See Appendix for further details.)

We note that our partitioning of A can take the form of p², q × q blocks or q², p × p blocks, whichever is minimal. Consequently Γ has p² (respectively, q²) eigenvalues. The maximal order condition k ≤ min{p², q²} of Lemma 1 follows from this.
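The minimization just summarized is directly implementable. The sketch below is a NumPy rendering of Lemma 2 and its corollary under the block convention of (3) (the function name and test data are illustrative): it builds the Grammian Γ of the blocks, extracts the k dominant eigenvectors, forms each L^a via (6), and checks that the squared approximation error equals the sum of the discarded eigenvalues.

```python
import numpy as np

def ctp_expand(A, p, q, k):
    """Best k-term CTP approximation of A (pq x pq), per Lemma 2 and its corollary."""
    # p x p grid of q x q blocks, flattened in row-major (i, j) order
    blocks = [A[i*q:(i+1)*q, j*q:(j+1)*q] for i in range(p) for j in range(p)]
    m = p * p
    # Grammian of the blocks under the trace inner product
    G = np.array([[np.trace(Bm.T @ Bn) for Bn in blocks] for Bm in blocks])
    w, V = np.linalg.eigh(G)                   # eigenvalues in ascending order
    terms = []
    for a in range(1, k + 1):
        r = V[:, m - a]                        # unit eigenvector, a-th largest eigenvalue
        L = sum(r[s] * blocks[s] for s in range(m))    # eq. (6)
        terms.append((L, r.reshape(p, p)))
    resid = w[: m - k].sum()                   # sum of the discarded eigenvalues
    return terms, resid

rng = np.random.default_rng(2)
p = q = 3
A = rng.standard_normal((p*q, p*q))
for k in (1, 2, p*p):
    terms, resid = ctp_expand(A, p, q, k)
    A_hat = sum(np.kron(R, L) for (L, R) in terms)
    assert np.isclose(np.sum((A - A_hat)**2), resid, atol=1e-8)
```

For k = p² the reconstruction is exact, in agreement with Lemma 1.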

To round out our summary of existing results we take note of some developments in array processing. Our attention is focused on the cylindrical arrays, discussed in [4], [7], [8], which are very compatible with the CTP requirements.

Consider then the array depicted in Fig. 1(a) and (b), which computes the 4 × 4 matrix-matrix product LX. The hexagonal figures denote local memory wherein elements of the matrix L are stored as indicated in the figure. The columns of X are transmitted down the longitudinal paths. Each node multiplies the longitudinal input by the scalar in its register. This product is added to the input arriving along the transversal (spiral) path and retransmitted transversally. Each node retransmits unaltered the longitudinal sequence. Fig. 1(a) depicts the computation at the start of the second step. Fig. 1(b) depicts the computation at the start of the last step.

For convenience we assume our array functions synchronously. It may be verified by inspection that the sequences available on the transversal paths, at the bottom of the array, are the rows of LX. We note also that the top row nodes of the array complete their computations simultaneously with the completion of calculation of the first row of LX by the bottom row nodes.

At the nth step (here n = 4) we switch the array as follows (see Fig. 2(a)). The (LX) row sequences are cycled around on the transversal paths (now closed orbits). The X row sequences are followed by R row sequences on the longitudinal paths. The node function changes as the new computation sweeps down the array. The new node function retransmits all input sequences unchanged while iteratively computing the dot product of these sequences. The dot product is retained at the node memory in the indicated fashion. The switch in function of the nodes propagates down the cylinder together with the first arrival of LX and R data. Fig. 2(b) depicts the situation as the computational wave front reaches the third row. It is verifiable by inspection that the components of LXR reside in the node memories at the 2nth step of this sequence. The i, j entries on the nodes of Fig. 2 denote the final location of (LXR)_{ij}.


Fig. 1. Cylindrical array computing LX; each hexagonal node holds the scalar l_{ij} in register. (a) Computation at the start of the second step. (b) Computation at the start of the last step.

III. MULTIMATRIX PRIMITIVES

In this section we introduce a generalization of the representation L ∘ R. For this we continue the conventions x ∈ R^n, X ∈ R^{p×p}, and X = τ[x] where n = p², introduced in Section II. Let x₁,…,x_p denote the columns of X. Let L₁,…,L_p be arbitrary p × p matrices. The map 𝓛 on R^{p×p} is computed by

$$\mathcal{L}[x_1 \cdots x_p] = [L_1 x_1, \dots, L_p x_p]. \tag{7}$$
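A functional (non-systolic) model of the primitive (7) is immediate; the following sketch (NumPy, function name illustrative) applies L_j to the jth column of X and confirms that the primitive collapses to the ordinary product LX when all the L_j coincide:

```python
import numpy as np

def script_L(Ls, X):
    """The multimatrix primitive of (7): column j of X is multiplied by its own L_j."""
    return np.column_stack([Lj @ X[:, j] for j, Lj in enumerate(Ls)])

p = 3
rng = np.random.default_rng(3)
Ls = [rng.standard_normal((p, p)) for _ in range(p)]
X = rng.standard_normal((p, p))

Y = script_L(Ls, X)
# With L_1 = ... = L_p = L the primitive is just the ordinary product LX.
assert np.allclose(script_L([Ls[0]] * p, X), Ls[0] @ X)
```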


Fig. 3. Computing Lx.
Fig. 4. Computing L_i x_i.


In the spirit of Section II we also define 𝓛 ∘ R using the formula

$$(\mathcal{L} \circ R)(x) = (\mathcal{L}X)R.$$

It is intuitive that 𝓛 ∘ R has the capability of representing a broader class of matrices. We shall explore this in Section IV. In this section, however, we demonstrate that 𝓛X can be computed at the same speed as LX; indeed, with the same number of multipliers and in fact the same architecture.

3.1. The Cyclic Cylindrical Architectures

For clarity and brevity we use the context of an example to present the crux of the development. In Fig. 3 we recall the architecture of Fig. 1. The hexagonal figures denote distributed memory (at the nodes) and the l_{ij} identifies the components of the matrix L stored at the location in question.

The components x₁, x₂, x₃, x₄ of the vector x are entered at the top of the cylinder and are transmitted laterally down the cylinder. Each node on the cylinder multiplies the lateral input, x, by its stored value, l_{ij}. This product is added to the variable received on the transversal (spiral) path. The augmented partial sum is then transmitted on the transversal path. It follows by inspection that the components of Lx constitute the results of this processing and are available at the bottom row of the cylinder.

Suppose then that each local memory is expanded to a memory queue. The ijth memory queue contains the l_{ij} component from each of the matrices L₁, L₂, L₃, L₄, respectively. The first wave of computations uses the components of x₁ and L₁. As the first computation is completed at the top row of the cylinder the vector x₂ is input and the L₂ data are used from the local memory. This change propagates down the rows as the first computation wave passes through each node. The process is repeated for (x₃, L₃), (x₄, L₄). At the fourth step, for example, the data status is depicted in Fig. 4. On that figure, l_{ij}^k refers to the ijth element of L_k while x_{ij} refers to the jth element of x_i (i.e., the ij element of X). The computations L_i x_i complete in sequence at the bottom row of the cylinder.

Fig. 5. Computing L_i x_i.

Our first example has emphasized the wavefront aspects of array computing. As a consequence the computing nodes do not start simultaneously; however, I/O occurs at very convenient locations and may be cycled around, as described in Section II, to facilitate right multiplication with the matrix R. Our second example demonstrates a 100-percent usage of all computing nodes and simultaneous initiation and completion of all the L_i x_i calculations.

Each node memory is now a revolving memory queue with data ordering l_{ij}^{i},…,l_{ij}^{p}, l_{ij}^{1},…,l_{ij}^{i-1}. At t = 1 the queues are set as in Fig. 4. Similarly the x_{ij} data are inserted at t = 1 as depicted in Fig. 4. The longitudinal and transversal paths are now closed as depicted in Fig. 5 (only 2 longitudinal and 2 transversal orbits are shown connected). The nodes function as before and the L_i x_i calculations start simultaneously.



The 𝓛X computations finish simultaneously, however, with one component of 𝓛X residing at each node.

Remark: In the preceding discussion flexible computing capabilities have been attributed to the array computing nodes. These capabilities, of course, imply added hardware complexity. The revolving memory queue capability implies commensurate register capacity (if L is 6 × 6 each register queue requires 6 capacity). The ability to dynamically switch the array implies tag bits (or equivalent software control) and affiliated node logic. These and other hardware implications (the node design, etc.) all require careful evaluation in any specific design process.

Example: To illustrate the representation 𝓛 ∘ R we turn to the familiar DFT. For n = 9 we use the sequence x(0), x(1),…,x(8) and its transform X(0), X(1),…,X(8) to assist in the representation. Using the notation W = exp{j2π/9} the DFT has the matrix representation

$$\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & W^3 & W^6 & W & W^4 & W^7 & W^2 & W^5 & W^8 \\
1 & W^6 & W^3 & W^2 & W^8 & W^5 & W^4 & W & W^7 \\
1 & 1 & 1 & W^3 & W^3 & W^3 & W^6 & W^6 & W^6 \\
1 & W^3 & W^6 & W^4 & W^7 & W & W^8 & W^2 & W^5 \\
1 & W^6 & W^3 & W^5 & W^2 & W^8 & W & W^7 & W^4 \\
1 & 1 & 1 & W^6 & W^6 & W^6 & W^3 & W^3 & W^3 \\
1 & W^3 & W^6 & W^7 & W & W^4 & W^5 & W^8 & W^2 \\
1 & W^6 & W^3 & W^8 & W^5 & W^2 & W^7 & W^4 & W
\end{bmatrix}$$

Anticipating the FFT format we have used a modulo 3 rearrangement of the input to the form (x(0), x(3), x(6), x(1), x(4), x(7), x(2), x(5), x(8)). After the induced column permutations on the input sequence and with K, M denoting the 3 × 3 matrices

$$K = \begin{bmatrix} 1 & 1 & 1 \\ 1 & W^3 & W^6 \\ 1 & W^6 & W^3 \end{bmatrix}, \qquad M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & W & 0 \\ 0 & 0 & W^2 \end{bmatrix}$$

it is readily verified that the 9-point DFT matrix takes the block partitioned form

$$\begin{bmatrix} K & MK & M^2K \\ K & W^3 MK & W^6 M^2K \\ K & W^6 MK & W^3 M^2K \end{bmatrix}$$

which is exactly the (𝓛X)R form with L₁ = K, L₂ = MK, L₃ = M²K and also R = K. Thus we see that the 𝓛 ∘ R representation emulates the spirit and substance of the classical decomposition of the DFT into the composition of DFT's of lower orders.
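The DFT claim can be checked numerically. The sketch below (NumPy; the permutation and factor matrices follow the text, the variable names are illustrative) builds the 9-point DFT matrix with W = exp{j2π/9}, applies the modulo-3 input rearrangement, and verifies that (𝓛X)K reproduces the transform:

```python
import numpy as np

W = np.exp(2j * np.pi / 9)                            # the paper's W
F = W ** np.outer(np.arange(9), np.arange(9))         # 9-point DFT, F[k, n] = W^{kn}

K = W ** (3 * np.outer(np.arange(3), np.arange(3)))   # 3-point kernel (W^3 = exp(j2pi/3))
M = np.diag([1, W, W**2])                             # twiddle factors
Ls = [K, M @ K, M @ M @ K]                            # L_1, L_2, L_3
perm = [0, 3, 6, 1, 4, 7, 2, 5, 8]                    # modulo-3 input rearrangement

rng = np.random.default_rng(4)
x = rng.standard_normal(9)
X = x[perm].reshape(3, 3, order='F')       # columns (x0,x3,x6), (x1,x4,x7), (x2,x5,x8)
LX = np.column_stack([Ls[j] @ X[:, j] for j in range(3)])
Y = LX @ K                                 # (script-L X) R with R = K
assert np.allclose(Y.flatten(order='F'), F @ x)       # matches the 9-point DFT
```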

3.2. The General Primitive

The final phase of the current development carries the L → 𝓛 generalization to its natural conclusion. In short, since right matrix multiplication is just left multiplication with suitable transpositions we anticipate the extension R → 𝓡 which is the transpose of our previous extension.

Indeed, let Z, R₁,…,R_p denote p × p matrices. Let z₁,…,z_p denote the rows of Z. The matrix Z𝓡 is constructed by using the row vectors z_i R_i, i = 1,…,p, as the rows of Z𝓡. In short

$$Z\mathcal{R} = \begin{bmatrix} z_1 R_1 \\ \vdots \\ z_p R_p \end{bmatrix}$$

and consequently (𝓛X)𝓡 and 𝓛(X𝓡) are both well defined.

It should be noted that (𝓛X)𝓡 ≠ 𝓛(X𝓡). This can be verified in even 2 × 2 examples. As a consequence the notation 𝓛 ∘ 𝓡 is ambiguous. This we resolve in the following manner. First, L ∘ 𝓡 ↔ L(X𝓡) and 𝓛 ∘ R ↔ (𝓛X)R, that is, the script notation takes precedence. As to 𝓛 ∘ 𝓡 we shall arbitrarily give precedence to the left-hand map. In this spirit we shall also write

$$(\mathcal{L} \circ \mathcal{R})(x) = (\mathcal{L}\,\tau[x])\,\mathcal{R}. \tag{9}$$

Although we do not pause to do so here, it can be demonstrated that L ∘ R, 𝓛 ∘ R, L ∘ 𝓡, and 𝓛 ∘ 𝓡 all compute at the same speed, with the same number of multipliers and the same architectures. The increase in generality comes at a cost only in the complexity of local memory.
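The non-associativity noted above is easy to exhibit numerically. A small sketch (NumPy, randomly generated operands; function names illustrative) implements the two script primitives and confirms that (𝓛X)𝓡 and 𝓛(X𝓡) differ already for p = 2:

```python
import numpy as np

def script_L(Ls, X):   # per (7): column j of X gets L_j
    return np.column_stack([Lj @ X[:, j] for j, Lj in enumerate(Ls)])

def script_R(X, Rs):   # per Section 3.2: row i of X gets R_i
    return np.vstack([X[i, :] @ Ri for i, Ri in enumerate(Rs)])

p = 2
rng = np.random.default_rng(5)
Ls = [rng.standard_normal((p, p)) for _ in range(p)]
Rs = [rng.standard_normal((p, p)) for _ in range(p)]
X = rng.standard_normal((p, p))

# (script-L X) script-R differs from script-L (X script-R) already for p = 2
assert not np.allclose(script_R(script_L(Ls, X), Rs), script_L(Ls, script_R(X, Rs)))
```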

IV. A COMPARATIVE EXAMPLE

Our development has provided a family of representations which may be used as building blocks in CTP expansions of arbitrary linear maps. Each of these representations computes at the same speed and utilizes the same number of multipliers. The ordering L ∘ R ⊂ 𝓛 ∘ R ⊂ 𝓛 ∘ 𝓡 is appropriate for generality and affiliated memory complexity. It is also appropriate for the representational ability of the classes in question.

In the present section we consider an example in detail. This example, although simple in form, demonstrates with clarity the relative representative abilities of the three forms. Consider then n = 4 and input vector (x₁, x₂, x₃, x₄). The matrices L, X, R are given by

$$L = \begin{bmatrix} l_{11} & l_{12} \\ l_{21} & l_{22} \end{bmatrix}, \quad X = \begin{bmatrix} x_1 & x_3 \\ x_2 & x_4 \end{bmatrix}, \quad R = \begin{bmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{bmatrix}.$$


The matrix LXR is given by

$$LXR = \begin{bmatrix}
(l_{11}x_1 + l_{12}x_2)r_{11} + (l_{11}x_3 + l_{12}x_4)r_{21} & (l_{11}x_1 + l_{12}x_2)r_{12} + (l_{11}x_3 + l_{12}x_4)r_{22} \\
(l_{21}x_1 + l_{22}x_2)r_{11} + (l_{21}x_3 + l_{22}x_4)r_{21} & (l_{21}x_1 + l_{22}x_2)r_{12} + (l_{21}x_3 + l_{22}x_4)r_{22}
\end{bmatrix}.$$

The equivalent linear map y = Ax, where y = τ⁻¹[LXR], is given by

$$L \circ R = \begin{bmatrix}
l_{11}r_{11} & l_{12}r_{11} & l_{11}r_{21} & l_{12}r_{21} \\
l_{21}r_{11} & l_{22}r_{11} & l_{21}r_{21} & l_{22}r_{21} \\
l_{11}r_{12} & l_{12}r_{12} & l_{11}r_{22} & l_{12}r_{22} \\
l_{21}r_{12} & l_{22}r_{12} & l_{21}r_{22} & l_{22}r_{22}
\end{bmatrix}. \tag{10}$$

Consider now the map 𝓛 ∘ R, noting that L ∘ 𝓡 is the dual with similar properties. For clarity of notation we shall use L, M in lieu of L₁, L₂. Repeating the above development it follows that

$$\mathcal{L}XR = \begin{bmatrix}
(l_{11}x_1 + l_{12}x_2)r_{11} + (m_{11}x_3 + m_{12}x_4)r_{21} & (l_{11}x_1 + l_{12}x_2)r_{12} + (m_{11}x_3 + m_{12}x_4)r_{22} \\
(l_{21}x_1 + l_{22}x_2)r_{11} + (m_{21}x_3 + m_{22}x_4)r_{21} & (l_{21}x_1 + l_{22}x_2)r_{12} + (m_{21}x_3 + m_{22}x_4)r_{22}
\end{bmatrix}$$

and that

$$\mathcal{L} \circ R = \begin{bmatrix}
l_{11}r_{11} & l_{12}r_{11} & m_{11}r_{21} & m_{12}r_{21} \\
l_{21}r_{11} & l_{22}r_{11} & m_{21}r_{21} & m_{22}r_{21} \\
l_{11}r_{12} & l_{12}r_{12} & m_{11}r_{22} & m_{12}r_{22} \\
l_{21}r_{12} & l_{22}r_{12} & m_{21}r_{22} & m_{22}r_{22}
\end{bmatrix}. \tag{11}$$

Our next generalization considers 𝓛 ∘ 𝓡. For clarity of notation we continue with 𝓛 → L, M and use also 𝓡 → R, S. Repeating the basic calculation it follows that

$$\mathcal{L}X\mathcal{R} = \begin{bmatrix}
(l_{11}x_1 + l_{12}x_2)r_{11} + (m_{11}x_3 + m_{12}x_4)r_{21} & (l_{11}x_1 + l_{12}x_2)r_{12} + (m_{11}x_3 + m_{12}x_4)r_{22} \\
(l_{21}x_1 + l_{22}x_2)s_{11} + (m_{21}x_3 + m_{22}x_4)s_{21} & (l_{21}x_1 + l_{22}x_2)s_{12} + (m_{21}x_3 + m_{22}x_4)s_{22}
\end{bmatrix}.$$

The representation for 𝓛 ∘ 𝓡 follows easily, namely

$$\mathcal{L} \circ \mathcal{R} = \begin{bmatrix}
l_{11}r_{11} & l_{12}r_{11} & m_{11}r_{21} & m_{12}r_{21} \\
l_{21}s_{11} & l_{22}s_{11} & m_{21}s_{21} & m_{22}s_{21} \\
l_{11}r_{12} & l_{12}r_{12} & m_{11}r_{22} & m_{12}r_{22} \\
l_{21}s_{12} & l_{22}s_{12} & m_{21}s_{22} & m_{22}s_{22}
\end{bmatrix}. \tag{12}$$

Some comparisons of (10)-(12) are readily apparent. In (10) we have available only the 8 scalars of L, R. The general 4 × 4 matrix, of course, has 16 scalars and hence not every matrix can be calculated by L ∘ R. For example, suppose A is a 4 × 4 matrix with entries a_{ij} and L ∘ R = A. Then using (10) we see that

$$a_{11} = l_{11}r_{11}; \quad a_{31} = l_{11}r_{12}$$
$$a_{21} = l_{21}r_{11}; \quad a_{41} = l_{21}r_{12}$$

and hence a₁₁/a₃₁ = a₂₁/a₄₁ is a necessary restriction on A. In (11) we have available 12 scalars. More matrices are of the form 𝓛 ∘ R but again the 16 coefficient equations cannot always be met. In (12) we have 16 scalars which matches up with the 16 coefficient equations. The structural constraints of 𝓛 ∘ 𝓡, however, preclude a general solution. Indeed a condition of the type a₁₁/a₃₁ = a₁₂/a₃₂ remains necessary for a matrix A to be of the 𝓛 ∘ 𝓡 form.
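The ratio restriction is easy to observe numerically. Under the column-stacking convention the single-term matrix (10) is kron(Rᵀ, L); the sketch below (NumPy, random data) confirms that the first-column ratio condition holds for such matrices and fails for a generic 4 × 4 map:

```python
import numpy as np

rng = np.random.default_rng(6)
L = rng.standard_normal((2, 2)); R = rng.standard_normal((2, 2))
A = np.kron(R.T, L)                       # the L o R matrix of (10)

# a11/a31 = a21/a41 is forced by the single-term structure
assert np.isclose(A[0, 0] / A[2, 0], A[1, 0] / A[3, 0])

B = rng.standard_normal((4, 4))           # a generic 4 x 4 map violates it
print(np.isclose(B[0, 0] / B[2, 0], B[1, 0] / B[3, 0]))   # almost surely False
```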

We turn our attention now to the best approximation problem summarized in Section II. In particular, Lemma 2 identifies the optimal L^Δ, R^Δ such that ‖A − L ∘ R‖² is minimized. It is instructive to compare results of this nature for the three representations under consideration.

For this let A be an arbitrary 4 × 4 matrix with the natural 2 × 2 block partitions {A_{ij}}. We first list the concrete form of the respective approximation functionals:

$$\|A - L \circ R\|^2 = \|A_{11} - r_{11}L\|^2 + \|A_{21} - r_{12}L\|^2 + \|A_{12} - r_{21}L\|^2 + \|A_{22} - r_{22}L\|^2 \tag{13}$$

$$\|A - \mathcal{L} \circ R\|^2 = \|A_{11} - r_{11}L\|^2 + \|A_{21} - r_{12}L\|^2 + \|A_{12} - r_{21}M\|^2 + \|A_{22} - r_{22}M\|^2 \tag{14}$$

while

$$\begin{aligned}
\|A - \mathcal{L} \circ \mathcal{R}\|^2 ={}& \|P_1(A_{11} - r_{11}L)\|^2 + \|P_1(A_{21} - r_{12}L)\|^2 \\
&+ \|P_1(A_{12} - r_{21}M)\|^2 + \|P_1(A_{22} - r_{22}M)\|^2 \\
&+ \|P_2(A_{11} - s_{11}L)\|^2 + \|P_2(A_{21} - s_{12}L)\|^2 \\
&+ \|P_2(A_{12} - s_{21}M)\|^2 + \|P_2(A_{22} - s_{22}M)\|^2. \tag{15}
\end{aligned}$$

In (15) the notation P_j denotes using only the jth row of the matrix in question in computing the norms. It is evident that the minimums of the three expressions satisfy the ordering

$$\|A - L^\Delta \circ R^\Delta\|^2 \;\ge\; \|A - \mathcal{L}^\Delta \circ R^\Delta\|^2 \;\ge\; \|A - \mathcal{L}^\Delta \circ \mathcal{R}^\Delta\|^2.$$

Concerning the optimal forms, a comparison of (13)-(15) with the contents of Lemma 2 yields the following.


Case I, L^Δ ∘ R^Δ: We have the following. (r₁₁^Δ, r₂₁^Δ, r₁₂^Δ, r₂₂^Δ) is a unit eigenvector of the 4 × 4 matrix

$$\Gamma = \begin{bmatrix} \langle A_{11}, A_{11}\rangle & \cdots & \langle A_{11}, A_{22}\rangle \\ \vdots & & \vdots \\ \langle A_{22}, A_{11}\rangle & \cdots & \langle A_{22}, A_{22}\rangle \end{bmatrix}$$

associated with a maximum eigenvalue. The matrix L^Δ is computed by (6). The function J has the minimal value

$$J(L^\Delta, R^\Delta) = \sum_{i=1}^{3} \lambda_i$$

where the λ_i are the three smallest eigenvalues of Γ.

Case II, 𝓛^Δ ∘ R^Δ: Here (r₁₁^Δ, r₁₂^Δ) and (r₂₁^Δ, r₂₂^Δ) are unit eigenvectors associated with maximum eigenvalues of the two respective matrices

$$\begin{bmatrix} \langle A_{11}, A_{11}\rangle & \langle A_{11}, A_{21}\rangle \\ \langle A_{21}, A_{11}\rangle & \langle A_{21}, A_{21}\rangle \end{bmatrix}, \qquad \begin{bmatrix} \langle A_{12}, A_{12}\rangle & \langle A_{12}, A_{22}\rangle \\ \langle A_{22}, A_{12}\rangle & \langle A_{22}, A_{22}\rangle \end{bmatrix}$$

respectively. The matrices L^Δ, M^Δ are given by

$$L^\Delta = r_{11}^\Delta A_{11} + r_{12}^\Delta A_{21}, \qquad M^\Delta = r_{21}^\Delta A_{12} + r_{22}^\Delta A_{22}.$$

The minimal value of J is given by

$$J(\mathcal{L}^\Delta, R^\Delta) = \lambda_1 + \mu_1$$

where λ₁, μ₁ are the smallest eigenvalues of the two matrices given above.
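Case II is directly computable. The following NumPy sketch (helper name and data illustrative) forms the two 2 × 2 Grammians, takes their dominant eigenvectors, assembles L^Δ, M^Δ, and checks that the residual equals the sum of the two smallest eigenvalues, as stated above:

```python
import numpy as np

def gram2(B1, B2):
    # 2 x 2 Grammian of two blocks under the trace inner product
    return np.array([[np.trace(B1.T @ B1), np.trace(B1.T @ B2)],
                     [np.trace(B2.T @ B1), np.trace(B2.T @ B2)]])

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 4))
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

# L is paired with the left block column {A11, A21}, M with {A12, A22}
GL, GM = gram2(A11, A21), gram2(A12, A22)
wL, VL = np.linalg.eigh(GL); r11, r12 = VL[:, -1]     # dominant eigenvector
wM, VM = np.linalg.eigh(GM); r21, r22 = VM[:, -1]
L = r11 * A11 + r12 * A21
M = r21 * A12 + r22 * A22

A_hat = np.block([[r11 * L, r21 * M],
                  [r12 * L, r22 * M]])                # the structure of (11)
assert np.isclose(np.sum((A - A_hat)**2), wL[0] + wM[0], atol=1e-8)
```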

Case III, 𝓛^Δ ∘ 𝓡^Δ: Here (r₁₁^Δ, r₁₂^Δ), (r₂₁^Δ, r₂₂^Δ), (s₁₁^Δ, s₁₂^Δ), and (s₂₁^Δ, s₂₂^Δ) are all unit eigenvectors associated with maximum eigenvalues of the four matrices

$$\begin{bmatrix} \langle P_1A_{11}, P_1A_{11}\rangle & \langle P_1A_{11}, P_1A_{21}\rangle \\ \langle P_1A_{21}, P_1A_{11}\rangle & \langle P_1A_{21}, P_1A_{21}\rangle \end{bmatrix}, \qquad \begin{bmatrix} \langle P_1A_{12}, P_1A_{12}\rangle & \langle P_1A_{12}, P_1A_{22}\rangle \\ \langle P_1A_{22}, P_1A_{12}\rangle & \langle P_1A_{22}, P_1A_{22}\rangle \end{bmatrix},$$

$$\begin{bmatrix} \langle P_2A_{11}, P_2A_{11}\rangle & \langle P_2A_{11}, P_2A_{21}\rangle \\ \langle P_2A_{21}, P_2A_{11}\rangle & \langle P_2A_{21}, P_2A_{21}\rangle \end{bmatrix}, \qquad \begin{bmatrix} \langle P_2A_{12}, P_2A_{12}\rangle & \langle P_2A_{12}, P_2A_{22}\rangle \\ \langle P_2A_{22}, P_2A_{12}\rangle & \langle P_2A_{22}, P_2A_{22}\rangle \end{bmatrix}$$

respectively. The matrices L^Δ, M^Δ are constructed by

$$\begin{aligned}
P_1L^\Delta &= r_{11}^\Delta P_1A_{11} + r_{12}^\Delta P_1A_{21}; & P_1M^\Delta &= r_{21}^\Delta P_1A_{12} + r_{22}^\Delta P_1A_{22} \\
P_2L^\Delta &= s_{11}^\Delta P_2A_{11} + s_{12}^\Delta P_2A_{21}; & P_2M^\Delta &= s_{21}^\Delta P_2A_{12} + s_{22}^\Delta P_2A_{22}.
\end{aligned}$$

The minimal value of J is given by

$$J(\mathcal{L}^\Delta, \mathcal{R}^\Delta) = \lambda_1 + \mu_1 + \sigma_1 + \delta_1$$

where these four scalars are the minimal eigenvalues of the four above matrices, respectively.

V. COMMENTARY

In this study emphasis is placed on connections between matrix properties and concurrent array architectures. This emphasis recognizes that high throughput applications dictate parallel processing. Thus sequential processing criteria, such as the number of multiplications, are no longer relevant.

In Section III we noted that (𝓛X)𝓡 ≠ 𝓛(X𝓡). Now that some concrete examples are in hand it is convenient to demonstrate this. Indeed, using methods similar to those employed above the reader may verify that

$$\mathcal{L}(X\mathcal{R}) = \begin{bmatrix}
l_{11}(x_1r_{11} + x_3r_{21}) + l_{12}(x_2s_{11} + x_4s_{21}) & m_{11}(x_1r_{12} + x_3r_{22}) + m_{12}(x_2s_{12} + x_4s_{22}) \\
l_{21}(x_1r_{11} + x_3r_{21}) + l_{22}(x_2s_{11} + x_4s_{21}) & m_{21}(x_1r_{12} + x_3r_{22}) + m_{22}(x_2s_{12} + x_4s_{22})
\end{bmatrix}.$$

Case ZZZ, 9 " 0 9 ": Here ( rl:, r2: ), ( rl;, r2$ ) and (s; ~ s& ),(SA, si; ) are all unit eigenvectors associated with maxlmum eigenvalues of the four matrices:

I 1 I 1

(PIAll, '11) (P1A113 A12)

( P 1 4 2 , A l l ) (PlA129 4 2 )

(P1A213 '21) ('lA21> '22)

[ [ [

(PlA219 A22) (PlA22, A22)

(P2All9 Al l ) (P2All2 4 2 )

( P 2 A 1 2 9 4 1 ) (P2A127 4 2 )

(P2A21, A21) (P2A213 A 2 2 )

(P2A211 A 2 2 ) (P2A22, A 2 2 )

A direct comparison with the calculation (𝓛X)𝓡 presented earlier confirms the non-associative nature of the script operators.

In comparing Cases I-III it is apparent that the left and right operations have somewhat different effects on the approximation problem. These differences, however, arise from the non-associative characteristic. That is, the map 𝓡 in (𝓛X)𝓡 does not have the same effect as 𝓡 in 𝓛(X𝓡).

In Section 3.1 the DFT has been presented as an example of a Case II fast form. In companion studies the discrete cosine transform [11], the Wigner distribution [10], the Hankel and Toeplitz matrices [9], and the Hilbert transform [9] have been similarly analyzed. It is of interest to note that such analytic techniques as frequency and time decimations are useful but not essential in establishing the fast form representation.

Concerning the CTP expansion of an arbitrary matrix we note that conventional spectral expansions involve matrices (projectors) of size n and in general have n terms. By comparison the Case I CTP expansion, for example, involves matrices of size √n and has at most √n terms present.

APPENDIX

In this Appendix we summarize the proof of Lemma 2 in Section II. For this it suffices to consider the real case. Thus L, R ∈ R^{p×p} and T ∈ R^{n×n} where n = p². The functional

$$J(L, R) = \|T - L \circ R\|^2 = \operatorname{tr}\{(T - L \circ R)(T - L \circ R)^*\}$$

is to be minimized with respect to L and R.

T H E I N S T I T U T I O N O F E L E C T R I C A L E N G I N E E R S

Page 8: Concurrent forms of signal processing algorithms


Note first that if L₀, R₀ is a minimal pair then λL₀, λ⁻¹R₀ is also minimal for all λ ≠ 0. This nonuniqueness is easily removed. The most convenient adjustment is to add the constraint

$$\sum_{ij} r_{ij}^2 - 1 = 0. \tag{1}$$

Recall that L ∘ R has a natural partition with subblocks {r_{ij}L}. Partitioning T in the same fashion it follows by inspection that

$$J^*(L, R) = \sum_{ij} \operatorname{tr}\big\{ T_{ij}T_{ij}^* - r_{ij}(T_{ij}L^* + LT_{ij}^*) + r_{ij}^2 LL^* \big\} + \lambda\Big(\sum_{ij} r_{ij}^2 - 1\Big) \tag{2}$$

where λ is the familiar Lagrange multiplier.

Since the constraint (1) does not involve L it is convenient to use variational methods with respect to L first. The results of this variation take the form

$$\delta J^*(\cdot, R) = 0 \;\Rightarrow\; L = \sum_{ij} r_{ij} T_{ij}. \tag{3}$$

To determine R a direct variation on J is appropriate; however, it is more convenient to substitute L of (3) into J(L, R), which yields

$$J(L, R) = \sum_{ij} \operatorname{tr}\, T_{ij}T_{ij}^* - \sum_{ij}\sum_{k,l} r_{ij}\,(\operatorname{tr}\, T_{ij}^* T_{kl})\, r_{kl} \tag{4}$$

where constraint (1) has been imposed. The second term in (4) is a quadratic form ⟨x, Γx⟩ with Γ being the Grammian matrix of the partition blocks of T and x being a vector version of the matrix R, namely, x = τ⁻¹(R).

Since Γ is symmetric it follows that J(L, R) is minimized over ‖τ⁻¹(R)‖ = 1 by choosing a unit eigenvector associated with the maximum eigenvalue of Γ. This completes the proof of Lemma 2.

The corollary of Lemma 2 is also readily confirmed. For this we need two identities:

(i) (L ∘ R)(M ∘ N) = (LM) ∘ (RN)
(ii) tr(L ∘ R) = tr(L) tr(R).
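Both identities reduce, under the block-form correspondence L ∘ R ↔ kron(R, L) of (3), to familiar Kronecker-product facts; a short NumPy check (random operands, assuming that correspondence):

```python
import numpy as np

rng = np.random.default_rng(8)
p = q = 3
L, M = rng.standard_normal((q, q)), rng.standard_normal((q, q))
R, N = rng.standard_normal((p, p)), rng.standard_normal((p, p))

# identity (i): composition of single-term CTP maps is single-term
assert np.allclose(np.kron(R, L) @ np.kron(N, M), np.kron(R @ N, L @ M))
# identity (ii): the trace is multiplicative across the factors
assert np.isclose(np.trace(np.kron(R, L)), np.trace(R) * np.trace(L))
```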

Now let x, y be two orthogonal eigenvectors of Γ. Let R = τ(x) and N = τ(y); then tr(RN) = ⟨x, y⟩ = 0,


consequently

$$\operatorname{tr}\,[(L \circ R)(M \circ N)] = 0.$$

With this result in hand it is not difficult to confirm that the expansion of the corollary of Lemma 2 is an orthogonal expansion which is, moreover, in 1:1 correspondence with the spectral expansion of Γ.

REFERENCES

[1] H. T. Kung, L. M. Ruane, and D. W. L. Yen, "A two-level pipelined systolic array for convolutions," in VLSI Systems and Computations, Computer Science Dept., Carnegie-Mellon Univ., pp. 255-264, 1981.
[2] K. Hwang and Y. H. Cheng, "VLSI computing structure for solving large-scale linear systems of equations," in Proc. Int. Conf. Parallel Processing, pp. 217-227, Aug. 1980.
[3] W. A. Porter and J. L. Aravena, "Cylindrical arrays for matrix multiplications," in Proc. 24th Allerton Conf., Oct. 1986.
[4] J. L. Aravena and W. A. Porter, "Array based design of digital filters," submitted to IEEE Trans. Acoust., Speech, Signal Processing.
[5] I. J. Good, "The interaction algorithm and practical Fourier analysis," J. Royal Stat. Soc., ser. B, vol. 20, 1958; addendum, vol. 22, 1960.
[6] N. Dunford and J. Schwartz, Linear Operators, Part II. New York: Wiley, 1963.
[7] J. L. Aravena and W. A. Porter, "Non-planar switchable arrays," Circuits, Syst., Signal Processing, vol. 7, no. 2, pp. 213-234, 1988.
[8] W. A. Porter and J. L. Aravena, "Orbital architectures with dynamic reconfiguration," Proc. Inst. Elect. Eng., vol. 134, pt. E, no. 6, Nov. 1987.
[9] W. A. Porter, "Fast computation of banded maps," in Trans. ComCon 88, Baton Rouge, LA, Oct. 19-21, 1988.
[10] W. A. Porter, "Arrays for quadratic signal processing," in Proc. ISCAS, Philadelphia, PA, May 1987.
[11] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 1455-1461, Oct. 1987.

William A. Porter received the B.S. degree in electrical engineering from Michigan Technological University, and the M.S. and Ph.D. degrees in electrical engineering from the University of Michigan.

He has served on the faculties of the University of Michigan and the University of Southern California and is currently Professor of Electrical and Computer Engineering at Louisiana State University, Baton Rouge. His current research interests include computer architecture, signal processing, fast algorithmic forms, and distributive memory recognition devices.
