
Fast Forms of Banded Maps

Abstract: In real-time applications, computation speed can be a critical factor in the success of signal processing algorithms. The systolic array provides the potential for computational parallelism, and the implementation of algorithms on such arrays has been widely studied. The present research is a continuation of earlier work which identified decomposition techniques for linear algorithms. The resultant concurrent triple product structure is tested here on the class of maps which are diagonally banded. The banded maps include Toeplitz, convolution, and Hilbert transform operations, each of which is considered in the study.

I. INTRODUCTION

NUMEROUS earlier studies have considered the question of fast and hardware-efficient computation of signal processing algorithms. While much progress has been made, the speed of computation remains of critical importance in real-time applications where dimensionality is large, the sample rate is substantial, or the algorithm in question requires iteration.

Several authors (see, for example, [1], [2], [11]) have pointed out that matrix operations can be performed using array structures. While the details of specific designs vary, the speed of matrix/vector operations is proportional to the size of the matrices involved. For n × n matrices, a matrix/vector operation, using n multipliers in parallel, can be achieved in O(n) multiplication time units. For even moderately large n, this computational speed may not be adequate for real-time applications.

In earlier studies [3]-[5], it was noted that some algorithms have a special structure which supports a much faster computation than the general case, the FFT being the best-known example of the structure in question. Moreover, it has been shown that every linear algorithm can be decomposed into a finite sum of such special-class algorithms. This decomposition is referred to as a concurrent triple product (CTP) decomposition. Each CTP provides the basis for speed versus hardware comparisons and for comparisons to conventional array processing.

The principle on which the CTP decomposition is based is illustrated by the following example. Let x be an n-dimensional vector with n = pq. Let X be a p × q matrix formed by mapping the vector onto a two-dimensional array (for example, using segments of x as columns of X).

Manuscript received October 26, 1987; revised July 20, 1989. This work was supported in part by SDIO/IST under Contract 24962-MA-SDI administered by the U.S. Army Research Office, Durham, NC.

The author is with the Department of Electrical and Computer Engineering, University of Alabama in Huntsville, AL 35899.

IEEE Log Number 9035648.

Define the linear operations

    y = Ax,    Y = LXR.

Using conventional arrays, the matrix/vector operation y = Ax takes O(pq) time. Using similar arrays, the matrix/matrix operation Y = LXR can be computed in O(p + q) time [4]. The relative speed of the two types of linear operations (on the same data, x versus X formats) is thus apparent. Clearly, the triple matrix product can be up to O(pq/(p + q)) times faster than the matrix/vector multiplication.
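For a concrete, hypothetical instance of the comparison: with n = 1024 and p = q = 32, the matrix/vector form needs on the order of pq = 1024 sequential multiplication steps, while the triple product form needs on the order of p + q = 64, a ratio of pq/(p + q) = 16. Choosing p = q = √n maximizes this ratio at √n/2.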

It was noted above that there are well-known cases in which both linear operations, Ax and LXR, represent the same algorithm. One such case is the n-point DFT algorithm (see also [10]). Our interest here, however, is with another class of matrices, denoted banded matrices. Our interest is motivated by the Hilbert transform, the Toeplitz matrix, and the convolution matrix, each of which is discussed in later sections. Several other banded matrices can also arise, and hence a general analysis of this class of matrices is in order.

In Sections II and III, we focus attention on the CTP and its interrelationship with array architectures. In Section IV, the concept of computational reassignment is introduced. This technique takes advantage of matrix sparsity to simplify the CTP expansion. In Section V, while considering the Hilbert transform, we show that computational reassignment in the limit becomes a complete reorganization of the algorithm. Thus, the CTP decomposition may be viewed as a family of techniques.

II. BANDED MAPS

The class of maps referred to here as banded maps includes several special cases which are familiar to system theorists. To reinforce this familiarity, we present several examples before proceeding with the analysis. For convenience of comparison, we consider matrices acting on E^n with n = 9.

Our first example is the Toeplitz matrix of Fig. 1. The entries a1, ..., a9 are from the underlying field. The terminology "banded" is motivated by the obvious diagonal pattern of the matrix. In Fig. 2 we display a banded matrix with three nonzero diagonals. This matrix is an obvious specialization of the Toeplitz matrix. Other examples arise from setting subsets of the scalars a_i to zero. In Fig. 3, the convolution matrix is displayed.

In Fig. 4 we display a nonsquare banded matrix. This example is a natural extension of the convolution matrix. In Fig. 5 we display a skew Toeplitz matrix.


Fig. 1. The Toeplitz matrix.
Fig. 2. The banded matrix.
Fig. 3. The convolution matrix.
Fig. 5. The skew Toeplitz matrix.

The matrices of Figs. 2-4 also have skew counterparts. While our analysis here is focused on the banded matrices, the skew banded counterparts are computable by identical techniques. With these examples in hand, we formalize two concepts.

Definition: A matrix A ∈ E^{n×n} is banded (respectively, skew banded) if every diagonal (respectively, skew diagonal) contains only copies of a single scalar. A matrix A ∈ E^{n×m} is banded (respectively, skew banded) if every square subblock is banded (respectively, skew banded).

In particular, we shall say that B is the skew form of the banded matrix A if B is skew banded, with diagonals identical to those of A, preserving the order displayed in Figs. 1 and 5.
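In code form, the definition is a constancy check along each diagonal (or, for the skew case, each skew diagonal); a sketch we add for concreteness, with is_banded and is_skew_banded our own names, not the paper's:

```python
import numpy as np

def is_banded(A):
    """True if every diagonal of A contains copies of a single scalar."""
    rows, cols = A.shape
    return all(np.unique(np.diagonal(A, offset=d)).size <= 1
               for d in range(-rows + 1, cols))

def is_skew_banded(A):
    """True if every skew diagonal contains copies of a single scalar.
    Skew diagonals of A are the diagonals of the column-reversed matrix."""
    return is_banded(np.fliplr(A))
```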

Consider now an arbitrary matrix A ∈ E^{n×n} and the associated matrix/vector computation

    y = Ax,    y, x ∈ E^n.

We assume n = pq for integers p, q, and expand the matrix A with zeros if necessary to satisfy this assumption.

The symbol τ denotes an isometry τ: E^n → E^{p×q}. To define τ, let x be segmented, namely, x = (x_1, ..., x_q) where x_i ∈ E^p. We then construct the matrix X using these segments as columns.

Fig. 4. The extended convolution matrix.

The resultant map τ: x → X is called the segmentation isometry.
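A minimal sketch of τ (the names seg and unseg are ours, not the paper's):

```python
import numpy as np

def seg(x, p, q):
    """Segmentation isometry tau: E^n -> E^(p x q), n = p*q.
    Splits x into q segments of length p and uses them as columns."""
    assert x.size == p * q
    return x.reshape(q, p).T

def unseg(X):
    """Inverse isometry: restack the columns of X into a vector."""
    return X.T.reshape(-1)

x = np.arange(1, 10)          # n = 9, p = q = 3
X = seg(x, 3, 3)              # columns: (1,2,3), (4,5,6), (7,8,9)
assert np.array_equal(unseg(X), x)
```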

Consider then matrices L_i ∈ E^{p×p} and R_i ∈ E^{q×q}, i = 1, ..., k, where k is arbitrary. The set {(L_i, R_i)} is called a concurrent triple product (CTP) representation for A provided

    y = Ax  ⟺  τ(y) = Σ_{i=1}^{k} L_i τ(x) R_i.    (1)

Adopting the natural tensor notation, we write

    A = Σ_{i=1}^{k} L_i ∘ R_i

with the meaning of (1).

In companion studies [4], [9], the CTP is studied in some detail. It is known that every matrix has a CTP form. Methods for minimizing the number of terms in the CTP expansion have also been established. There are also more general forms than L_i ∘ R_i that can be used in the CTP expansion of A.
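For concreteness, note that each term L_i ∘ R_i, viewed as a matrix on E^n, is the Kronecker product R_iᵀ ⊗ L_i under the column-segmentation τ. A small numerical check we add (random data, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p = q = 3
terms = [(rng.standard_normal((p, p)), rng.standard_normal((q, q)))
         for _ in range(2)]                       # an arbitrary two-term CTP

A = sum(np.kron(R.T, L) for L, R in terms)        # the matrix the CTP represents
x = rng.standard_normal(p * q)

X = x.reshape(q, p).T                             # tau(x): segments as columns
Y = sum(L @ X @ R for L, R in terms)              # the concurrent triple products
y = Y.T.reshape(-1)                               # tau^{-1}(Y)

assert np.allclose(A @ x, y)                      # y = A x, per (1)
```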

The CTP expansion results are of interest for two basic reasons. First, the form L∘R provides for uniform data flow, a natural parallelism, and an ordering of computations. Second, there is a natural connection with array architectures, which can implement the L∘R form with a high degree of efficiency and commensurate computational speed.


Fig. 6. Stream-driven computation T = XR.
Fig. 7. Switched array computation of LT.

For the present purposes, one L∘R example should suffice. Since Figs. 1-4 are all n = 9, we choose p = q = 3, in which case L, R ∈ E^{3×3}.

In Fig. 6 we depict an array performing the XR product. The subscripts on the x variables denote the location in the original vector x ∈ E^n. The subscripts on the r variables denote row/column location in R. The nodes of the array accumulate the dot product of the two respective data sequences available at the node. The notation t_ij identifies the element of T = XR being computed at the respective node.

In Fig. 7, the array is switched to accommodate the L(XR) calculation. As the last entry of the X, R data streams passes through each node, the function of the node is switched. Thus, a switching "wave" travels down the array in synchronism with the last X, R entries. In the switched form, t_ij is retained in memory. The transversal signals are set to zero. The L row sequences are entered on the paths originally assigned to the X data. Each node now multiplies its t_ij value by the incoming L sequence and adds this to the partial sum arriving on the other channel.

III. SOME L∘R DECOMPOSITIONS

Consider then the banded matrices of Section II. In each case, we use the segmentation isometry, forming 3 × 3 matrices L, R and similarly partitioning the original A matrix into 3 × 3 blocks.

Concerning a matrix representable as A = L∘R, it is easily verified that A has a partitioned block structure; indeed,

    A = [ r_11 L   r_21 L   ...   r_q1 L
          r_12 L   r_22 L   ...   r_q2 L
            ...      ...            ...
          r_1q L   r_2q L   ...   r_qq L ]

that is, the (j, i) block of A is r_ij L.

When A is banded, identical partition blocks occur in an obvious manner; e.g., the diagonal blocks are identical. Thus, a CTP expansion is often obtainable by inspection. This is the case with the examples we consider here.

Consider first the Toeplitz matrix of Fig. 1. By inspection, we see that an equivalent three-term CTP is available, determined by

    L1 = [ a1 a9 a8      L2 = [ a4 a3 a2      L3 = [ a7 a6 a5
           a2 a1 a9             a5 a4 a3             a8 a7 a6
           a3 a2 a1 ]           a6 a5 a4 ]           a9 a8 a7 ]

    R1 = [ 1 0 0         R2 = [ 0 1 0         R3 = [ 0 0 1
           0 1 0                0 0 1                1 0 0
           0 0 1 ]              1 0 0 ]              0 1 0 ]

We note that the R_i matrices perform only permutations on the columns of X. The array of Fig. 6 will certainly do this; however, alternative circuitry may do this as well. This pattern is repeated in all our examples.
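This three-term decomposition is readily checked numerically; the sketch below (ours, with numerical stand-ins for a1, ..., a9) builds the circulant Toeplitz of Fig. 1 and verifies that the terms L_i ∘ R_i, realized as Kronecker products, sum to A:

```python
import numpy as np

a = np.arange(1.0, 10.0)                      # stand-ins for a1, ..., a9
idx = np.arange(9)
A = a[np.subtract.outer(idx, idx) % 9]        # Fig. 1: entry (i, j) depends on (i - j) mod 9

L1, L2, L3 = A[0:3, 0:3], A[3:6, 0:3], A[6:9, 0:3]
R1 = np.eye(3)
R2 = np.array([[0.0, 1, 0], [0, 0, 1], [1, 0, 0]])
R3 = np.array([[0.0, 0, 1], [1, 0, 0], [0, 1, 0]])

# Each term L_i o R_i is the matrix kron(R_i^T, L_i) on E^9.
CTP = sum(np.kron(R.T, L) for L, R in ((L1, R1), (L2, R2), (L3, R3)))
assert np.allclose(A, CTP)
```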

The skew Toeplitz matrix of Fig. 5 is also amenable to a CTP representation. Indeed, it can be verified that the appropriate components are identified by

    L1' = [ a2 a3 a4     L2' = [ a5 a6 a7     L3' = [ a8 a9 a1
            a3 a4 a5             a6 a7 a8             a9 a1 a2
            a4 a5 a6 ]           a7 a8 a9 ]           a1 a2 a3 ]

    R1' = [ 1 0 0        R2' = [ 0 1 0        R3' = [ 0 0 1
            0 0 1                1 0 0                0 1 0
            0 1 0 ]              0 0 1 ]              1 0 0 ]


We note that the connection between the Toeplitz matrix and its skew counterpart is continued in the CTP for these two examples. Indeed, L_i' is the skew of L_i, and R_i' is the skew of R_i, i = 1, 2, 3.

The above relationship is in fact generally true, and we formally recognize this as the following result.

Result 1: Let B be the skew of the banded matrix A. If {(L_i, R_i)} is a CTP for A, then {(L_i', R_i')} is a CTP for B, where the prime denotes skew form. Since (L')' = L, the property is symmetric.

To prove this result, it suffices to consider an arbitrary banded A ∈ E^{n×n} and its skew form A'. Imposing the block partition structure on A and A', the result follows by inspection.

The matrices of Figs. 2 and 3 yield readily to the same type of investigation demonstrated above. We note that the all-zero diagonals in these examples result in more zeros in the L_i and R_i matrices; however, three terms in the CTP still seem to be necessary. In some cases, however, sparsity can be used to specify a more compact CTP. We take up this matter in the next section.

IV. COMPACTION BY COMPUTATIONAL REASSIGNMENT

In this section we describe a technique which may be useful for compacting the computation of sparse matrices. This technique, referred to as computational reassignment, is valid for nonbanded and banded matrices alike. Computational reassignment is based on the fact that all scalars in a column of a matrix multiply the same entry of the vector on which that matrix operates. Thus, if a matrix column has a zero entry, it is feasible to relocate nonzero entries in the same column, provided the resultant product is then properly relocated.

To illustrate computational reassignment, consider the matrix-vector operation

    Ax = [ a1  0   0  ] [ x1 ]
         [ 0   0   0  ] [ x2 ]
         [ 0   a2  a3 ] [ x3 ].

We note that all scalars in the second column multiply x2. Thus, the diagonal matrix operation

    A'x = [ a1  0   0  ] [ x1 ]
          [ 0   a2  0  ] [ x2 ]
          [ 0   0   a3 ] [ x3 ]

has the same scalar operations and differs only in the use of these products (namely, a2x2 is used in the second row rather than the third row summation).
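In code, the reassignment on this small example amounts to forming the three diagonal products and re-routing one of them; a minimal sketch we add, with arbitrary stand-in values:

```python
import numpy as np

a1, a2, a3 = 2.0, 5.0, 7.0
x = np.array([1.0, 3.0, 4.0])
A = np.array([[a1, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, a2,  a3]])

# The three products of the diagonal form A' (one multiplier, used serially).
products = np.array([a1, a2, a3]) * x          # a1*x1, a2*x2, a3*x3

# Reassignment: route a2*x2 into the third row sum instead of the second.
y = np.array([products[0], 0.0, products[1] + products[2]])
assert np.allclose(y, A @ x)
```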

If A'x is computed sequentially by row from top to bottom, then the diagonal form requires only one multiplier. If we implement A' but route a2x2 to the third row sum, then Ax is equally convenient. As a more pertinent illustration, consider the matrix of Fig. 2 with the additional assumption a2 = 0. It is evident by inspection that

    L1 = [ a1  0   0       L2 = [ 0  0  a3      R1 = I,   R2 = [ 0 1 0
           a3  a1  0              0  0  0                        0 0 1
           0   a3  a1 ],          0  0  0 ],                     0 0 0 ]

is a CTP for this example. Suppose then that the a3 in the fourth and seventh rows are relocated to the first and fifth rows. We then have a one-term CTP, namely,

    L = [ a1  0   â        R = I,
          a3  a1  0
          0   a3  a1 ],

where â = a3 when acting on col(x1, x2, x3), â = a3 when acting on col(x4, x5, x6), and â = 0 when acting on col(x7, x8, x9). In this example, the product 0 · x3, which occurs early in the flow of computations, is replaced by a3x3, which is otherwise supplied by L2∘R2. We route a3x3 to the proper summation to restore the integrity of the calculation.

In Figs. 8 and 9 we display two examples of the above displacement. In Fig. 8, the matrix A is fixed in node registers. The computational reassignment is implemented entirely at the a13 node. All that is required is to delay each scalar product by one time unit before adding it to the partial sum. In Fig. 9 we show the A matrix data flowing on the longitudinal links. The reassigned calculations must therefore flow in synchronism.

In Fig. 9, the x32 node calculates a13x32 and ships the product to the x33 node. The x33 node adds the a13x32 product to the incoming partial sum. We note that Figs. 8 and 9 have different interconnections. This, however, is dictated by the flow of the data through the array.

Returning now to the matrix of Fig. 2, we note two possible reassignments in addition to those described above: the relocation of the a2 products in the fourth and seventh rows. These relocations reduce A to a block diagonal matrix with the diagonal block

    L = [ a1  â   ã
          a3  a1  â
          a2  a3  a1 ],

where â and ã denote relocated a3 and a2 entries, gated as before. Our earlier analysis extends to this case as well.


Fig. 8. A computation displacement at a node.
Fig. 9. Computation displacement with forwarding.

Remark: Since computational reassignment implies a nonuniform array structure, it obviously needs to be used with discretion. If the reduction in hardware does not justify the attendant complexity in the data flow, then computational reassignment should not be utilized. In the next section, we present an example where computational reassignment does retain the uniform computation characteristic.

V. THE HILBERT TRANSFORM

The discrete Hilbert transform (DHT) has a distinctive banded form which we now consider. Because the DHT has drawn the attention of several authors ([6]-[8], for example), we shall give only the briefest introduction.

Let N be even and let the DHT be depicted as a map T on R^N. The map T has entries T_ij, i, j = 0, ..., N - 1, determined by

    T_ij = (2/N) Σ_{k=1}^{L} sin[2π(j - i)k/N],

where L = N/2 - 1. It is readily shown that

    i)  T_ij = -T_ji,  T_ii = 0;
    ii) T_ij = -T_{(j+1),(i+1)}.

An investigation of properties i) and ii) reveals the banded structure of T. This structure is depicted by an example, and we choose N = 16, which is shown in Fig. 10. (In Fig. 10, all entries are in units of 10^-3.) In this figure we see a relatively sparse matrix, with half of the entries of each column and row consisting of zero elements.

The case N = 16 is typical of all N-even (N/2 odd) examples. We note that the patterns of zero entries in adjacent rows are exact complements. The DHT therefore provides an example where relocation is fruitful. As a first step, row n + 1 can be superimposed on row n, producing the resultant matrix displayed in Fig. 11.

Even further compaction is possible. Notice that each column has identical entries. Moreover, half the entries are negatives of the other half. To take advantage of this observation, we segment the input vector x into 4-tuples and identify the matrix τ(x) = X in the usual style. In addition, we form the matrix T̂ given in Fig. 12.

The operation T̂X would contain the same multiplications (except for sign) as Tx, but would contain entirely different summations. Thus, we form an array of multipliers with t̂_ij stored at the nodes. The data X flows across the multiplier array. The results of each multiplication are transmitted to a connection network, which supplies the proper sign and assembles the summations inherent in Tx.
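These structural claims are easy to confirm numerically. A sketch we add, building T for N = 16 directly from the formula above; the final line recovers the magnitudes printed in Fig. 10 (025, 084, 187, 628, in units of 10^-3):

```python
import numpy as np

N = 16
L = N // 2 - 1
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
k = np.arange(1, L + 1)
# T_ij = (2/N) * sum_{k=1}^{L} sin[2*pi*(j - i)*k/N]
T = (2.0 / N) * np.sin(2 * np.pi * np.einsum("ij,k->ijk", j - i, k) / N).sum(-1)

assert np.allclose(T, -T.T)                 # property i): antisymmetry, T_ii = 0
assert np.allclose(T[:-1, :-1], T[1:, 1:])  # constant diagonals: T is banded
assert np.allclose(T[(j - i) % 2 == 0], 0)  # half of each row/column is zero
print(sorted(set(np.round(np.abs(T[0]), 3))))   # [0.0, 0.025, 0.084, 0.187, 0.628]
```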

VI. DISCUSSION

Banded matrices, when partitioned, have a natural repetitive structure. At a minimum, all the subblocks in the same block band are identical. This repetitive structure can be used to advantage in designing architectures which emphasize concurrent realizations of computations involving the matrix in question.

Where sparsity exists, the CTP may possibly be simplified by computational reassignment. With substantial sparsity, such as some zero bands, the computational reassignment concept can lead to a compaction of the map in question. Thus, systolic array processing and array multiplication followed by connection network processing are seen as distinct members of the same architectural family.

In our development, direct use was made of the banded characteristic in determining the CTP representation. When the matrix in question is arbitrary, the CTP expansion requires more substantial tools. For perspective we summarize these briefly.

Let T ∈ E^{n×n} denote an arbitrary matrix. Let ||T||^2 = tr TT* denote the Hilbert-Schmidt norm on T. Consider the problem of choosing L ∈ E^{p×p} and R ∈ E^{q×q} such that ||T - L∘R||^2 is minimized. More generally, if {L_i, R_i} is a collection of maps and

    A = Σ_i L_i ∘ R_i,

then the problem

    min_{{L_i, R_i}} ||T - A||^2

is at the heart of the CTP approximation concept.
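The machinery for this minimization is developed in [4] and [9]; for orientation only, one standard route to a best k-term approximant of this form (not necessarily the authors' construction) is the singular value decomposition of a blockwise rearrangement of T, as in nearest-Kronecker-product approximation. A sketch, with nearest_ctp_terms our own name:

```python
import numpy as np

def nearest_ctp_terms(T, p, q, k):
    """Best k-term sum of terms L_i o R_i (as matrices, kron(R_i^T, L_i))
    approximating T in the Hilbert-Schmidt norm.

    Standard nearest-Kronecker-product construction via the SVD of a
    blockwise rearrangement of T; the paper's own machinery is in [4], [9]."""
    n = p * q
    assert T.shape == (n, n)
    # Row (s, t) of the rearrangement holds the (s, t) p-by-p block of T.
    blocks = T.reshape(q, p, q, p).transpose(0, 2, 1, 3).reshape(q * q, p * p)
    U, s, Vt = np.linalg.svd(blocks, full_matrices=False)
    terms = []
    for i in range(k):
        B = (np.sqrt(s[i]) * U[:, i]).reshape(q, q)   # Kronecker factor, B = R_i^T
        L = (np.sqrt(s[i]) * Vt[i]).reshape(p, p)
        terms.append((L, B.T))
    return terms

# The 9 x 9 circulant of Fig. 1 is recovered exactly with k = 3 terms.
a = np.arange(1.0, 10.0)
idx = np.arange(9)
A = a[np.subtract.outer(idx, idx) % 9]
approx = sum(np.kron(R.T, L) for L, R in nearest_ctp_terms(A, 3, 3, 3))
assert np.allclose(A, approx)
```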


Fig. 10. The 16-point Hilbert transform.
Fig. 11. The compacted Hilbert transform.
Fig. 12. A Hilbert transform kernel.

In [4] and [9], the above minimization problem is considered in detail. It has been shown that the minimal CTP expansions have most of the characteristics of a Fourier series expansion, including a biorthogonality of the optimal {L_i∘R_i} elements. It is shown further that the CTP expansions converge finitely to the matrix in question.

REFERENCES

[1] H. T. Kung, L. M. Ruane, and D. W. L. Yen, "A two-level pipelined systolic array for convolutions," in VLSI Systems and Computations. Comput. Sci. Dep., Carnegie-Mellon Univ., 1981, pp. 255-264.
[2] K. Hwang and Y. H. Cheng, "VLSI computing structures for solving large-scale linear systems of equations," in Proc. Int. Conf. Parallel Processing, Aug. 1980, pp. 217-227.
[3] W. A. Porter and J. L. Aravena, "Cylindrical arrays for matrix multiplications," in Proc. 24th Allerton Conf., Oct. 1986.
[4] J. L. Aravena and W. A. Porter, "Array based design of digital filters," in Proc. ComCon 88, Baton Rouge, LA, Oct. 19-21, 1988.
[5] W. A. Porter and J. L. Aravena, "Orbital architectures with dynamic reconfiguration," Proc. Inst. Elec. Eng., Part E, vol. 134, no. 6, pp. 281-287, Nov. 1987.
[6] V. Cizek, "Discrete Hilbert transform," IEEE Trans. Audio Electroacoust., vol. AU-18, pp. 340-343, Dec. 1970.


[7] F. Bonzanigo, "A note on discrete Hilbert transform," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 99-100, Mar. 1972.
[8] F. E. Burris, "Matrix formulation of the discrete Hilbert transform," IEEE Trans. Circuits Syst., pp. 836-838, Oct. 1975.
[9] W. A. Porter, "Concurrent forms of signal processing algorithms," IEEE Trans. Circuits Syst., vol. 36, pp. 553-560, Apr. 1989.
[10] H. G. Yeh and H. Y. Yeh, "Implementation of the discrete Fourier transform on 2-dimensional systolic processors," Proc. Inst. Elec. Eng., Part G, vol. 134, no. 4, pp. 181-185, Aug. 1987.
[11] Concurrent Architectures for Signal Processing, W. A. Porter, Ed., special issue of Circuits, Syst., Signal Processing, vol. 7, 1988.

William A. Porter (S'56-M'61-SM'70) received the B.S.E. degree from Michigan Technological University in 1957, and the M.S.E. and Ph.D. degrees from the University of Michigan in 1958 and 1961, respectively.

From 1961 to 1977 he served on the faculty of the University of Michigan, where he was appointed Professor of Electrical and Computer Engineering in 1969. In 1977 he assumed the duties of Chairman of Electrical and Computer Engineering at Louisiana State University. He is currently an Eminent Scholar in Computer Engineering at the University of Alabama in Huntsville. His current interests include array architectures, fast signal processing algorithms, distributive memory computing, and multidimensional system theory. He is an Associate Editor of the journals Circuits, Systems, and Signal Processing and Multidimensional Systems and Signal Processing. He served as General Chairman of ComCon 88, and has organized several special issues in various journals.