CONSTANT TIME INNER PRODUCT AND MATRIX
COMPUTATIONS ON PERMUTATION NETWORK PROCESSORS∗

Ming-Bo Lin
Electronic Engineering Department
National Taiwan Institute of Technology
43, Keelung Road Section 4, Taipei, Taiwan

and

A. Yavuz Oruç
Electrical Engineering Department
Institute of Advanced Computer Studies
University of Maryland, College Park, MD 20742-3025
E-mail: [email protected]
Telephone: (301) 405-3663
ABSTRACT
Inner product and matrix operations find extensive use in algebraic computations. In this paper, we introduce a new parallel computation model, called a permutation network processor, to carry out these computations efficiently. Unlike traditional parallel computer architectures, computations on this model are carried out by composing permutations on permutation networks. We show that the sum of N algebraic numbers on this model can be computed in O(1) time using N processors. We further show that the real and complex inner product and matrix multiplication can both be computed on this model in O(1) time at the cost of O(N) and O(N³), respectively, for N-element vectors and N × N matrices. These results compare well with the time and cost complexities of other high-level parallel computer models such as the PRAM and CRCW PRAM.

Index Terms: complex inner product, complex matrix multiplication, permutation networks, real inner product, and real matrix multiplication.
∗This work is supported in part by the Ministry of Education, Taipei, Taiwan, Republic of China and in part by the Minta Martin Fund of the School of Engineering at the University of Maryland.
1 Introduction
Inner product and matrix operations form the core of computations of vector and
array processors [5,7]. Many algorithms in signal and image processing, pattern
recognition, computer vision, and computer tomography require extensive use of
such operations [5,12,14,18]. Therefore, it is desirable to find novel parallel archi-
tectures to compute these operations fast and efficiently.
Traditional architectures for carrying out such operations are based on reducing
vector computations into scalar operations such as binary addition and multiplica-
tion [20,21]. As a result, much of the computations in vector and array processors
is handled by conventional arithmetic circuits such as carry look ahead adders,
recoded and cellular array multipliers and dividers [6]. While these conventional
circuits are optimized for speed and hardware, they still rely on a variety of building
blocks such as adder, subtractor and multiplier cells which often lead to nonuniform
arithmetic circuits for vector processors.
In this paper, we propose a new concept to carry out vector and matrix computa-
tions. Unlike the traditional architectures, this concept is based on coding not only
the operands but also the operations over the operands in such a way that a vector
or matrix computation reduces to composing permutation maps. Each operand is
coded into a permutation and addition or multiplication of two operands is car-
ried out by composing the permutations that correspond to these operands on a
permutation network. As a result, both addition and multiplication are reduced
to a single computation, i.e., that of composing permutations. Furthermore, any
other computation involving addition, subtraction and multiplication operations
are also reduced to just composing permutations.
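As a minimal software illustration of this idea (a sketch of the coding principle only, not of the paper's hardware), each operand in Z_M can be coded as a cyclic permutation so that composing two coded operands adds them modulo M:

```python
# Toy illustration: encode each operand X in Z_M as the cyclic permutation
# pi^X (rotation by X positions), so that composing two such permutations
# adds the operands modulo M.
M = 16

def encode(x):
    """Return the permutation (as a tuple) mapping j -> (j + x) mod M."""
    return tuple((j + x) % M for j in range(M))

def compose(p, q):
    """Compose permutations: (p o q)(j) = p(q(j))."""
    return tuple(p[q[j]] for j in range(M))

def decode(p):
    """Recover x from pi^x: it is the image of 0 under the permutation."""
    return p[0]

# 5 + 7 + 9 mod 16 = 5, computed purely by composing permutations.
s = compose(encode(9), compose(encode(7), encode(5)))
print(decode(s))  # -> 5
```

Decoding simply reads off the image of 0, since π^x sends 0 to x; the processors described below realize the same composition with networks of shift stages rather than explicit permutation tables.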
We show that, on this new computation model, called a permutation network
processor, the sum of N n-bit numbers, the inner product of two vectors, each
containing N n-bit elements, and the multiplication of two N × N matrices with
n-bit entries can all be computed in O(1) steps. The first two computations require O(N) processors and the matrix multiplication requires O(N³) processors, where each processor handles an n-bit input, and has O((n + lg N)²) bit-level cost and O(n + lg N) bit-level delay.
We note that these results compare well with the complexities for the same computations on other models. For example, on a PRAM model [1,8], all three computations take O(lg N) time with the same numbers of processors, where each processor has two O(n + lg N)-bit inputs, and uses arithmetic circuits with O(n² + lg N) bit-level cost. On a cube-connected parallel computer, the same three computations
level cost. On a cube-connected parallel computer, the same three computations
also take O(lgN) time with the same numbers of processors and with the same pro-
cessor bit-level complexity [1]. On the combining CRCW PRAM, the same three
computations can all be done in O(1) time and with the same numbers of processors, where each processor has two O(n + lg N)-bit inputs, and with O(n² + lg N) bit-level cost. In addition, this model must have a circuit to combine up to N
concurrent writes.
We also note that, even though the permutation network processor model stands
on its own, it ties with some earlier computation models that were reported in the
literature. One such model, called a processing network, was given in [19], where a
mesh of processing elements was used to compute certain algebraic formulas. The
processing elements in this model can be programmed for arithmetic and routing
functions whose combinations lead to various algebraic expressions on the mesh
topology. More recently, a new parallel computer model, called a reconfigurable
bus system, has been introduced to solve a wide range of problems including sorting
problems [23], graph problems [13,3], and string problems [2]. All these problems
have been shown to be solvable in O(1) time on the reconfigurable bus system
model. As in the processing network model, processors are connected in this model
by some fixed topology such as the mesh, and each processor can be programmed
for some data processing as well as routing functions. It is assumed that the
signals can be broadcast between processors in constant time regardless of how far
the broadcast is carried [3,11],[22]-[26]. The essence of this assumption is that
once the processors are simultaneously programmed for some routing functions,
the signals that pass through them only encounter a propagation delay which is
short enough so as to be considered a constant. The same assumption also holds
for our model.
The remainder of this paper is organized as follows. In Section 2, we provide a brief
overview of permutation network processors. In Section 3, we show how real and
complex inner products can be performed on our permutation network processor
in O(1) time. In this section, we also describe how to sum N n-bit signed numbers
in O(1) time on the same model. In Section 4, we use these inner product units
to carry out the multiplication of two N × N matrices in O(1) time using O(N³) permutation network processors. The paper is concluded in Section 5.
2 The Permutation Network Processor Model
The computations to be described in subsequent sections all rely on a permutation
network processor model which was introduced in [10,15,16]. Here we give a brief
overview of this model and introduce some changes so as to make it suitable for
these computations. The reader is referred to these citations for a more detailed
account.
A. The Model
A permutation network processor is obtained by cascading three components to-
gether as shown in Figure 1: an r-out-of-s residue encoder, a permutation network
stage, and an r-out-of-s residue decoder, where r and s are positive integers. The
r-out-of-s residue encoder has m inputs, representing an m-bit number X, and r
sets of outputs, X1, X2, . . . , Xr, where Xi contains mi outputs, for i = 1, 2, . . . , r.
Based on the value of its input X, the r-out-of-s residue encoder sets exactly one
Figure 1: Organization of a permutation network arithmetic processor.
output in each Xi to 1 and all other outputs to 0. More precisely, the jth output in
Xi is set to 1, where j = X mod mi. The r-out-of-s residue decoder is an r-out-of-s
residue encoder whose inputs and outputs are switched. That is, it has r sets of
inputs R1, R2, . . . , Rr, each of which contains a 1 in exactly one of its inputs, and a
set of n outputs that represents an n-bit result. The 1’s in R1, R2, . . . , Rr are com-
bined into an n-bit result whose residues with respect to moduli m1,m2, . . . ,mr
are indicated by the positions of 1’s in R1, R2, . . . , Rr. The variable s specifies the
number of outputs (inputs) of the r-out-of-s residue encoder (decoder). That is,
s = m1 +m2 + . . . +mr.
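The encoder/decoder pair can be sketched in software as follows. The moduli 3, 5, 7 (so s = 15 and M = 105) are the example values used later in Section 3, and the decoder shown reconstructs the result with the Chinese remainder theorem:

```python
# Sketch of the r-out-of-s residue encoder/decoder for pairwise relatively
# prime moduli m1, m2, m3 = 3, 5, 7 (s = 15, M = 105).
moduli = [3, 5, 7]

def residue_encode(x):
    """Set output j = x mod m_i to 1 in each group X_i; all others to 0."""
    groups = []
    for m in moduli:
        one_hot = [0] * m
        one_hot[x % m] = 1
        groups.append(one_hot)
    return groups

def residue_decode(groups):
    """Combine the positions of the 1's back into a number via the CRT."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for m, one_hot in zip(moduli, groups):
        r = one_hot.index(1)          # residue indicated by the 1
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # CRT reconstruction term
    return x % M

print(residue_decode(residue_encode(34)))  # -> 34
```

The modular inverse `pow(Mi, -1, m)` requires Python 3.8 or later; the hardware decoder in [10] realizes the same reconstruction combinationally.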
The center stage of the permutation network arithmetic processor consists of a
permutation network with s inputs and s outputs. In addition to s inputs that
are connected to the outputs of the r-out-of-s residue encoder, the permutation
network also has an n-bit control input that represents the second operand Y to
the arithmetic processor. The permutation network encompasses r subnetworks
N1, N2, . . . , Nr, where the lines in Xi from the r-out-of-s residue encoder form the
inputs of network Ni and the lines in Ri form its outputs. Network Ni consists
of ⌈lg mi⌉ stages of switches, numbered ⌈lg mi⌉ − 1, …, 1, 0 in that order, from left to right, each having mi inputs and mi outputs, such that the switch in stage k, k = 0, 1, …, ⌈lg mi⌉ − 1, has two switching states:

state 0: input j is connected to output j, for all j = 0, 1, …, mi − 1;
state 1: input j is connected to output j + 2^k mod mi, for all j = 0, 1, …, mi − 1.
Thus, the switch in stage k of network Ni realizes either the identity permutation
on its inputs (state 0) or the circular right shift permutation where all inputs are
circularly shifted to the right by 2^k mod mi positions (state 1). The permutations
that are realized by networks N1, N2, …, Nr are determined by the residues yi = Y mod mi; 1 ≤ i ≤ r. The residue yi is computed from Y by the binary residue encoder in Figure 1, which converts Y mod mi to its binary representation. The
kth least significant bit of yi then determines the state of the switch in the kth
stage in network Ni. If that bit is 0 then the stage is set to state 0 and if it is 1
then the stage is set to state 1.
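A behavioral sketch of one subnetwork Ni, with the stage states driven by the bits of yi as just described, shows that routing a one-hot input through the stages realizes modulo-mi addition (this is a simulation of the routing behavior, not a gate-level model):

```python
# Behavioral sketch of subnetwork N_i: ceil(lg m_i) stages, where stage k
# either passes its inputs through (state 0) or circularly shifts them by
# 2^k mod m_i (state 1), the state being bit k of y_i = Y mod m_i.
from math import ceil, log2

def subnetwork(one_hot, y, m):
    """Route a one-hot vector of width m through the shift stages for y."""
    stages = max(1, ceil(log2(m)))
    out = list(one_hot)
    for k in range(stages):
        if (y >> k) & 1:  # stage k in state 1: rotate right by 2^k mod m
            shift = (1 << k) % m
            out = [out[(j - shift) % m] for j in range(m)]
    return out

# A 1 entering on line x leaves on line (x + y) mod m: modulo-m addition.
m, x, y = 5, 3, 4
vec = [0] * m
vec[x] = 1
print(subnetwork(vec, y, m).index(1))  # -> 2, i.e. (3 + 4) mod 5
```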
In most of the computations that follow, we will need to cascade several permu-
tation network processors together. In such cases, the r-out-of-s residue encoders
and decoders are redundant and will be removed from the model. In this reduced
model, we only retain the permutation network and binary residue encoder.
B. The Cost and Time Assumptions
In the reduced permutation network processor model, each processor has an n-
bit input and r simple shift networks with m1, m2, …, mr inputs. These networks, when combined together, provide a numerical range extending from 0 to m1 m2 … mr − 1 for unsigned numbers and from −⌊m1 m2 … mr / 2⌋ to ⌊m1 m2 … mr / 2⌋ for signed numbers in 2's complement form.
Let M = m1m2 . . .mr. It was shown in [10] that a permutation network processor
encompassing r subnetworks with m1,m2, . . . ,mr inputs can be constructed by
using O(lg² M) logic gates. Two of the computations to be implemented on permutation network processors, namely, the sum of N n-bit 2's complement numbers and the inner product of two N-element vectors, each of whose elements is an n-bit 2's complement number, require that N 2^(n−1) ≈ M. Thus, each of the N permutation network processors needed for these two computations can be constructed with O((lg N + n)²) logic gates.
The third computation, i.e., the product of two N × N matrices, can be carried out by performing N² inner product operations. Thus, the matrix multiplication problem can be solved by using N³ permutation network processors, each constructed from O((lg N + n)²) logic gates.
As for the delay of the permutation network processors needed for these three
computations, it was shown in [10] that a permutation network processor encompassing r subnetworks with m1, m2, …, mr inputs has O(lg M) bit-level delay. Given that M ≈ 2^n N, it follows that all three computations mentioned above can be performed by using permutation network processors with O(n + lg N) bit-level delay.
We point out that these bit-level cost and delay complexities are comparable with
those for other parallel computer models. The last two computations require a multiplier circuit for n-bit operands, and this exacts O(n²) bit-level cost to attain O(lg n) bit-level delay regardless of the model used. Given this, in obtaining the cost and time complexities of the algorithms that follow, we will assume that our permutation network processors have constant cost and constant time, where the cost and time are expressed at the word level as in other parallel computer models.
3 Inner Product Processors
In this section, we show how to sum N n-bit numbers and compute real and
complex inner products using permutation network processors.
A. Summation of N n-bit numbers
Assume that N n-bit numbers to be added together are all in 2’s complement
form. This implies that the sum can be as large as 2^(n−1) N and, to avoid a possible overflow, the dynamic range M of the permutation network processor must satisfy 2^(n−1) N ≤ M/2, that is, 2^n N ≤ M.
As described in [10], the set Z_M = {0, 1, …, M − 1} under addition modulo M forms a group which is isomorphic to a cyclic permutation group (⟨π⟩, ·) of order M, generated by a permutation π defined over {0, 1, …, M − 1}. The isomorphism between (Z_M, +_M) and (⟨π⟩, ·) is fixed by mapping a generator of Z_M onto π. Now let M = m1 m2 … mr, where mi and mj are relatively prime for all i ≠ j; 1 ≤ i, j ≤ r. For Z_M, we fix 1 as its generator and let π be the product of r disjoint cycles, π1, π2, …, πr, where

π1 = (0 1 … m1 − 1)

πi = (m1 + … + mi−1   m1 + … + mi−1 + 1   …   m1 + … + mi−1 + mi − 1), 2 ≤ i ≤ r.  (1)
Under the isomorphism fixed by mapping 1 to π, element X ∈ Z_M is mapped to π^X = (π1 π2 … πr)^X or, since π1, π2, …, πr are disjoint, X is mapped to π1^X π2^X … πr^X. As a consequence, the sum X1 + X2 + … + XN modulo M, with Xi ∈ Z_M for 1 ≤ i ≤ N, is mapped to

π^(X1 +_M X2 +_M … +_M XN) = π1^(X1 +_M X2 +_M … +_M XN) π2^(X1 +_M X2 +_M … +_M XN) … πr^(X1 +_M X2 +_M … +_M XN)  (2)
where +_M denotes modulo M addition. Since πi is a cycle of length mi,

π^(X1 +_M X2 +_M … +_M XN) = π1^(X1 +_m1 X2 +_m1 … +_m1 XN) π2^(X1 +_m2 X2 +_m2 … +_m2 XN) … πr^(X1 +_mr X2 +_mr … +_mr XN)  (3)

or, by Horner's rule,

π^(X1 +_M X2 +_M … +_M XN) = π1^((X1 +_m1 X2) +_m1 … +_m1 XN) π2^((X1 +_m2 X2) +_m2 … +_m2 XN) … πr^((X1 +_mr X2) +_mr … +_mr XN),  (4)

where +_mi denotes modulo mi addition; 1 ≤ i ≤ r.
To compute Equation (4), we cascade N permutation network processors together
and each operand Xi; 1 ≤ i ≤ N, is converted into its corresponding residue code (xi1, xi2, …, xir) by using a binary residue encoder such as one of those described in [10]. These converted residue codes (xi1, xi2, …, xir); 1 ≤ i ≤ N, are then used to set up the switching states of the corresponding permutation networks. The complete
algorithm for computing the summation of N n-bit numbers on a permutation
network processor is then given as follows.
Algorithm 1 (Summation of N n-bit 2’s complement numbers)
Input: N n-bit 2’s complement numbers, X1, X2, . . . , XN .
Output: Sum of X1, X2, . . . , XN in 2’s complement form.
Method:
Step 1: Convert the Xi's into their corresponding binary residue codes (xi1, xi2, …, xir),
in parallel.
Step 2: Add X1, X2, . . . , XN .
Step 2.1: Set up the switching states of permutation network processors in
parallel by the binary residue codes obtained in Step 1.
Figure 2: The permutation network processor shown to compute the sum 13 +_105 12 +_105 9 = 34.
Step 2.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1
with a “1” and all the other input lines with a “0.”
Step 3: Decode the r-out-of-s residue code obtained at the outputs of the last
permutation network processor in the cascade into its binary equivalent.
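Algorithm 1 can be mimicked in software as follows. The moduli 3, 5, 7 are the example values of Figure 2, the channel-wise additions stand in for the cascaded permutation networks, and the final signed correction reflects the 2's complement dynamic range discussed above:

```python
# End-to-end sketch of Algorithm 1: sum N n-bit 2's complement numbers by
# adding their residues channel-wise, then decoding with the CRT. The
# moduli are chosen so that 2^n * N <= M, as required to avoid overflow
# (here M = 105 >= 2^5 * 3).
moduli = [3, 5, 7]
M = 105

def crt(residues):
    """Chinese remainder reconstruction of the unsigned result."""
    x = 0
    for m, r in zip(moduli, residues):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

def summation(xs):
    acc = [0] * len(moduli)  # stands in for the cascade of channel adders
    for x in xs:
        acc = [(a + x) % m for a, m in zip(acc, moduli)]
    val = crt(acc)
    return val - M if val > M // 2 else val  # map back to signed range

print(summation([13, 12, 9]))   # -> 34  (the example of Figure 2)
print(summation([13, -12, 9]))  # -> 10  (the example of Figure 3)
```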
Figure 2 shows an example with N = 3, n = 5, and X1 = 13, X2 = 12, and X3 = 9.
The light lines indicate the paths of “1” between inputs and outputs. To avoid a
possible overflow, the dynamic range of the processor is chosen so that it satisfies
Figure 3: The permutation network processor shown to compute 13 −_105 12 +_105 9 = 10.
2^n N ≤ M, that is, 3 × 2^5 ≤ M. Therefore, M is picked as 105 = 3 × 5 × 7. We note that the summands can be negative as well. To illustrate this, Figure 3 depicts
how 13−105 12 +105 9 = 10 is computed on the same network given in Figure 2.
This algorithm requires O(N) processors and since it does not contain any loop
and each step takes O(1) time, the total execution time of this algorithm is O(1).
B. Real Inner-Product Processor
Let X = (X1, X2, . . . , XN) and Y = (Y1, Y2, . . . , YN) be two N -element vectors,
where Xi, Yi ∈ ZM = {0, 1, . . . ,M − 1}. Then the real inner product of X and Y
is a real-valued function defined as

X · Y = Σ_{j=1}^{N} Xj Yj.  (5)
To compute Xj Yj on a permutation network processor, we recall that multiplication modulo M over the set Z_M forms a monoid. Let Z_mi = {0, 1, 2, …, mi − 1} and let (Z_mi, ×_mi) denote the monoid under modulo mi multiplication of elements in Z_mi; 1 ≤ i ≤ r. Let M = m1 m2 … mr, where m1, m2, …, mr are all primes. Let Z_M = Z_m1 ⊗ Z_m2 ⊗ … ⊗ Z_mr denote the direct product of the Z_mi; 1 ≤ i ≤ r, whose elements are r-tuples (x1, x2, …, xr), where xi ∈ Z_mi. For any (x1, x2, …, xr), (y1, y2, …, yr) ∈ Z_M, define

(x1, x2, …, xr) ⊗ (y1, y2, …, yr) = (x1 ×_m1 y1, x2 ×_m2 y2, …, xr ×_mr yr).  (6)

Now we carry the product Xj Yj over onto (Z_M, ⊗) by setting up a monoid isomorphism f between (Z_M, ×_M) and this direct product as follows:

f(Xj) = (xj,1, xj,2, …, xj,r), ∀ Xj ∈ Z_M, 1 ≤ j ≤ N,  (7)
where xj,i = Xj mod mi; 1 ≤ i ≤ r. Let Xj, Yj ∈ ZM . It is easy to show that
f(1) = (1, 1, . . . , 1) and
f(Xj · Yj) = f(Xj)⊗ f(Yj) (8)
and hence f is an isomorphism, and using this isomorphism, we can compute the
real inner product X ·Y in ZM as
X · Y = Σ_{j=1}^{N} Xj Yj

     = Σ_{j=1}^{N} (xj,1, xj,2, …, xj,r) ⊗ (yj,1, yj,2, …, yj,r)

     = (Σ_{j=1}^{N} xj,1 ×_m1 yj,1, Σ_{j=1}^{N} xj,2 ×_m2 yj,2, …, Σ_{j=1}^{N} xj,r ×_mr yj,r).  (9)
Figure 4: Permutation network modulo 5 multiplier (generator g = 2; the example shown computes X = 3 times Y = 3, giving R = 4).
The corresponding permutation network real inner product processor is constructed
as follows. For each modulus, mi, N modulo mi permutation network processors
are cascaded. One operand of each processor comes from the previous stage and
the other operand comes from the output of a modulo mi permutation network
multiplier. An example of a modulo 5 permutation network multiplier is shown in Figure 4, and a detailed description of the modulo mi permutation network multipliers can be found in [10].
The following algorithm shows how the inner product is carried out using such
multipliers.
Algorithm 2 (Computing the real inner product of two vectors)
Input: Two vectors X = (X1, X2, . . . , XN) and Y = (Y1, Y2, . . . , YN), where each
Xi and Yi is an n-bit 2’s complement number.
Output: The real inner product of X and Y represented in 2’s complement form.
Method:
Step 1: Convert each element of X and Y into its corresponding residue code in
parallel.
Step 2: Multiply xji and yji for 1 ≤ j ≤ N and 1 ≤ i ≤ r in parallel.
Step 3: Compute the inner product by adding together the products obtained in
Step 2 over the residue code domain.
Step 3.1: Set up the switching states of permutation network processors in
parallel by the binary residue codes obtained in Step 2.
Step 3.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1
with a “1” and all the other input lines with a “0.”
Step 4: Decode the r-out-of-s residue code obtained at the outputs of the last
permutation network processor in the cascade into its binary equivalent.
An example with N = 3 and n = 3 is shown in Figure 5. It is easy to see that
Algorithm 2 requires O(N) processors and its total execution time is O(1) since
each step takes O(1) time and there is no loop inside the algorithm.
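Algorithm 2 can be sketched in software as below, with Steps 2 and 3 collapsed into one channel-wise pass. The moduli here are illustrative; any pairwise relatively prime set whose product M covers the dynamic range of the result works:

```python
# Sketch of Algorithm 2: the real inner product computed entirely in the
# residue domain -- one modular multiply and one modular add per channel --
# followed by a single CRT decode, as in Equation (9).
moduli = [3, 5, 7, 11, 13]  # illustrative choice; M = 15015
M = 15015

def crt(residues):
    x = 0
    for m, r in zip(moduli, residues):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

def inner_product(xs, ys):
    acc = [0] * len(moduli)
    for x, y in zip(xs, ys):
        # Step 2: channel-wise modular products; Step 3: channel-wise sums.
        acc = [(a + (x % m) * (y % m)) % m for a, m in zip(acc, moduli)]
    val = crt(acc)
    return val - M if val > M // 2 else val  # back to signed range

X = [3, -1, 4, 1]
Y = [2, 7, -1, 8]
print(inner_product(X, Y))  # -> 3
```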
C. Complex Inner-Product Processor
The inner product of two complex vectors U = (U1, U2, . . . , UN) and V = (V1, V2, . . . , VN)
is a complex-valued function defined as follows:
U · V = Σ_{j=1}^{N} Uj Vj*,  (10)

where Uj and Vj; 1 ≤ j ≤ N, are complex elements, and Vj* denotes the complex conjugate of Vj.
Let Uj = Re(Uj) + i Im(Uj) and Vj = Re(Vj) + i Im(Vj), where i = √−1. Then

U · V = Σ_{j=1}^{N} {Re(Uj) + i Im(Uj)}{Re(Vj) − i Im(Vj)}

     = Σ_{j=1}^{N} {Re(Uj) Re(Vj) + Im(Uj) Im(Vj)} + i Σ_{j=1}^{N} {Im(Uj) Re(Vj) − Re(Uj) Im(Vj)}.  (11)
Figure 5: The permutation network processor for computing the real inner product of X and Y (N = 3 and n = 3).
Therefore, the complex inner product can be computed by carrying out four real inner products. Let ⊕ and ⊖ be binary operators defined on Z_M by

(x1, x2, …, xr) ⊕ (y1, y2, …, yr) = (x1 +_m1 y1, x2 +_m2 y2, …, xr +_mr yr)  (12)

(x1, x2, …, xr) ⊖ (y1, y2, …, yr) = (x1 −_m1 y1, x2 −_m2 y2, …, xr −_mr yr),  (13)

where (x1, x2, …, xr), (y1, y2, …, yr) ∈ Z_M. Using the isomorphism constructed in the previous section, each real inner product can then be computed in Z_M as in Equation (9), and hence

U · V = Σ_{j=1}^{N} {Re(uj,1, …, uj,r) ⊗ Re(vj,1, …, vj,r) ⊕ Im(uj,1, …, uj,r) ⊗ Im(vj,1, …, vj,r)}
      + i Σ_{j=1}^{N} {Im(uj,1, …, uj,r) ⊗ Re(vj,1, …, vj,r) ⊖ Re(uj,1, …, uj,r) ⊗ Im(vj,1, …, vj,r)}

    = (Σ_{j=1}^{N} (Re(uj,1) ×_m1 Re(vj,1) +_m1 Im(uj,1) ×_m1 Im(vj,1)), …, Σ_{j=1}^{N} (Re(uj,r) ×_mr Re(vj,r) +_mr Im(uj,r) ×_mr Im(vj,r)))
      + i (Σ_{j=1}^{N} (Im(uj,1) ×_m1 Re(vj,1) −_m1 Re(uj,1) ×_m1 Im(vj,1)), …, Σ_{j=1}^{N} (Im(uj,r) ×_mr Re(vj,r) −_mr Re(uj,r) ×_mr Im(vj,r))).  (14)
The complex inner product processor can then be constructed by computing the
real and imaginary parts on two permutation network real inner product proces-
sors. These real inner product processors require some minor modifications. For
the real part, the actual inner product consists of two real inner products, one
combining the real parts of U and V and the other combining their imaginary
parts. Thus the subnetwork for each modulus in each processor is obtained by
cascading two multiplication and two addition units. For the imaginary part, the
actual inner product also consists of two real inner products, however, these two
inner products are not added together; rather the second one is subtracted from
the first one. Therefore, we cascade one multiplication and addition unit with a
multiplication and subtraction unit. The subtraction unit can be implemented
by a circular left shift permutation network subtracter as described in [10]. The
general structure for the permutation network complex inner product processor is
shown in Figures 6 and 7. Figure 6 is the real part and Figure 7 is the imaginary
part. We note that the sum expressions on the left hand side are just for labeling
the inputs and do not imply any summation.
Algorithm 3 (Computing the complex inner product of two vectors)
Input: Two vectors U = (U1, U2, . . . , UN) and V = (V1, V2, . . . , VN). Uj and Vj
are complex numbers in the form a + ib, where a and b are n-bit 2's complement
numbers.
Output: The complex inner product of U and V represented in the form c + id,
where c and d are 2’s complement numbers.
Method:
Step 1: Convert each element of U and V into its corresponding residue code in
parallel. (Note that the real part and the imaginary part use separate residue encoders.)
Step 2: Compute the real multiplications, Re(uji)Re(vji), Im(uji)Im(vji),
Im(uji)Re(vji), and Re(uji)Im(vji), for 1 ≤ j ≤ N and 1 ≤ i ≤ r in parallel.
Step 3: Compute the real and imaginary parts of the complex inner product by
adding together the corresponding products obtained in Step 2 according to Equation (14) over the residue code domain.

Figure 6: The real part of a permutation network complex inner-product processor.

Figure 7: The imaginary part of a permutation network complex inner-product processor.
Step 3.1: Set up the switching states of permutation network processors
(adders and subtracters) in parallel by the binary residue codes obtained
in Step 2.
Step 3.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1
with a “1” and all the other input lines with a “0.”
Step 4: Decode the r-out-of-s residue codes obtained at the outputs of the last
permutation network processors in the real and imaginary parts into their
binary equivalents.
As in the previous algorithm, it is easy to see that the total execution time of Algorithm 3 is O(1). Its total cost is O(N), as the real and the imaginary parts require N processors each. In more exact terms, for the real part, each processor
consists of two permutation network adders in cascade. For the imaginary part,
each processor consists of a permutation network adder and a permutation network
subtracter. Therefore, a total of 4N processors are required by this algorithm.
4 Matrix Multiplication
In this section, we use permutation network inner product processors to carry out
real and complex matrix multiplications.
A. Real Matrix Multiplication
Let A and B be N × N real matrices and let C = A × B. Let Aj be the jth row of A and Bk be the kth column of B. Then C can be computed by using the following procedure:
For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Cj,k = Aj · Bk
    Endfor
Endfor.
Hence, if we use N² inner product processors in parallel, the multiplication of two
real matrices can be computed in O(1) steps as follows.
Algorithm 4 (Computing the product of two real matrices)

Input: Two N × N real matrices A and B. Each element of A and B is an n-bit 2's complement number.

Output: The real product matrix C = A × B.

Method:
Step 1: Convert each element of A and B into its corresponding residue code in
parallel.
Step 2: Compute the N² inner products using Algorithm 2.
A schematic version of this algorithm is depicted in Figure 8. As can be seen from
the figure, all inner product processors operate in parallel. Since each executes in
O(1) time, Algorithm 4 takes O(1) time to execute. It needs a total of N² inner product processors, each having O(N) cost, and hence its cost is O(N³).
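Algorithm 4 can be sketched in software as below; the two loops would be the N² parallel inner product processors of Figure 8:

```python
# Sketch of Algorithm 4: C = A x B computed as N^2 independent inner
# products (row of A with column of B). On the proposed hardware all N^2
# run in parallel; here they are simply looped.
def inner_product(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def matmul(A, B):
    N = len(A)
    cols = [[B[i][k] for i in range(N)] for k in range(N)]  # columns of B
    return [[inner_product(A[j], cols[k]) for k in range(N)]
            for j in range(N)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # -> [[19, 22], [43, 50]]
```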
B. Complex Matrix Multiplication
Now we consider the multiplication of two complex matrices. Let R = A + iB and S = C + iD be two complex matrices of size N × N. Let T = RS = E + iF be the product matrix. Then
T = RS = (A + iB)(C + iD) = (AC − BD) + i(AD + BC),  (15)
and hence the complex matrix multiplication is reduced to four real matrix multi-
plications, one addition, and one subtraction. Let Aj and Bj denote the jth rows
Figure 8: Real matrix multiplication on permutation network inner-product processors. (Each rectangular box represents a permutation network inner product processor.)
of A and B, respectively, and Ck and Dk denote the kth columns of C and D, respectively. Thus, to compute E = AC − BD, we use the following procedure.
For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Ej,k = Aj · Ck − Bj · Dk
    Endfor
Endfor.
The inner equation is a difference of two inner products. Comparing this to the
imaginary part of the complex inner product, it is easy to see that they have the
same algebraic form. Therefore, matrix E can be computed using Algorithm 4 by
replacing the inner product processor in the algorithm with the imaginary part of
the complex inner product processor shown in Figure 7. This amounts to replacing
each of the inner product processors in Figure 8 by the imaginary inner product
processor given in Figure 7.
Likewise, to compute F = AD + BC, we use the procedure:

For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Fj,k = Aj · Dk + Bj · Ck
    Endfor
Endfor.
The inner equation entails two inner products as in the real part of the complex
inner product processor shown in Figure 6. Thus F can be computed using Algo-
rithm 4.
Combining these facts, we conclude that the multiplication of two N × N complex matrices can be carried out in O(1) time using O(N³) permutation network processors.
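The reduction of Equation (15) can be checked directly in software; the matrices below are arbitrary example values:

```python
# Check of Equation (15): RS = (AC - BD) + i(AD + BC), so one complex
# matrix product reduces to four real matrix products plus one matrix
# addition and one matrix subtraction.
def matmul(A, B):
    N = len(A)
    return [[sum(A[j][i] * B[i][k] for i in range(N)) for k in range(N)]
            for j in range(N)]

def matadd(A, B, sign=1):
    """Entry-wise A + B (sign=1) or A - B (sign=-1)."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A, B = [[1, 2], [3, 4]], [[0, 1], [1, 0]]  # R = A + iB
C, D = [[2, 0], [0, 2]], [[1, 1], [1, 1]]  # S = C + iD
E = matadd(matmul(A, C), matmul(B, D), sign=-1)  # real part: AC - BD
F = matadd(matmul(A, D), matmul(B, C))           # imaginary part: AD + BC
print(E, F)
```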
5 Concluding Remarks
In this paper, we proposed permutation network processors to compute algebraic
sums, inner products, and matrix products. It has been shown that the algebraic sum of N n-bit
2's complement numbers can be computed in O(1) time on such a processor with
O(N) cost. The inner product of two N -element vectors (both real and complex)
with n-bit elements can also be done in O(1) time using a similar processor with
O(N) cost. On the other hand, the matrix multiplication takes O(1) time but with
O(N3) permutation network processors.
These results are important in that they establish that one can avoid using conven-
tional adder and multiplier circuits to carry out vector and matrix computations.
It will be worthwhile to extend them to other computations such as discrete transforms, convolutions, and correlations. We anticipate that all of these
computations can also be done in O(1) time with O(N2) cost, and we plan to present
our results on these computations elsewhere.
References
[1] Selim G. Akl, “The design and analysis of parallel algorithms,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1989.
[2] D. M. Champion and J. Rothstein, “Immediate parallel solution of the longest common subsequence problem,” Proc. IEEE Int. Conf. Parallel Processing, pp. 70-77, 1987.
[3] Gen-Huey Chen, Bing-Feng Wang, and Chi-Jen Lu, “On the parallel computation of the algebraic path problem,” IEEE Trans. Parallel and Distributed Systems, vol. 3, pp. 251-256, March 1992.
[4] K. M. Elleithy, M. A. Bayoumi, and K. P. Lee, “θ(lg N) architecture for RNS arithmetic decoding,” IEEE 9th Symposium on Computer Arithmetic, pp. 202-209, 1989.
[5] John L. Gustafson, “Computer-intensive processors and multicomputers,” in Parallel Processing for Supercomputers and Artificial Intelligence, Kai Hwang and Douglas Degroot, Eds., pp. 169-201, 1989.
[6] Kai Hwang, “Computer arithmetic: principle, architecture, and design,” John Wiley & Sons Pub., 1979.
[7] Kai Hwang and Faye A. Briggs, “Computer architecture and parallel processing,” McGraw-Hill Pub., 1984.
[8] S. Lakshmivarahan and Sudarshan K. Dhall, “Analysis and design of parallel algorithms,” McGraw-Hill Pub., 1990.
[9] Ming-Bo Lin and A. Yavuz Oruç, “The design of a network-based arithmetic processor,” UMIACS-TR-91-141, University of Maryland, College Park, MD, October 1991.
[10] Ming-Bo Lin, “Unified Algebraic Computations on Permutation Networks,” Ph.D. Dissertation, Electrical Engineering Department, University of Maryland, College Park, 1992.
[11] Massimo Maresca and Hungwen Li, “Connection autonomy in SIMD computers: A VLSI implementation,” Journal of Parallel and Distributed Computing, vol. 7, pp. 302-320, 1989.
[12] R. M. Mersereau and D. E. Dudgeon, “Multidimensional signal processing,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1985.
[13] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. F. Stout, “Data movement operations and applications on reconfigurable VLSI arrays,” Proc. IEEE Int. Conf. Parallel Processing, pp. 205-208, 1988.
[14] A. V. Oppenheim and R. W. Schafer, “Digital signal processing,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1975.
[15] A. Yavuz Oruç, Vinod G. J. Peris, and M. Yaman Oruç, “Parallel modular arithmetic on a permutation network,” Proceedings of 1991 International Conference on Parallel Processing, vol. 1, pp. 706-707, St. Charles, IL, Aug. 1991.
[16] Vinod G. J. Peris, “Fast modular arithmetic on permutation networks,” Master's Thesis, Electrical Engineering Department, University of Maryland, College Park, MD, 1992.
[17] Stanislaw J. Piestrak, “Design of residue generators and multi-operand modular adders using carry-save adders,” IEEE 10th Symposium on Computer Arithmetic, pp. 100-107, 1991.
[18] R. J. Schalkoff, “Digital image processing and computer vision,” John Wiley & Sons Pub., Inc., New York, 1989.
[19] W. Shen and A. Yavuz Oruç, “Mapping algebraic formulas onto mesh connected processor networks,” Information Sciences and Systems, Princeton University, N. J., pp. 535-538, 1986.
[20] S. P. Smith and H. C. Torng, “Design of a fast inner product processor,” Proc. IEEE 7th Symposium on Computer Arithmetic, pp. 38-43, 1985.
[21] E. E. Swartzlander, Jr., Barry K. Gilbert, and Irving S. Reed, “Inner product computers,” IEEE Trans. Computers, vol. C-27, pp. 21-31, January 1978.
[22] B. F. Wang, C. J. Lu, and G. H. Chen, “Constant time algorithms for the transitive closure problem and its applications,” Proc. IEEE Int. Conf. Parallel Processing, pp. 52-59, 1990.
[23] B. F. Wang, G. H. Chen, and F. C. Lin, “Constant time sorting on a processing array with a reconfigurable bus system,” Information Processing Letters, vol. 34, pp. 187-192, 1990.
[24] B. F. Wang and G. H. Chen, “Constant time algorithms for the transitive closure and some related graph problems on processor arrays with reconfigurable bus systems,” IEEE Trans. Parallel and Distributed Systems, vol. 1, pp. 500-507, October 1990.
[25] B. F. Wang and G. H. Chen, “Two-dimensional processor array with a reconfigurable bus system is at least as powerful as CRCW model,” Information Processing Letters, vol. 36, pp. 31-36, October 1990.
[26] B. F. Wang, G. H. Chen, and Hungwen Li, “Configurational computation: a new computation method on processor arrays with reconfigurable bus systems,” Proc. IEEE Int. Conf. Parallel Processing, vol. III, pp. 42-49, 1991.