CONSTANT TIME INNER PRODUCT AND MATRIX
COMPUTATIONS ON PERMUTATION NETWORK PROCESSORS∗

Ming-Bo Lin
Electronic Engineering Department
National Taiwan Institute of Technology
43, Keelung Road Section 4, Taipei, Taiwan

and

A. Yavuz Oruç
Electrical Engineering Department
Institute of Advanced Computer Studies
University of Maryland, College Park, MD 20742-3025
E-mail: [email protected]
Telephone: (301) 405-3663
ABSTRACT
Inner product and matrix operations find extensive use in algebraic computations. In this paper, we introduce a new parallel computation model, called a permutation network processor, to carry out these computations efficiently. Unlike traditional parallel computer architectures, computations on this model are carried out by composing permutations on permutation networks. We show that the sum of N algebraic numbers on this model can be computed in O(1) time using N processors. We further show that the real and complex inner product and matrix multiplication can both be computed on this model in O(1) time at the cost of O(N) and O(N³), respectively, for N-element vectors and N × N matrices. These results compare well with the time and cost complexities of other high-level parallel computer models such as the PRAM and CRCW PRAM.

Index Terms: complex inner product, complex matrix multiplication, permutation networks, real inner product, and real matrix multiplication.
∗This work is supported in part by the Ministry of Education, Taipei, Taiwan, Republic of China and in part by the Minta Martin Fund of the School of Engineering at the University of Maryland.
1 Introduction
Inner product and matrix operations form the core of computations of vector and
array processors [5,7]. Many algorithms in signal and image processing, pattern
recognition, computer vision, and computer tomography require extensive use of
such operations [5,12,14,18]. Therefore, it is desirable to find novel parallel archi-
tectures to compute these operations fast and efficiently.
Traditional architectures for carrying out such operations are based on reducing
vector computations into scalar operations such as binary addition and multiplica-
tion [20,21]. As a result, much of the computations in vector and array processors
is handled by conventional arithmetic circuits such as carry look ahead adders,
recoded and cellular array multipliers and dividers [6]. While these conventional
circuits are optimized for speed and hardware, they still rely on a variety of building
blocks such as adder, subtractor and multiplier cells which often lead to nonuniform
arithmetic circuits for vector processors.
In this paper, we propose a new concept to carry out vector and matrix computa-
tions. Unlike the traditional architectures, this concept is based on coding not only
the operands but also the operations over the operands in such a way that a vector
or matrix computation reduces to composing permutation maps. Each operand is
coded into a permutation and addition or multiplication of two operands is car-
ried out by composing the permutations that correspond to these operands on a
permutation network. As a result, both addition and multiplication are reduced
to a single computation, i.e., that of composing permutations. Furthermore, any
other computation involving addition, subtraction and multiplication operations
are also reduced to just composing permutations.
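As a minimal software illustration of this idea (a sketch of the coding principle only, not of the paper's hardware), each operand in Z_M can be coded as a cyclic permutation so that composing two coded operands adds them modulo M:

```python
# Toy illustration: encode each operand X in Z_M as the cyclic permutation
# pi^X (rotation by X positions), so that composing two such permutations
# adds the operands modulo M.
M = 16

def encode(x):
    """Return the permutation (as a tuple) mapping j -> (j + x) mod M."""
    return tuple((j + x) % M for j in range(M))

def compose(p, q):
    """Compose permutations: (p o q)(j) = p(q(j))."""
    return tuple(p[q[j]] for j in range(M))

def decode(p):
    """Recover x from pi^x: it is the image of 0 under the permutation."""
    return p[0]

# 5 + 7 + 9 mod 16 = 5, computed purely by composing permutations.
s = compose(encode(9), compose(encode(7), encode(5)))
print(decode(s))  # -> 5
```

Decoding simply reads off the image of 0, since π^x sends 0 to x; the processors described below realize the same composition with networks of shift stages rather than explicit permutation tables.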
We show that, on this new computation model, called a permutation network
processor, the sum of N n-bit numbers, the inner product of two vectors, each
containing N n-bit elements, and the multiplication of two N × N matrices with
n-bit entries can all be computed in O(1) steps. The first two computations require O(N) processors and the matrix multiplication requires O(N³) processors, where each processor handles an n-bit input, and has O((n + lg N)²) bit-level cost and O(n + lg N) bit-level delay.
We note that these results compare well with the complexities for the same computations on other models. For example, on a PRAM model [1,8], all three computations take O(lg N) time with the same numbers of processors, where each processor has two O(n + lg N)-bit inputs, and uses arithmetic circuits with O(n² + lg N) bit-level cost. On a cube-connected parallel computer, the same three computations
level cost. On a cube-connected parallel computer, the same three computations
also take O(lgN) time with the same numbers of processors and with the same pro-
cessor bit-level complexity [1]. On the combining CRCW PRAM, the same three
computations can all be done in O(1) time and with the same numbers of processors, where each processor has two O(n + lg N)-bit inputs, and with O(n² + lg N) bit-level cost. In addition, this model must have a circuit to combine up to N
concurrent writes.
We also note that, even though the permutation network processor model stands
on its own, it ties with some earlier computation models that were reported in the
literature. One such model, called a processing network, was given in [19], where a
mesh of processing elements was used to compute certain algebraic formulas. The
processing elements in this model can be programmed for arithmetic and routing
functions whose combinations lead to various algebraic expressions on the mesh
topology. More recently, a new parallel computer model, called a reconfigurable
bus system, has been introduced to solve a wide range of problems including sorting
problems [23], graph problems [13,3], and string problems [2]. All these problems
have been shown to be solvable in O(1) time on the reconfigurable bus system
model. As in the processing network model, processors are connected in this model
by some fixed topology such as the mesh, and each processor can be programmed
for some data processing as well as routing functions. It is assumed that the
signals can be broadcast between processors in constant time regardless of how far
the broadcast is carried [3,11],[22]-[26]. The essence of this assumption is that
once the processors are simultaneously programmed for some routing functions,
the signals that pass through them only encounter a propagation delay which is
short enough so as to be considered a constant. The same assumption also holds
for our model.
The remainder of this paper is organized as follows. In Section 2, we provide a brief
overview of permutation network processors. In Section 3, we show how real and
complex inner products can be performed on our permutation network processor
in O(1) time. In this section, we also describe how to sum N n-bit signed numbers
in O(1) time on the same model. In Section 4, we use these inner product units
to carry out the multiplication of two N × N matrices in O(1) time using O(N³) permutation network processors. The paper is concluded in Section 5.
2 The Permutation Network Processor Model
The computations to be described in subsequent sections all rely on a permutation
network processor model which was introduced in [10,15,16]. Here we give a brief
overview of this model and introduce some changes so as to make it suitable for
these computations. The reader is referred to these citations for a more detailed
account.
A. The Model
A permutation network processor is obtained by cascading three components to-
gether as shown in Figure 1: an r-out-of-s residue encoder, a permutation network
stage, and an r-out-of-s residue decoder, where r and s are positive integers. The
r-out-of-s residue encoder has m inputs, representing an m-bit number X, and r
sets of outputs, X1, X2, . . . , Xr, where Xi contains mi outputs, for i = 1, 2, . . . , r.
Based on the value of its input X, the r-out-of-s residue encoder sets exactly one
Figure 1: Organization of a permutation network arithmetic processor.
output in each Xi to 1 and all other outputs to 0. More precisely, the jth output in
Xi is set to 1, where j = X mod mi. The r-out-of-s residue decoder is an r-out-of-s
residue encoder whose inputs and outputs are switched. That is, it has r sets of
inputs R1, R2, . . . , Rr, each of which contains a 1 in exactly one of its inputs, and a
set of n outputs that represents an n-bit result. The 1’s in R1, R2, . . . , Rr are com-
bined into an n-bit result whose residues with respect to moduli m1,m2, . . . ,mr
are indicated by the positions of 1’s in R1, R2, . . . , Rr. The variable s specifies the
number of outputs (inputs) of the r-out-of-s residue encoder (decoder). That is,
s = m1 +m2 + . . . +mr.
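The encoder/decoder pair can be sketched in software as follows. The moduli 3, 5, 7 (so s = 15 and M = 105) are the example values used later in Section 3, and the decoder shown reconstructs the result with the Chinese remainder theorem:

```python
# Sketch of the r-out-of-s residue encoder/decoder for pairwise relatively
# prime moduli m1, m2, m3 = 3, 5, 7 (s = 15, M = 105).
moduli = [3, 5, 7]

def residue_encode(x):
    """Set output j = x mod m_i to 1 in each group X_i; all others to 0."""
    groups = []
    for m in moduli:
        one_hot = [0] * m
        one_hot[x % m] = 1
        groups.append(one_hot)
    return groups

def residue_decode(groups):
    """Combine the positions of the 1's back into a number via the CRT."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for m, one_hot in zip(moduli, groups):
        r = one_hot.index(1)          # residue indicated by the 1
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # CRT reconstruction term
    return x % M

print(residue_decode(residue_encode(34)))  # -> 34
```

The modular inverse `pow(Mi, -1, m)` requires Python 3.8 or later; the hardware decoder in [10] realizes the same reconstruction combinationally.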
The center stage of the permutation network arithmetic processor consists of a
permutation network with s inputs and s outputs. In addition to s inputs that
are connected to the outputs of the r-out-of-s residue encoder, the permutation
network also has an n-bit control input that represents the second operand Y to
the arithmetic processor. The permutation network encompasses r subnetworks
N1, N2, . . . , Nr, where the lines in Xi from the r-out-of-s residue encoder form the
inputs of network Ni and the lines in Ri form its outputs. Network Ni consists
of ⌈lg mi⌉ stages of switches, numbered ⌈lg mi⌉ − 1, …, 1, 0 in that order, from left to right, each having mi inputs and mi outputs, such that the switch in stage k, k = 0, 1, …, ⌈lg mi⌉ − 1, has two switching states:

state 0: input j is connected to output j, for all j = 0, 1, …, mi − 1;
state 1: input j is connected to output j + 2^k mod mi, for all j = 0, 1, …, mi − 1.
Thus, the switch in stage k of network Ni realizes either the identity permutation
on its inputs (state 0) or the circular right shift permutation where all inputs are
circularly shifted to the right by 2^k mod mi positions (state 1). The permutations
that are realized by networks N1, N2, …, Nr are determined by the residues yi = Y mod mi; 1 ≤ i ≤ r. The residue yi is computed from Y by the binary residue encoder in Figure 1, which converts Y mod mi to its binary representation. The
kth least significant bit of yi then determines the state of the switch in the kth
stage in network Ni. If that bit is 0 then the stage is set to state 0 and if it is 1
then the stage is set to state 1.
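A behavioral sketch of one subnetwork Ni, with the stage states driven by the bits of yi as just described, shows that routing a one-hot input through the stages realizes modulo-mi addition (this is a simulation of the routing behavior, not a gate-level model):

```python
# Behavioral sketch of subnetwork N_i: ceil(lg m_i) stages, where stage k
# either passes its inputs through (state 0) or circularly shifts them by
# 2^k mod m_i (state 1), the state being bit k of y_i = Y mod m_i.
from math import ceil, log2

def subnetwork(one_hot, y, m):
    """Route a one-hot vector of width m through the shift stages for y."""
    stages = max(1, ceil(log2(m)))
    out = list(one_hot)
    for k in range(stages):
        if (y >> k) & 1:  # stage k in state 1: rotate right by 2^k mod m
            shift = (1 << k) % m
            out = [out[(j - shift) % m] for j in range(m)]
    return out

# A 1 entering on line x leaves on line (x + y) mod m: modulo-m addition.
m, x, y = 5, 3, 4
vec = [0] * m
vec[x] = 1
print(subnetwork(vec, y, m).index(1))  # -> 2, i.e. (3 + 4) mod 5
```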
In most of the computations that follow, we will need to cascade several permu-
tation network processors together. In such cases, the r-out-of-s residue encoders
and decoders are redundant and will be removed from the model. In this reduced
model, we only retain the permutation network and binary residue encoder.
B. The Cost and Time Assumptions
In the reduced permutation network processor model, each processor has an n-
bit input and r simple shift networks with m1, m2, …, mr inputs. These networks, when combined together, provide a numerical range extending from 0 to m1 m2 … mr − 1 for unsigned numbers and from −⌊m1 m2 … mr / 2⌋ to ⌊m1 m2 … mr / 2⌋ for signed numbers in 2's complement form.
Let M = m1m2 . . .mr. It was shown in [10] that a permutation network processor
encompassing r subnetworks with m1,m2, . . . ,mr inputs can be constructed by
using O(lg² M) logic gates. Two of the computations to be implemented on permutation network processors, namely, the sum of N n-bit 2's complement numbers and the inner product of two N-element vectors, each of whose elements is an n-bit 2's complement number, require that N 2^(n−1) ≈ M. Thus, each of the N permutation network processors needed for these two computations can be constructed with O((lg N + n)²) logic gates.
The third computation, i.e., the product of two N × N matrices, can be carried out by performing N² inner product operations. Thus, the matrix multiplication problem can be solved by using N³ permutation network processors, each constructed from O((lg N + n)²) logic gates.
As for the delay of the permutation network processors needed for these three
computations, it was shown in [10] that a permutation network processor encompassing r subnetworks with m1, m2, …, mr inputs has O(lg M) bit-level delay. Given that M ≈ 2^n N, it follows that all three computations mentioned above can be performed by using permutation network processors with O(n + lg N) bit-level delay.
We point out that these bit-level cost and delay complexities are comparable with
those for other parallel computer models. The last two computations require a multiplier circuit for n-bit operands, and this exacts O(n²) bit-level cost to attain O(lg n) bit-level delay regardless of the model used. Given this, in obtaining the cost and time complexities of the algorithms that follow, we will assume that our permutation network processors have constant cost and constant time, where the cost and time are expressed at the word level as in other parallel computer models.
3 Inner Product Processors
In this section, we show how to sum N n-bit numbers and compute real and
complex inner products using permutation network processors.
A. Summation of N n-bit numbers
Assume that N n-bit numbers to be added together are all in 2’s complement
form. This implies that the sum can be as large as 2^(n−1) N and, to avoid a possible overflow, the dynamic range M of the permutation network processor must satisfy 2^(n−1) N ≤ M/2, that is, 2^n N ≤ M.
As described in [10], the set Z_M = {0, 1, …, M − 1} under addition modulo M forms a group which is isomorphic to a cyclic permutation group (⟨π⟩, ·) of order M, generated by a permutation π defined over {0, 1, …, M − 1}. The isomorphism between (Z_M, +_M) and (⟨π⟩, ·) is fixed by mapping a generator of Z_M onto π. Now let M = m1 m2 … mr, where mi and mj are relatively prime for all i ≠ j; 1 ≤ i, j ≤ r. For Z_M, we fix 1 as its generator and let π be the product of r disjoint cycles, π1, π2, …, πr, where

π1 = (0 1 … m1 − 1)

πi = (m1 + … + mi−1   m1 + … + mi−1 + 1   …   m1 + … + mi−1 + mi − 1), 2 ≤ i ≤ r.  (1)
Under the isomorphism fixed by mapping 1 to π, element X ∈ Z_M is mapped to π^X = (π1 π2 … πr)^X or, since π1, π2, …, πr are disjoint, X is mapped to π1^X π2^X … πr^X. As a consequence, the sum X1 + X2 + … + XN modulo M, with Xi ∈ Z_M for 1 ≤ i ≤ N, is mapped to

π^(X1 +_M X2 +_M … +_M XN) = π1^(X1 +_M X2 +_M … +_M XN) π2^(X1 +_M X2 +_M … +_M XN) … πr^(X1 +_M X2 +_M … +_M XN)  (2)
where +_M denotes modulo M addition. Since πi is a cycle of length mi,

π^(X1 +_M X2 +_M … +_M XN) = π1^(X1 +_m1 X2 +_m1 … +_m1 XN) π2^(X1 +_m2 X2 +_m2 … +_m2 XN) … πr^(X1 +_mr X2 +_mr … +_mr XN)  (3)

or, by Horner's rule,

π^(X1 +_M X2 +_M … +_M XN) = π1^((X1 +_m1 X2) +_m1 … +_m1 XN) π2^((X1 +_m2 X2) +_m2 … +_m2 XN) … πr^((X1 +_mr X2) +_mr … +_mr XN),  (4)

where +_mi denotes modulo mi addition; 1 ≤ i ≤ r.
To compute Equation (4), we cascade N permutation network processors together
and each operand Xi; 1 ≤ i ≤ N, is converted into its corresponding residue code (xi1, xi2, …, xir) by using a binary residue encoder such as one of those described in [10]. These converted residue codes (xi1, xi2, …, xir); 1 ≤ i ≤ N, are then used to set up the switching states of the corresponding permutation networks. The complete
algorithm for computing the summation of N n-bit numbers on a permutation
network processor is then given as follows.
Algorithm 1 (Summation of N n-bit 2’s complement numbers)
Input: N n-bit 2’s complement numbers, X1, X2, . . . , XN .
Output: Sum of X1, X2, . . . , XN in 2’s complement form.
Method:
Step 1: Convert the Xi's into their corresponding binary residue codes (xi1, xi2, …, xir),
in parallel.
Step 2: Add X1, X2, . . . , XN .
Step 2.1: Set up the switching states of permutation network processors in
parallel by the binary residue codes obtained in Step 1.
Figure 2: The permutation network processor shown to compute the sum 13 +_105 12 +_105 9 = 34.
Step 2.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1
with a “1” and all the other input lines with a “0.”
Step 3: Decode the r-out-of-s residue code obtained at the outputs of the last
permutation network processor in the cascade into its binary equivalent.
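Algorithm 1 can be mimicked in software as follows. The moduli 3, 5, 7 are the example values of Figure 2, the channel-wise additions stand in for the cascaded permutation networks, and the final signed correction reflects the 2's complement dynamic range discussed above:

```python
# End-to-end sketch of Algorithm 1: sum N n-bit 2's complement numbers by
# adding their residues channel-wise, then decoding with the CRT. The
# moduli are chosen so that 2^n * N <= M, as required to avoid overflow
# (here M = 105 >= 2^5 * 3).
moduli = [3, 5, 7]
M = 105

def crt(residues):
    """Chinese remainder reconstruction of the unsigned result."""
    x = 0
    for m, r in zip(moduli, residues):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

def summation(xs):
    acc = [0] * len(moduli)  # stands in for the cascade of channel adders
    for x in xs:
        acc = [(a + x) % m for a, m in zip(acc, moduli)]
    val = crt(acc)
    return val - M if val > M // 2 else val  # map back to signed range

print(summation([13, 12, 9]))   # -> 34  (the example of Figure 2)
print(summation([13, -12, 9]))  # -> 10  (the example of Figure 3)
```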
Figure 2 shows an example with N = 3, n = 5, and X1 = 13, X2 = 12, and X3 = 9.
The light lines indicate the paths of “1” between inputs and outputs. To avoid a
possible overflow, the dynamic range of the processor is chosen so that it satisfies
Figure 3: The permutation network processor shown to compute 13 −_105 12 +_105 9 = 10.
2^n N ≤ M, that is, 3 × 2^5 ≤ M. Therefore, M is picked as 105 = 3 × 5 × 7. We note that the summands can be negative as well. To illustrate this, Figure 3 depicts
how 13−105 12 +105 9 = 10 is computed on the same network given in Figure 2.
This algorithm requires O(N) processors and since it does not contain any loop
and each step takes O(1) time, the total execution time of this algorithm is O(1).
B. Real Inner-Product Processor
Let X = (X1, X2, . . . , XN) and Y = (Y1, Y2, . . . , YN) be two N -element vectors,
where Xi, Yi ∈ ZM = {0, 1, . . . ,M − 1}. Then the real inner product of X and Y
is a real-valued function defined as

X · Y = Σ_{j=1}^{N} Xj Yj.  (5)
To compute Xj Yj on a permutation network processor, we recall that multiplication modulo M over the set Z_M forms a monoid. Let Z_mi = {0, 1, 2, …, mi − 1} and let (Z_mi, ×_mi) denote the monoid under modulo mi multiplication of elements in Z_mi; 1 ≤ i ≤ r. Let M = m1 m2 … mr, where m1, m2, …, mr are all primes. Let Z_M = Z_m1 ⊗ Z_m2 ⊗ … ⊗ Z_mr denote the direct product of the Z_mi; 1 ≤ i ≤ r, whose elements are r-tuples (x1, x2, …, xr), where xi ∈ Z_mi. For any (x1, x2, …, xr), (y1, y2, …, yr) ∈ Z_M, define

(x1, x2, …, xr) ⊗ (y1, y2, …, yr) = (x1 ×_m1 y1, x2 ×_m2 y2, …, xr ×_mr yr).  (6)

Now we carry the product Xj Yj over onto (Z_M, ⊗) by setting up a monoid isomorphism f between (Z_M, ×_M) and this direct product as follows:

f(Xj) = (xj,1, xj,2, …, xj,r), ∀ Xj ∈ Z_M, 1 ≤ j ≤ N,  (7)
where xj,i = Xj mod mi; 1 ≤ i ≤ r. Let Xj, Yj ∈ ZM . It is easy to show that
f(1) = (1, 1, . . . , 1) and
f(Xj · Yj) = f(Xj)⊗ f(Yj) (8)
and hence f is an isomorphism, and using this isomorphism, we can compute the
real inner product X ·Y in ZM as
X · Y = Σ_{j=1}^{N} Xj Yj

     = Σ_{j=1}^{N} (xj,1, xj,2, …, xj,r) ⊗ (yj,1, yj,2, …, yj,r)

     = (Σ_{j=1}^{N} xj,1 ×_m1 yj,1, Σ_{j=1}^{N} xj,2 ×_m2 yj,2, …, Σ_{j=1}^{N} xj,r ×_mr yj,r).  (9)
Figure 4: Permutation network modulo 5 multiplier (generator g = 2; the example shown computes X = 3 times Y = 3, giving R = 4).
The corresponding permutation network real inner product processor is constructed
as follows. For each modulus, mi, N modulo mi permutation network processors
are cascaded. One operand of each processor comes from the previous stage and
the other operand comes from the output of a modulo mi permutation network
multiplier. An example of a modulo 5 permutation network multiplier is shown in Figure 4, and a detailed description of the modulo mi permutation network multipliers can be found in [10].
The following algorithm shows how the inner product is carried out using such
multipliers.
Algorithm 2 (Computing the real inner product of two vectors)
Input: Two vectors X = (X1, X2, . . . , XN) and Y = (Y1, Y2, . . . , YN), where each
Xi and Yi is an n-bit 2’s complement number.
Output: The real inner product of X and Y represented in 2’s complement form.
Method:
Step 1: Convert each element of X and Y into its corresponding residue code in
parallel.
Step 2: Multiply xji and yji for 1 ≤ j ≤ N and 1 ≤ i ≤ r in parallel.
Step 3: Compute the inner product by adding together the products obtained in
Step 2 over the residue code domain.
Step 3.1: Set up the switching states of permutation network processors in
parallel by the binary residue codes obtained in Step 2.
Step 3.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1
with a “1” and all the other input lines with a “0.”
Step 4: Decode the r-out-of-s residue code obtained at the outputs of the last
permutation network processor in the cascade into its binary equivalent.
An example with N = 3 and n = 3 is shown in Figure 5. It is easy to see that
Algorithm 2 requires O(N) processors and its total execution time is O(1) since
each step takes O(1) time and there is no loop inside the algorithm.
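Algorithm 2 can be sketched in software as below, with Steps 2 and 3 collapsed into one channel-wise pass. The moduli here are illustrative; any pairwise relatively prime set whose product M covers the dynamic range of the result works:

```python
# Sketch of Algorithm 2: the real inner product computed entirely in the
# residue domain -- one modular multiply and one modular add per channel --
# followed by a single CRT decode, as in Equation (9).
moduli = [3, 5, 7, 11, 13]  # illustrative choice; M = 15015
M = 15015

def crt(residues):
    x = 0
    for m, r in zip(moduli, residues):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

def inner_product(xs, ys):
    acc = [0] * len(moduli)
    for x, y in zip(xs, ys):
        # Step 2: channel-wise modular products; Step 3: channel-wise sums.
        acc = [(a + (x % m) * (y % m)) % m for a, m in zip(acc, moduli)]
    val = crt(acc)
    return val - M if val > M // 2 else val  # back to signed range

X = [3, -1, 4, 1]
Y = [2, 7, -1, 8]
print(inner_product(X, Y))  # -> 3
```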
C. Complex Inner-Product Processor
The inner product of two complex vectors U = (U1, U2, . . . , UN) and V = (V1, V2, . . . , VN)
is a complex-valued function defined as follows:
U · V = Σ_{j=1}^{N} Uj Vj*,  (10)

where Uj and Vj; 1 ≤ j ≤ N, are complex elements, and Vj* denotes the complex conjugate of Vj.
Let Uj = Re(Uj) + i Im(Uj) and Vj = Re(Vj) + i Im(Vj), where i = √−1. Then

U · V = Σ_{j=1}^{N} {Re(Uj) + i Im(Uj)}{Re(Vj) − i Im(Vj)}

     = Σ_{j=1}^{N} {Re(Uj) Re(Vj) + Im(Uj) Im(Vj)} + i Σ_{j=1}^{N} {Im(Uj) Re(Vj) − Re(Uj) Im(Vj)}.  (11)
Figure 5: The permutation network processor for computing the real inner product of X and Y (N = 3 and n = 3).
Therefore, the complex inner product can be computed by carrying out four real inner products. Let ⊕ and ⊖ be binary operators defined on Z_M by

(x1, x2, …, xr) ⊕ (y1, y2, …, yr) = (x1 +_m1 y1, x2 +_m2 y2, …, xr +_mr yr)  (12)

(x1, x2, …, xr) ⊖ (y1, y2, …, yr) = (x1 −_m1 y1, x2 −_m2 y2, …, xr −_mr yr),  (13)

where (x1, x2, …, xr), (y1, y2, …, yr) ∈ Z_M. Using the isomorphism constructed in the previous section, each real inner product can then be computed in Z_M as in Equation (9), and hence

U · V = Σ_{j=1}^{N} {Re(uj,1, …, uj,r) ⊗ Re(vj,1, …, vj,r) ⊕ Im(uj,1, …, uj,r) ⊗ Im(vj,1, …, vj,r)}
      + i Σ_{j=1}^{N} {Im(uj,1, …, uj,r) ⊗ Re(vj,1, …, vj,r) ⊖ Re(uj,1, …, uj,r) ⊗ Im(vj,1, …, vj,r)}

    = (Σ_{j=1}^{N} (Re(uj,1) ×_m1 Re(vj,1) +_m1 Im(uj,1) ×_m1 Im(vj,1)), …, Σ_{j=1}^{N} (Re(uj,r) ×_mr Re(vj,r) +_mr Im(uj,r) ×_mr Im(vj,r)))
      + i (Σ_{j=1}^{N} (Im(uj,1) ×_m1 Re(vj,1) −_m1 Re(uj,1) ×_m1 Im(vj,1)), …, Σ_{j=1}^{N} (Im(uj,r) ×_mr Re(vj,r) −_mr Re(uj,r) ×_mr Im(vj,r))).  (14)
The complex inner product processor can then be constructed by computing the
real and imaginary parts on two permutation network real inner product proces-
sors. These real inner product processors require some minor modifications. For
the real part, the actual inner product consists of two real inner products, one
combining the real parts of U and V and the other combining their imaginary
parts. Thus the subnetwork for each modulus in each processor is obtained by
cascading two multiplication and two addition units. For the imaginary part, the
actual inner product also consists of two real inner products, however, these two
inner products are not added together; rather the second one is subtracted from
the first one. Therefore, we cascade one multiplication and addition unit with a
multiplication and subtraction unit. The subtraction unit can be implemented
by a circular left shift permutation network subtracter as described in [10]. The
general structure for the permutation network complex inner product processor is
shown in Figures 6 and 7. Figure 6 is the real part and Figure 7 is the imaginary
part. We note that the sum expressions on the left hand side are just for labeling
the inputs and do not imply any summation.
Algorithm 3 (Computing the complex inner product of two vectors)
Input: Two vectors U = (U1, U2, . . . , UN) and V = (V1, V2, . . . , VN). Uj and Vj
are complex numbers in the form a + ib, where a and b are n-bit 2's complement
numbers.
Output: The complex inner product of U and V represented in the form c + id,
where c and d are 2’s complement numbers.
Method:
Step 1: Convert each element of U and V into its corresponding residue code in
parallel. (Note that the real part and the imaginary part use separate residue encoders.)
Step 2: Compute the real multiplications, Re(uji)Re(vji), Im(uji)Im(vji),
Im(uji)Re(vji), and Re(uji)Im(vji), for 1 ≤ j ≤ N and 1 ≤ i ≤ r in parallel.
Step 3: Compute the real and imaginary parts of the complex inner product by
adding together the corresponding products obtained in Step 2 according to Equation (14) over the residue code domain.

Figure 6: The real part of a permutation network complex inner-product processor.

Figure 7: The imaginary part of a permutation network complex inner-product processor.
Step 3.1: Set up the switching states of permutation network processors
(adders and subtracters) in parallel by the binary residue codes obtained
in Step 2.
Step 3.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1
with a “1” and all the other input lines with a “0.”
Step 4: Decode the r-out-of-s residue codes obtained at the outputs of the last
permutation network processors in the real and imaginary parts into their
binary equivalents.
As in the previous algorithm, it is easy to see that the total execution time of Algorithm 3 is O(1). Its total cost is O(N), as the real and the imaginary parts require N processors each. In more exact terms, for the real part, each processor
consists of two permutation network adders in cascade. For the imaginary part,
each processor consists of a permutation network adder and a permutation network
subtracter. Therefore, a total of 4N processors are required by this algorithm.
4 Matrix Multiplication
In this section, we use permutation network inner product processors to carry out
real and complex matrix multiplications.
A. Real Matrix Multiplication
Let A and B be N × N real matrices and let C = A × B. Let Aj be the jth row of A and Bk be the kth column of B. Then C can be computed by using the following procedure:
For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Cj,k = Aj · Bk
    Endfor
Endfor.
Hence, if we use N² inner product processors in parallel, the multiplication of two
real matrices can be computed in O(1) steps as follows.
Algorithm 4 (Computing the product of two real matrices)

Input: Two N × N real matrices A and B. Each element of A and B is an n-bit 2's complement number.

Output: The real product matrix C = A × B.

Method:
Step 1: Convert each element of A and B into its corresponding residue code in
parallel.
Step 2: Compute the N² inner products using Algorithm 2.
A schematic version of this algorithm is depicted in Figure 8. As can be seen from
the figure, all inner product processors operate in parallel. Since each executes in
O(1) time, Algorithm 4 takes O(1) time to execute. It needs a total of N² inner product processors, each having O(N) cost, and hence its cost is O(N³).
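Algorithm 4 can be sketched in software as below; the two loops would be the N² parallel inner product processors of Figure 8:

```python
# Sketch of Algorithm 4: C = A x B computed as N^2 independent inner
# products (row of A with column of B). On the proposed hardware all N^2
# run in parallel; here they are simply looped.
def inner_product(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def matmul(A, B):
    N = len(A)
    cols = [[B[i][k] for i in range(N)] for k in range(N)]  # columns of B
    return [[inner_product(A[j], cols[k]) for k in range(N)]
            for j in range(N)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # -> [[19, 22], [43, 50]]
```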
B. Complex Matrix Multiplication
Now we consider the multiplication of two complex matrices. Let R = A + iB and S = C + iD be two complex matrices of size N × N. Let T = RS = E + iF be the product matrix. Then
T = RS = (A + iB)(C + iD) = (AC − BD) + i(AD + BC),  (15)
and hence the complex matrix multiplication is reduced to four real matrix multi-
plications, one addition, and one subtraction. Let Aj and Bj denote the jth rows
Figure 8: Real matrix multiplication on permutation network inner-product processors. (Each rectangular box represents a permutation network inner product processor.)
of A and B, respectively, and Ck and Dk denote the kth columns of C and D, respectively. Thus, to compute E = AC − BD, we use the following procedure.
For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Ej,k = Aj · Ck − Bj · Dk
    Endfor
Endfor.
The inner equation is a difference of two inner products. Comparing this to the
imaginary part of the complex inner product, it is easy to see that they have the
same algebraic form. Therefore, matrix E can be computed using Algorithm 4 by
replacing the inner product processor in the algorithm with the imaginary part of
the complex inner product processor shown in Figure 7. This amounts to replacing
each of the inner product processors in Figure 8 by the imaginary inner product
processor given in Figure 7.
Likewise, to compute F = AD + BC, we use the procedure:

For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Fj,k = Aj · Dk + Bj · Ck
    Endfor
Endfor.
The inner equation entails two inner products as in the real part of the complex
inner product processor shown in Figure 6. Thus F can be computed using Algo-
rithm 4.
Combining these facts, we conclude that the multiplication of two N × N complex matrices can be carried out in O(1) time using O(N³) permutation network processors.
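The reduction of Equation (15) can be checked directly in software; the matrices below are arbitrary example values:

```python
# Check of Equation (15): RS = (AC - BD) + i(AD + BC), so one complex
# matrix product reduces to four real matrix products plus one matrix
# addition and one matrix subtraction.
def matmul(A, B):
    N = len(A)
    return [[sum(A[j][i] * B[i][k] for i in range(N)) for k in range(N)]
            for j in range(N)]

def matadd(A, B, sign=1):
    """Entry-wise A + B (sign=1) or A - B (sign=-1)."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A, B = [[1, 2], [3, 4]], [[0, 1], [1, 0]]  # R = A + iB
C, D = [[2, 0], [0, 2]], [[1, 1], [1, 1]]  # S = C + iD
E = matadd(matmul(A, C), matmul(B, D), sign=-1)  # real part: AC - BD
F = matadd(matmul(A, D), matmul(B, C))           # imaginary part: AD + BC
print(E, F)
```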
5 Concluding Remarks
In this paper, we proposed permutation network processors to compute algebraic
sums, inner products, and matrix products. It has been shown that the algebraic sum of N n-bit
2's complement numbers can be computed in O(1) time on such a processor with
O(N) cost. The inner product of two N -element vectors (both real and complex)
with n-bit elements can also be done in O(1) time using a similar processor with
O(N) cost. On the other hand, the matrix multiplication takes O(1) time but with
O(N3) permutation network processors.
These results are important in that they establish that one can avoid using conven-
tional adder and multiplier circuits to carry out vector and matrix computations.
It will be worthwhile to extend them to other computations such as discrete transforms, convolutions, and correlations. We anticipate that all of these
computations can also be done in O(1) time with O(N2) cost, and we plan to present
our results on these computations elsewhere.
References
[1] Selim G. Akl, “The design and analysis of parallel algorithms,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1989.
[2] D. M. Champion and J. Rothstein, “Immediate parallel solution of the longest common subsequence problem,” Proc. IEEE Int. Conf. Parallel Processing, pp. 70-77, 1987.
[3] Gen-Huey Chen, Bing-Feng Wang, and Chi-Jen Lu, “On the parallel computation of the algebraic path problem,” IEEE Trans. Parallel and Distributed Systems, vol. 3, pp. 251-256, March 1992.
[4] K. M. Elleithy, M. A. Bayoumi, and K. P. Lee, “θ(lg N) architecture for RNS arithmetic decoding,” IEEE 9th Symposium on Computer Arithmetic, pp. 202-209, 1989.
[5] John L. Gustafson, “Computer-intensive processors and multicomputers,” in Parallel Processing for Supercomputers and Artificial Intelligence, Kai Hwang and Douglas Degroot, Eds., pp. 169-201, 1989.
[6] Kai Hwang, “Computer arithmetic: principle, architecture, and design,” John Wiley & Sons Pub., 1979.
[7] Kai Hwang and Faye A. Briggs, “Computer architecture and parallel processing,” McGraw-Hill Pub., 1984.
[8] S. Lakshmivarahan and Sudarshan K. Dhall, “Analysis and design of parallel algorithms,” McGraw-Hill Pub., 1990.
[9] Ming-Bo Lin and A. Yavuz Oruç, “The design of a network-based arithmetic processor,” UMIACS-TR-91-141, University of Maryland, College Park, MD, October 1991.
[10] Ming-Bo Lin, “Unified Algebraic Computations on Permutation Networks,” Ph.D. Dissertation, Electrical Engineering Department, University of Maryland, College Park, 1992.
[11] Massimo Maresca and Hungwen Li, “Connection autonomy in SIMD computers: A VLSI implementation,” Journal of Parallel and Distributed Computing, vol. 7, pp. 302-320, 1989.
[12] R. M. Mersereau and D. E. Dudgeon, “Multidimensional signal processing,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1985.
[13] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. F. Stout, “Data movement operations and applications on reconfigurable VLSI arrays,” Proc. IEEE Int. Conf. Parallel Processing, pp. 205-208, 1988.
[14] A. V. Oppenheim and R. W. Schafer, “Digital signal processing,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1975.
[15] A. Yavuz Oruç, Vinod G. J. Peris, and M. Yaman Oruç, “Parallel modular arithmetic on a permutation network,” Proceedings of 1991 International Conference on Parallel Processing, vol. 1, pp. 706-707, St. Charles, IL, Aug. 1991.
[16] Vinod G. J. Peris, “Fast modular arithmetic on permutation networks,” Master's Thesis, Electrical Engineering Department, University of Maryland, College Park, MD, 1992.
[17] Stanislaw J. Piestrak, “Design of residue generators and multi-operand modular adders using carry-save adders,” IEEE 10th Symposium on Computer Arithmetic, pp. 100-107, 1991.
[18] R. J. Schalkoff, “Digital image processing and computer vision,” John Wiley & Sons Pub., Inc., New York, 1989.
[19] W. Shen and A. Yavuz Oruç, “Mapping algebraic formulas onto mesh connected processor networks,” Information Sciences and Systems, Princeton University, N. J., pp. 535-538, 1986.
[20] S. P. Smith and H. C. Torng, “Design of a fast inner product processor,” Proc. IEEE 7th Symposium on Computer Arithmetic, pp. 38-43, 1985.
[21] E. E. Swartzlander, Jr., Barry K. Gilbert, and Irving S. Reed, “Inner product computers,” IEEE Trans. Computers, vol. C-27, pp. 21-31, January 1978.
[22] B. F. Wang, C. J. Lu, and G. H. Chen, “Constant time algorithms for the transitive closure problem and its applications,” Proc. IEEE Int. Conf. Parallel Processing, pp. 52-59, 1990.
[23] B. F. Wang, G. H. Chen, and F. C. Lin, “Constant time sorting on a processing array with a reconfigurable bus system,” Information Processing Letters, vol. 34, pp. 187-192, 1990.
[24] B. F. Wang and G. H. Chen, “Constant time algorithms for the transitive closure and some related graph problems on processor arrays with reconfigurable bus systems,” IEEE Trans. Parallel and Distributed Systems, vol. 1, pp. 500-507, October 1990.
[25] B. F. Wang and G. H. Chen, “Two-dimensional processor array with a reconfigurable bus system is at least as powerful as CRCW model,” Information Processing Letters, vol. 36, pp. 31-36, October 1990.
[26] B. F. Wang, G. H. Chen, and Hungwen Li, “Configurational computation: a new computation method on processor arrays with reconfigurable bus systems,” Proc. IEEE Int. Conf. Parallel Processing, vol. III, pp. 42-49, 1991.