
CONSTANT TIME INNER PRODUCT AND MATRIX COMPUTATIONS ON PERMUTATION NETWORK PROCESSORS *

Ming-Bo Lin
Electronic Engineering Department
National Taiwan Institute of Technology
43, Keelung Road Section 4, Taipei, Taiwan

and

A. Yavuz Oruç
Electrical Engineering Department
Institute of Advanced Computer Studies
University of Maryland, College Park, MD 20742-3025
E-mail: [email protected]
Telephone: (301) 405-3663

ABSTRACT

Inner product and matrix operations find extensive use in algebraic computations. In this paper, we introduce a new parallel computation model, called a permutation network processor, to carry out these computations efficiently. Unlike the traditional parallel computer architectures, computations on this model are carried out by composing permutations on permutation networks. We show that the sum of N algebraic numbers on this model can be computed in O(1) time using N processors. We further show that the real and complex inner product and matrix multiplication can both be computed on this model in O(1) time at the cost of O(N) and O(N^3), respectively, for N-element vectors and N × N matrices. These results compare well with the time and cost complexities of other high-level parallel computer models such as PRAM and CRCW PRAM.

Index Terms: complex inner product, complex matrix multiplication, permutation networks, real inner product, and real matrix multiplication.

* This work is supported in part by the Ministry of Education, Taipei, Taiwan, Republic of China and in part by the Minta Martin Fund of the School of Engineering at the University of Maryland.

1 Introduction

    Inner product and matrix operations form the core of computations of vector and

    array processors [5,7]. Many algorithms in signal and image processing, pattern

    recognition, computer vision, and computer tomography require extensive use of

    such operations [5,12,14,18]. Therefore, it is desirable to find novel parallel archi-

    tectures to compute these operations fast and efficiently.

    Traditional architectures for carrying out such operations are based on reducing

    vector computations into scalar operations such as binary addition and multiplica-

    tion [20,21]. As a result, much of the computations in vector and array processors

    is handled by conventional arithmetic circuits such as carry look ahead adders,

    recoded and cellular array multipliers and dividers [6]. While these conventional

    circuits are optimized for speed and hardware, they still rely on a variety of building

    blocks such as adder, subtractor and multiplier cells which often lead to nonuniform

    arithmetic circuits for vector processors.

    In this paper, we propose a new concept to carry out vector and matrix computa-

    tions. Unlike the traditional architectures, this concept is based on coding not only

    the operands but also the operations over the operands in such a way that a vector

    or matrix computation reduces to composing permutation maps. Each operand is

    coded into a permutation and addition or multiplication of two operands is car-

    ried out by composing the permutations that correspond to these operands on a

    permutation network. As a result, both addition and multiplication are reduced

    to a single computation, i.e., that of composing permutations. Furthermore, any

    other computation involving addition, subtraction and multiplication operations

is also reduced to just composing permutations.

    We show that, on this new computation model, called a permutation network

    processor, the sum of N n-bit numbers, the inner product of two vectors, each

    containing N n-bit elements, and the multiplication of two N × N matrices with


n-bit entries can all be computed in O(1) steps. The first two computations require O(N) processors and the matrix multiplication requires O(N^3) processors, where each processor handles an n-bit input, and has O((n + lg N)^2) bit-level cost and O(n + lg N) bit-level delay.

    We note that these results compare well with the complexities for the same compu-

    tations on other models. For example, on a PRAM model [1,8], all three computa-

    tions take O(lgN) time with the same numbers of processors, where each processor

has two O(n + lg N)-bit inputs, and uses arithmetic circuits with O(n^2 + lg N) bit-level cost. On a cube-connected parallel computer, the same three computations

    also take O(lgN) time with the same numbers of processors and with the same pro-

    cessor bit-level complexity [1]. On the combining CRCW PRAM, the same three

    computations can all be done in O(1) time and with the same numbers of proces-

sors, where each processor has two O(n + lg N)-bit inputs, and with O(n^2 + lg N) bit-level cost. In addition, this model must have a circuit to combine up to N

    concurrent writes.

    We also note that, even though the permutation network processor model stands

    on its own, it ties with some earlier computation models that were reported in the

literature. One such model, called a processing network, was given in [19], where a

    mesh of processing elements was used to compute certain algebraic formulas. The

    processing elements in this model can be programmed for arithmetic and routing

    functions whose combinations lead to various algebraic expressions on the mesh

    topology. More recently, a new parallel computer model, called a reconfigurable

    bus system, has been introduced to solve a wide range of problems including sorting

    problems [23], graph problems [13,3], and string problems [2]. All these problems

    have been shown to be solvable in O(1) time on the reconfigurable bus system

    model. As in the processing network model, processors are connected in this model

    by some fixed topology such as the mesh, and each processor can be programmed

    for some data processing as well as routing functions. It is assumed that the


signals can be broadcast between processors in constant time regardless of how far

    the broadcast is carried [3,11],[22]-[26]. The essence of this assumption is that

    once the processors are simultaneously programmed for some routing functions,

    the signals that pass through them only encounter a propagation delay which is

    short enough so as to be considered a constant. The same assumption also holds

    for our model.

    The remainder of this paper is organized as follows. In Section 2, we provide a brief

    overview of permutation network processors. In Section 3, we show how real and

    complex inner products can be performed on our permutation network processor

    in O(1) time. In this section, we also describe how to sum N n-bit signed numbers

    in O(1) time on the same model. In Section 4, we use these inner product units

to carry out the multiplication of two N × N matrices in O(1) time using O(N^3) permutation network processors. The paper is concluded in Section 5.

    2 The Permutation Network Processor Model

    The computations to be described in subsequent sections all rely on a permutation

    network processor model which was introduced in [10,15,16]. Here we give a brief

    overview of this model and introduce some changes so as to make it suitable for

    these computations. The reader is referred to these citations for a more detailed

    account.

    A. The Model

    A permutation network processor is obtained by cascading three components to-

    gether as shown in Figure 1: an r-out-of-s residue encoder, a permutation network

    stage, and an r-out-of-s residue decoder, where r and s are positive integers. The

    r-out-of-s residue encoder has m inputs, representing an m-bit number X, and r

    sets of outputs, X1, X2, . . . , Xr, where Xi contains mi outputs, for i = 1, 2, . . . , r.

    Based on the value of its input X, the r-out-of-s residue encoder sets exactly one

output in each Xi to 1 and all other outputs to 0. More precisely, the jth output in Xi is set to 1, where j = X mod mi.

Figure 1: Organization of a permutation network arithmetic processor.

The r-out-of-s residue decoder is an r-out-of-s

    residue encoder whose inputs and outputs are switched. That is, it has r sets of

    inputs R1, R2, . . . , Rr, each of which contains a 1 in exactly one of its inputs, and a

    set of n outputs that represents an n-bit result. The 1’s in R1, R2, . . . , Rr are com-

    bined into an n-bit result whose residues with respect to moduli m1,m2, . . . ,mr

    are indicated by the positions of 1’s in R1, R2, . . . , Rr. The variable s specifies the

    number of outputs (inputs) of the r-out-of-s residue encoder (decoder). That is,

    s = m1 +m2 + . . . +mr.
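To make the encoder and decoder concrete, the sketch below is a plain Python model we add for illustration (it is not part of the original paper): the encoder produces one one-hot vector of length mi per modulus with the 1 at position X mod mi, and the decoder reads the positions of the 1's and reconstructs the result by the Chinese Remainder Theorem. The function names and the example moduli are our own choices.

```python
from math import prod

def residue_encode(x, moduli):
    """r-out-of-s encoder: one one-hot vector of length m_i per modulus,
    with the bit at position x mod m_i set to 1."""
    return [[1 if j == x % m else 0 for j in range(m)] for m in moduli]

def residue_decode(one_hots, moduli):
    """r-out-of-s decoder: read the residues off the one-hot vectors and
    rebuild the unique value modulo M = m_1*m_2*...*m_r (Chinese Remainder Theorem)."""
    M = prod(moduli)
    result = 0
    for vec, m in zip(one_hots, moduli):
        r = vec.index(1)                    # residue of the result modulo m
        Mi = M // m
        result = (result + r * Mi * pow(Mi, -1, m)) % M   # CRT term
    return result

# Example with the moduli used later in the paper: 3, 5, 7, so s = 15 and M = 105.
moduli = [3, 5, 7]
print(residue_decode(residue_encode(34, moduli), moduli))   # -> 34
```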

    The center stage of the permutation network arithmetic processor consists of a

    permutation network with s inputs and s outputs. In addition to s inputs that

    are connected to the outputs of the r-out-of-s residue encoder, the permutation

    network also has an n-bit control input that represents the second operand Y to

    the arithmetic processor. The permutation network encompasses r subnetworks


N1, N2, . . . , Nr, where the lines in Xi from the r-out-of-s residue encoder form the

    inputs of network Ni and the lines in Ri form its outputs. Network Ni consists

of ⌈lg mi⌉ stages of switches, numbered ⌈lg mi⌉ − 1, . . . , 1, 0 in that order, from left to right, each having mi inputs and mi outputs such that the switch in stage k, k = 0, 1, . . . , ⌈lg mi⌉ − 1, has two switching states:

state 0: input j is connected to output j, for all j = 0, 1, . . . , mi − 1;
state 1: input j is connected to output (j + 2^k) mod mi, for all j = 0, 1, . . . , mi − 1.

    Thus, the switch in stage k of network Ni realizes either the identity permutation

    on its inputs (state 0) or the circular right shift permutation where all inputs are

circularly shifted to the right by 2^k mod mi positions (state 1). The permutations

    that are realized by networks N1, N2, . . . , Nr are determined by the residues, yi =

Y mod mi, 1 ≤ i ≤ r. The residue yi is computed from Y by the binary residue encoder in Figure 1, which converts Y mod mi to its binary representation. The

    kth least significant bit of yi then determines the state of the switch in the kth

    stage in network Ni. If that bit is 0 then the stage is set to state 0 and if it is 1

    then the stage is set to state 1.
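The following small simulation (ours, not the paper's) may help visualize one subnetwork Ni: a one-hot vector of length mi passes through ⌈lg mi⌉ stages, stage k is put in state 1 exactly when bit k of yi = Y mod mi is 1, and an active stage rotates every input right by 2^k mod mi positions. The net effect is a rotation by yi, i.e., the residue carried by the one-hot input is increased by Y modulo mi.

```python
from math import ceil, log2

def subnetwork_add(one_hot, y_i):
    """Simulate subnetwork N_i: ceil(lg m_i) stages, where stage k performs a
    circular right shift by 2^k mod m_i whenever bit k of y_i is 1."""
    m_i = len(one_hot)
    stages = max(1, ceil(log2(m_i)))
    out = list(one_hot)
    for k in range(stages):
        if (y_i >> k) & 1:                       # stage k set to state 1
            shift = (1 << k) % m_i
            out = [out[(j - shift) % m_i] for j in range(m_i)]   # rotate right
    return out

# The single 1 starts at position X mod m_i and ends at (X + Y) mod m_i.
m_i, X, Y = 5, 3, 4
start = [1 if j == X % m_i else 0 for j in range(m_i)]
print(subnetwork_add(start, Y % m_i).index(1))   # -> 2, i.e. (3 + 4) mod 5
```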

    In most of the computations that follow, we will need to cascade several permu-

    tation network processors together. In such cases, the r-out-of-s residue encoders

    and decoders are redundant and will be removed from the model. In this reduced

    model, we only retain the permutation network and binary residue encoder.

    B. The Cost and Time Assumptions

    In the reduced permutation network processor model, each processor has an n-

    bit input and r simple shift networks with m1,m2, . . . ,mr inputs. These net-

    works when combined together provide a numerical range extending from 0 to

m1 m2 · · · mr − 1 for unsigned numbers and from −⌊m1 m2 · · · mr/2⌋ to ⌊m1 m2 · · · mr/2⌋ for signed numbers in 2's complement form.

    Let M = m1m2 . . .mr. It was shown in [10] that a permutation network processor


encompassing r subnetworks with m1, m2, . . . , mr inputs can be constructed by

using O(lg^2 M) logic gates. Two of the computations to be implemented on permutation network processors, namely, the sum of N n-bit 2's complement numbers and the inner product of two N-element vectors each of whose elements is an n-bit 2's complement number, require that N · 2^(n−1) ≈ M. Thus, each of the N permutation network processors needed for these two computations can be constructed with O((lg N + n)^2) logic gates.

The third computation, i.e., the product of two N × N matrices, can be carried out by performing N^2 inner product operations. Thus, the matrix multiplication problem can be solved by using N^3 permutation network processors, each constructed from O((lg N + n)^2) logic gates.

    As for the delay of the permutation network processors needed for these three

    computations, it was shown in [10] that a permutation network processor encom-

    passing r subnetworks with m1,m2, . . . ,mr inputs has O(lg lgM) bit-level delay.

Given that M ≈ 2^n N, it follows that all three computations mentioned above can be performed by using permutation network processors with O(n + lg N) bit-level delay.

    We point out that these bit-level cost and delay complexities are comparable with

    those for other parallel computer models. The last two computations require a

multiplier circuit for n-bit operands and this exacts O(n^2) bit-level cost to attain

    O(lg n) bit-level delay regardless of the model used. Given this, in obtaining the

cost and time complexities of the algorithms that follow, we will assume that our

    permutation network processors have constant cost and constant time where the

    cost and time are expressed in word level as in other parallel computer models.


3 Inner Product Processors

    In this section, we show how to sum N n-bit numbers and compute real and

    complex inner products using permutation network processors.

    A. Summation of N n-bit numbers

    Assume that N n-bit numbers to be added together are all in 2’s complement

form. This implies that the sum can be as large as 2^(n−1) N, and to avoid a possible overflow, the dynamic range M of the permutation network processor must satisfy 2^(n−1) N ≤ M/2, that is, 2^n N ≤ M.
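As an illustration of this range requirement (our helper, not from the paper), one can collect primes until their product reaches 2^n N; starting the search at 3 reproduces the moduli {3, 5, 7} used in the paper's summation example.

```python
def is_prime(p):
    return p > 1 and all(p % d for d in range(2, int(p ** 0.5) + 1))

def choose_moduli(n, N, first=3):
    """Collect distinct primes (hence pairwise relatively prime moduli),
    smallest first, until M = m_1 * m_2 * ... * m_r satisfies 2^n * N <= M."""
    target = (1 << n) * N
    moduli, M, p = [], 1, first
    while M < target:
        if is_prime(p):
            moduli.append(p)
            M *= p
        p += 1
    return moduli, M

# N = 3 five-bit summands need 2^5 * 3 = 96 <= M; this yields {3, 5, 7}, M = 105.
print(choose_moduli(5, 3))   # -> ([3, 5, 7], 105)
```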

As described in [10], the set ZM = {0, 1, . . . , M − 1} under addition modulo M forms a group which is isomorphic to a cyclic permutation group (⟨π⟩, ·) of order M and generated by a permutation π defined over {0, 1, . . . , M − 1}. The isomorphism between (ZM, +M) and (⟨π⟩, ·) is fixed by mapping a generator of ZM onto π. Now let M = m1 m2 · · · mr, where mi and mj are relatively prime for all i ≠ j, 1 ≤ i, j ≤ r. For ZM, we fix 1 as its generator and let π be the product of r disjoint cycles,

    π1, π2, . . . , πr, where

π1 = (0 1 · · · m1 − 1)

πi = (m1 + m2 + · · · + mi−1    m1 + m2 + · · · + mi−1 + 1    · · ·    m1 + m2 + · · · + mi−1 + mi − 1),    1 ≤ i ≤ r.     (1)

Under the isomorphism fixed by mapping 1 to π, element X ∈ ZM is mapped to π^X = (π1 π2 · · · πr)^X or, since π1, π2, . . . , πr are disjoint, X is mapped to π1^X π2^X · · · πr^X. As a consequence, the sum X1 + X2 + · · · + XN modulo M, Xi ∈ ZM for 1 ≤ i ≤ N, is mapped to

π^(X1 +M X2 +M · · · +M XN) = π1^(X1 +M X2 +M · · · +M XN) π2^(X1 +M X2 +M · · · +M XN) · · · πr^(X1 +M X2 +M · · · +M XN)     (2)


where +M denotes modulo M addition. Since πi is a cycle of length mi,

π^(X1 +M X2 +M · · · +M XN) = π1^(X1 +m1 X2 +m1 · · · +m1 XN) π2^(X1 +m2 X2 +m2 · · · +m2 XN) · · · πr^(X1 +mr X2 +mr · · · +mr XN)     (3)

    or by Horner’s rule

π^(X1 +M X2 +M · · · +M XN) = π1^((X1 +m1 X2) +m1 · · · +m1 XN) π2^((X1 +m2 X2) +m2 · · · +m2 XN) · · · πr^((X1 +mr X2) +mr · · · +mr XN),     (4)

where +mi denotes modulo mi addition, 1 ≤ i ≤ r.

    To compute Equation (4), we cascade N permutation network processors together

and each operand Xi, 1 ≤ i ≤ N, is converted into its corresponding residue code (xi1, xi2, . . . , xir) by using a binary residue encoder such as one of those described in [10]. These converted residue codes (xi1, xi2, . . . , xir), 1 ≤ i ≤ N, are then used to set up the switching states of the corresponding permutation networks. The complete

    algorithm for computing the summation of N n-bit numbers on a permutation

    network processor is then given as follows.

    Algorithm 1 (Summation of N n-bit 2’s complement numbers)

    Input: N n-bit 2’s complement numbers, X1, X2, . . . , XN .

    Output: Sum of X1, X2, . . . , XN in 2’s complement form.

    Method:

Step 1: Convert the Xi's into their corresponding binary residue codes (xi1, xi2, . . . , xir),

    in parallel.

    Step 2: Add X1, X2, . . . , XN .

    Step 2.1: Set up the switching states of permutation network processors in

    parallel by the binary residue codes obtained in Step 1.

Figure 2: The permutation network processor shown to compute the sum 13 +105 12 +105 9 = 34 (+105 denotes addition modulo 105).

    Step 2.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1

    with a “1” and all the other input lines with a “0.”

    Step 3: Decode the r-out-of-s residue code obtained at the outputs of the last

    permutation network processor in the cascade into its binary equivalent.

    Figure 2 shows an example with N = 3, n = 5, and X1 = 13, X2 = 12, and X3 = 9.

    The light lines indicate the paths of “1” between inputs and outputs. To avoid a

    possible overflow, the dynamic range of the processor is chosen so that it satisfies

2^n N ≤ M, that is, 3 × 2^5 ≤ M. Therefore, M is picked as 105 = 3 × 5 × 7. We note that the summands can be negative as well. To illustrate this, Figure 3 depicts how 13 −105 12 +105 9 = 10 is computed on the same network given in Figure 2.

Figure 3: The permutation network processor shown to compute 13 −105 12 +105 9 = 10.

    This algorithm requires O(N) processors and since it does not contain any loop

    and each step takes O(1) time, the total execution time of this algorithm is O(1).
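For reference, the sketch below is a plain software rendering (ours) of Algorithm 1: each summand is reduced to its binary residue code, the residues are accumulated modulo each mi (the work performed by the cascaded permutation networks), and the final residue vector is decoded by the Chinese Remainder Theorem and folded back into the signed range.

```python
from math import prod

def signed_sum(xs, moduli):
    """Algorithm 1 in software: residue-encode, add per modulus, CRT-decode,
    then map [0, M) back onto the signed 2's complement range."""
    M = prod(moduli)
    # Step 1: binary residue codes of the summands (negative values wrap mod m_i).
    codes = [[x % m for m in moduli] for x in xs]
    # Step 2: the cascaded modular additions, one chain per modulus.
    sums = [sum(code[i] for code in codes) % m for i, m in enumerate(moduli)]
    # Step 3: decode the r-out-of-s result (CRT) and restore the sign.
    total = 0
    for r, m in zip(sums, moduli):
        Mi = M // m
        total = (total + r * Mi * pow(Mi, -1, m)) % M
    return total - M if total > M // 2 else total

moduli = [3, 5, 7]                        # M = 105 covers three 5-bit summands
print(signed_sum([13, 12, 9], moduli))    # -> 34  (Figure 2)
print(signed_sum([13, -12, 9], moduli))   # -> 10  (Figure 3)
```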

    B. Real Inner-Product Processor

    Let X = (X1, X2, . . . , XN) and Y = (Y1, Y2, . . . , YN) be two N -element vectors,

    where Xi, Yi ∈ ZM = {0, 1, . . . ,M − 1}. Then the real inner product of X and Y


is a real-valued function defined as

X · Y = ∑_{j=1}^{N} Xj Yj.     (5)

    To compute XjYj on a permutation network processor, we recall that the multi-

plication modulo M over the set ZM forms a monoid. Let Zmi = {0, 1, 2, . . . , mi − 1} and let (Zmi, ×mi) denote the monoid under modulo mi multiplication of elements in Zmi, 1 ≤ i ≤ r. Let M = m1 m2 · · · mr, where m1, m2, . . . , mr are all primes. Let ZM = Zm1 ⊗ Zm2 ⊗ · · · ⊗ Zmr denote the direct product of the Zmi, 1 ≤ i ≤ r, whose elements are r-tuples (x1, x2, . . . , xr), where xi ∈ Zmi. For any (x1, x2, . . . , xr), (y1, y2, . . . , yr) ∈ ZM define

(x1, x2, . . . , xr) ⊗ (y1, y2, . . . , yr) = (x1 ×m1 y1, x2 ×m2 y2, . . . , xr ×mr yr).     (6)

Now we carry over the product XjYj onto (ZM, ⊗) by setting up a monoid isomorphism f between (ZM, ×M) and (ZM, ⊗) as follows:

    f(Xj) = (xj,1, xj,2, . . . , xj,r), ∀Xj ∈ ZM , 1 ≤ j ≤ N (7)

    where xj,i = Xj mod mi; 1 ≤ i ≤ r. Let Xj, Yj ∈ ZM . It is easy to show that

    f(1) = (1, 1, . . . , 1) and

    f(Xj · Yj) = f(Xj)⊗ f(Yj) (8)

    and hence f is an isomorphism, and using this isomorphism, we can compute the

    real inner product X ·Y in ZM as

X · Y = ∑_{j=1}^{N} Xj Yj
      = ∑_{j=1}^{N} (xj,1, xj,2, . . . , xj,r) ⊗ (yj,1, yj,2, . . . , yj,r)
      = (∑_{j=1}^{N} xj,1 ×m1 yj,1, ∑_{j=1}^{N} xj,2 ×m2 yj,2, . . . , ∑_{j=1}^{N} xj,r ×mr yj,r).     (9)

Figure 4: Permutation network modulo 5 multiplier.

    The corresponding permutation network real inner product processor is constructed

    as follows. For each modulus, mi, N modulo mi permutation network processors

    are cascaded. One operand of each processor comes from the previous stage and

    the other operand comes from the output of a modulo mi permutation network

multiplier. An example of a modulo 5 permutation network multiplier is shown in Figure 4, and a detailed description of the modulo mi permutation network multiplier can be found in [10].
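The internals of the modulo mi multiplier are given in [10]; the sketch below is only our reading of the idea suggested by Figure 4, and its details (the index table, the zero gating) are assumptions rather than the paper's construction: for a prime modulus the nonzero residues form a cyclic group with a generator g (g = 2 for m = 5), so multiplication becomes addition of exponents modulo m − 1, which a shift network can realize, while a zero operand forces a zero result.

```python
def modular_multiply(x, y, m, g):
    """Multiply x * y mod a prime m by index addition: map nonzero residues to
    exponents of the generator g, add the exponents mod (m - 1), and map back.
    A zero operand is detected separately (the AND/OR gating in Figure 4)."""
    if x == 0 or y == 0:
        return 0
    index = {pow(g, k, m): k for k in range(m - 1)}   # discrete-log table for g
    k = (index[x % m] + index[y % m]) % (m - 1)       # exponent addition = rotation
    return pow(g, k, m)

# Figure 4's example: m = 5 with generator 2, X = 3 and Y = 3 give R = 4.
print(modular_multiply(3, 3, 5, 2))   # -> 4
```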

    The following algorithm shows how the inner product is carried out using such

    multipliers.

    Algorithm 2 (Computing the real inner product of two vectors)

    Input: Two vectors X = (X1, X2, . . . , XN) and Y = (Y1, Y2, . . . , YN), where each

    Xi and Yi is an n-bit 2’s complement number.

    Output: The real inner product of X and Y represented in 2’s complement form.

    Method:

    Step 1: Convert each element of X and Y into its corresponding residue code in

    parallel.


Step 2: Multiply xji and yji for 1 ≤ j ≤ N and 1 ≤ i ≤ r in parallel.

    Step 3: Compute the inner product by adding together the products obtained in

    Step 2 over the residue code domain.

    Step 3.1: Set up the switching states of permutation network processors in

    parallel by the binary residue codes obtained in Step 2.

    Step 3.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1

    with a “1” and all the other input lines with a “0.”

    Step 4: Decode the r-out-of-s residue code obtained at the outputs of the last

    permutation network processor in the cascade into its binary equivalent.

    An example with N = 3 and n = 3 is shown in Figure 5. It is easy to see that

    Algorithm 2 requires O(N) processors and its total execution time is O(1) since

    each step takes O(1) time and there is no loop inside the algorithm.
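A software model of Algorithm 2 (again ours, for illustration) follows Equation (9) directly: the elements are residue-encoded, the products are formed and accumulated independently modulo each mi, and the resulting residue vector is CRT-decoded into a signed value.

```python
from math import prod

def residue_inner_product(xs, ys, moduli):
    """Algorithm 2 in software: componentwise modular multiplication, the
    modular accumulations of Algorithm 1, and a final CRT decode (Equation (9))."""
    M = prod(moduli)
    sums = [sum((x % m) * (y % m) for x, y in zip(xs, ys)) % m for m in moduli]
    total = 0
    for r, m in zip(sums, moduli):
        Mi = M // m
        total = (total + r * Mi * pow(Mi, -1, m)) % M
    return total - M if total > M // 2 else total

# A small example in the range of Figure 5 (N = 3, n = 3, moduli 3, 5, 7).
print(residue_inner_product([1, 2, 3], [4, 5, 6], [3, 5, 7]))   # -> 32
```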

    C. Complex Inner-Product Processor

    The inner product of two complex vectors U = (U1, U2, . . . , UN) and V = (V1, V2, . . . , VN)

    is a complex-valued function defined as follows:

U · V = ∑_{j=1}^{N} Uj Vj*,     (10)

where Uj and Vj, 1 ≤ j ≤ N, are complex elements, and Vj* denotes the complex conjugate of Vj.

Let Uj = Re(Uj) + i Im(Uj) and Vj = Re(Vj) + i Im(Vj), where i = √−1. Then

U · V = ∑_{j=1}^{N} {Re(Uj) + i Im(Uj)}{Re(Vj) − i Im(Vj)}
      = ∑_{j=1}^{N} {Re(Uj)Re(Vj) + Im(Uj)Im(Vj)} + i ∑_{j=1}^{N} {Im(Uj)Re(Vj) − Re(Uj)Im(Vj)}.     (11)

Figure 5: The permutation network processor for computing the real inner product of X and Y (N = 3 and n = 3).


Therefore, the complex inner product can be computed by carrying out four real inner products. Let ⊕ and ⊖ be binary operators defined on ZM by

(x1, x2, . . . , xr) ⊕ (y1, y2, . . . , yr) = (x1 +m1 y1, x2 +m2 y2, . . . , xr +mr yr)     (12)

(x1, x2, . . . , xr) ⊖ (y1, y2, . . . , yr) = (x1 −m1 y1, x2 −m2 y2, . . . , xr −mr yr),     (13)

where (x1, x2, . . . , xr), (y1, y2, . . . , yr) ∈ ZM. Using the isomorphism constructed in the previous section, each real inner product can then be computed in ZM as in Equation (9), and hence

U · V = ∑_{j=1}^{N} {Re(uj,1, uj,2, . . . , uj,r) ⊗ Re(vj,1, vj,2, . . . , vj,r) ⊕ Im(uj,1, uj,2, . . . , uj,r) ⊗ Im(vj,1, vj,2, . . . , vj,r)}
        + i ∑_{j=1}^{N} {Im(uj,1, uj,2, . . . , uj,r) ⊗ Re(vj,1, vj,2, . . . , vj,r) ⊖ Re(uj,1, uj,2, . . . , uj,r) ⊗ Im(vj,1, vj,2, . . . , vj,r)}

      = (∑_{j=1}^{N} (Re(uj,1) ×m1 Re(vj,1) +m1 Im(uj,1) ×m1 Im(vj,1)), ∑_{j=1}^{N} (Re(uj,2) ×m2 Re(vj,2) +m2 Im(uj,2) ×m2 Im(vj,2)), . . . , ∑_{j=1}^{N} (Re(uj,r) ×mr Re(vj,r) +mr Im(uj,r) ×mr Im(vj,r)))
        + i (∑_{j=1}^{N} (Im(uj,1) ×m1 Re(vj,1) −m1 Re(uj,1) ×m1 Im(vj,1)), ∑_{j=1}^{N} (Im(uj,2) ×m2 Re(vj,2) −m2 Re(uj,2) ×m2 Im(vj,2)), . . . , ∑_{j=1}^{N} (Im(uj,r) ×mr Re(vj,r) −mr Re(uj,r) ×mr Im(vj,r))).     (14)

    The complex inner product processor can then be constructed by computing the

    real and imaginary parts on two permutation network real inner product proces-


sors. These real inner product processors require some minor modifications. For

    the real part, the actual inner product consists of two real inner products, one

    combining the real parts of U and V and the other combining their imaginary

    parts. Thus the subnetwork for each modulus in each processor is obtained by

    cascading two multiplication and two addition units. For the imaginary part, the

    actual inner product also consists of two real inner products, however, these two

    inner products are not added together; rather the second one is subtracted from

    the first one. Therefore, we cascade one multiplication and addition unit with a

    multiplication and subtraction unit. The subtraction unit can be implemented

    by a circular left shift permutation network subtracter as described in [10]. The

    general structure for the permutation network complex inner product processor is

    shown in Figures 6 and 7. Figure 6 is the real part and Figure 7 is the imaginary

    part. We note that the sum expressions on the left hand side are just for labeling

    the inputs and do not imply any summation.

    Algorithm 3 (Computing the complex inner product of two vectors)

    Input: Two vectors U = (U1, U2, . . . , UN) and V = (V1, V2, . . . , VN). Uj and Vj

    are complex numbers in the form: a+ ib, where a and b are n-bit 2’s complement

    numbers.

    Output: The complex inner product of U and V represented in a form c + id,

    where c and d are 2’s complement numbers.

    Method:

    Step 1: Convert each element of U and V into its corresponding residue code in

    parallel. (Note that the real part and imaginary part use separate residue

    decoders.)

    Step 2: Compute the real multiplications, Re(uji)Re(vji), Im(uji)Im(vji),

    Im(uji)Re(vji), and Re(uji)Im(vji), for 1 ≤ j ≤ N and 1 ≤ i ≤ r in parallel.

Figure 6: The real part of a permutation network complex inner-product processor.

Figure 7: The imaginary part of a permutation network complex inner-product processor.

Step 3: Compute the real and imaginary parts of the complex inner product by adding together the corresponding products obtained in Step 2 according to Equation (14) over the residue code domain.

    Step 3.1: Set up the switching states of permutation network processors

    (adders and subtracters) in parallel by the binary residue codes obtained

    in Step 2.

    Step 3.2: Feed all input lines marked 0,m1,m1+m2, . . . ,m1+m2+. . .+mr−1

    with a “1” and all the other input lines with a “0.”

    Step 4: Decode the r-out-of-s residue codes obtained at the outputs of the last

    permutation network processors in the real and imaginary parts into their

    binary equivalents.

As in the previous algorithm, it is easy to see that the total execution time of Algorithm 3 is O(1). Its total cost is O(N), as the real part and the imaginary part each require N processors. In more exact terms, for the real part, each processor

    consists of two permutation network adders in cascade. For the imaginary part,

    each processor consists of a permutation network adder and a permutation network

    subtracter. Therefore, a total of 4N processors are required by this algorithm.
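The following sketch (ours) mirrors Algorithm 3 at the word level: the real part of the result is the sum of two real inner products and the imaginary part is their difference, exactly the four real inner products of Equation (11); in the hardware model each of them runs on a residue-domain cascade as in Section 3.B, whereas here they are written as ordinary sums.

```python
def complex_inner_product(U, V):
    """Algorithm 3 in software.  U and V are lists of (re, im) pairs; the real
    and imaginary parts are formed from four real inner products (Equation (11))."""
    re = sum(ur * vr + ui * vi for (ur, ui), (vr, vi) in zip(U, V))
    im = sum(ui * vr - ur * vi for (ur, ui), (vr, vi) in zip(U, V))
    return re, im

U = [(1, 2), (3, -1)]
V = [(2, -1), (1, 4)]
print(complex_inner_product(U, V))   # -> (-1, -8)
# Cross-check against Python's built-in complex arithmetic: sum of U_j * conj(V_j).
print(sum(complex(*u) * complex(*v).conjugate() for u, v in zip(U, V)))
```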

    4 Matrix Multiplication

    In this section, we use permutation network inner product processors to carry out

    real and complex matrix multiplications.

    A. Real Matrix Multiplication

Let A and B be N × N real matrices and let C = A × B. Let Aj be the jth row of A and Bk be the kth column of B. C can be computed by using the following procedure:

For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Cj,k = Aj · Bk
    Endfor
Endfor.

Hence, if we use N^2 inner product processors in parallel, the multiplication of two

    real matrices can be computed in O(1) steps as follows.

    Algorithm 4 (Computing the real matrix multiplication of two real ma-

    trices)

Input: Two N × N real matrices A and B. Each element of A and B is an n-bit 2's complement number.

Output: A real product matrix C = A × B.

Method:

    Step 1: Convert each element of A and B into its corresponding residue code in

    parallel.

Step 2: Compute the N^2 inner products using Algorithm 2.

    A schematic version of this algorithm is depicted in Figure 8. As can be seen from

    the figure, all inner product processors operate in parallel. Since each executes in

O(1) time, Algorithm 4 takes O(1) time to execute. It needs a total of N^2 inner product processors, each having O(N) cost, and hence its cost is O(N^3).

Figure 8: Real matrix multiplication on permutation network inner-product processors. (Each rectangular box represents a permutation network inner-product processor.)
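A word-level software rendering (ours) of Algorithm 4 is shown below: every entry Cj,k is an independent inner product Aj · Bk, so the two "in parallel" loops of the procedure above simply become ordinary loops here; in the hardware model each (j, k) pair is assigned to its own inner-product processor.

```python
def matrix_multiply(A, B):
    """Algorithm 4 in software: C[j][k] = A_j . B_k, one inner product per entry.
    In the model the N^2 inner products are computed by N^2 processors in parallel."""
    N = len(A)
    cols = [[B[i][k] for i in range(N)] for k in range(N)]   # columns of B
    return [[sum(a * b for a, b in zip(A[j], cols[k])) for k in range(N)]
            for j in range(N)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matrix_multiply(A, B))   # -> [[19, 22], [43, 50]]
```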

    B. Complex Matrix Multiplication

    Now we consider the multiplication of two complex matrices. Let R = A+ iB and

S = C + iD be two complex matrices of size N × N. Let T = RS = E + iF be the product matrix. Then

T = RS = (A + iB)(C + iD) = (AC − BD) + i(AD + BC),     (15)

    and hence the complex matrix multiplication is reduced to four real matrix multi-

plications, one addition, and one subtraction. Let Aj and Bj denote the jth rows of A and B, respectively, and let Ck and Dk denote the kth columns of C and D, respectively. Thus, to compute E = AC − BD, we use the following procedure.

For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Ej,k = AjCk − BjDk
    Endfor
Endfor.

    The inner equation is a difference of two inner products. Comparing this to the

    imaginary part of the complex inner product, it is easy to see that they have the

    same algebraic form. Therefore, matrix E can be computed using Algorithm 4 by

    replacing the inner product processor in the algorithm with the imaginary part of

    the complex inner product processor shown in Figure 7. This amounts to replacing

    each of the inner product processors in Figure 8 by the imaginary inner product

    processor given in Figure 7.


Likewise, to compute F = AD + BC, we use the procedure:

For j = 1 to N do in parallel
    For k = 1 to N do in parallel
        Fj,k = AjDk + BjCk
    Endfor
Endfor.

    The inner equation entails two inner products as in the real part of the complex

    inner product processor shown in Figure 6. Thus F can be computed using Algo-

    rithm 4.

Combining these facts, we conclude that the multiplication of two N × N complex matrices can be carried out in O(1) time using O(N^3) permutation network processors.
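To summarize the complex case in software (our sketch, following Equation (15)): the four real matrix products AC, BD, AD and BC are combined entrywise, a subtraction giving E and an addition giving F; in the model each entry of E corresponds to one imaginary-part processor (Figure 7) and each entry of F to one real-part processor (Figure 6).

```python
def complex_matrix_multiply(A, B, C, D):
    """T = (A + iB)(C + iD) = (AC - BD) + i(AD + BC)  (Equation (15)).
    Every entry of E and F is a difference or a sum of two real inner products."""
    N = len(A)
    def dot(row, mat, k):                       # row . (k-th column of mat)
        return sum(row[i] * mat[i][k] for i in range(N))
    E = [[dot(A[j], C, k) - dot(B[j], D, k) for k in range(N)] for j in range(N)]
    F = [[dot(A[j], D, k) + dot(B[j], C, k) for k in range(N)] for j in range(N)]
    return E, F

A, B = [[1, 0], [0, 1]], [[0, 1], [1, 0]]
C, D = [[2, 3], [4, 5]], [[1, 1], [1, 1]]
print(complex_matrix_multiply(A, B, C, D))   # -> ([[1, 2], [3, 4]], [[5, 6], [3, 4]])
```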

    5 Concluding Remarks

    In this paper, we proposed permutation network processors to compute algebraic

    sums, inner and matrix products. It has been shown that the algebraic sum of N n-

    bit 2’s complement numbers can be computed inO(1) time on such a processor with

    O(N) cost. The inner product of two N -element vectors (both real and complex)

    with n-bit elements can also be done in O(1) time using a similar processor with

    O(N) cost. On the other hand, the matrix multiplication takes O(1) time but with

O(N^3) permutation network processors.

    These results are important in that they establish that one can avoid using conven-

    tional adder and multiplier circuits to carry out vector and matrix computations.

    It will be worthwhile to extend them to other computations such as discrete trans-

    forms, convolutions, and correlation computations. We anticipate that all of these

computations can also be done in O(1) time and O(N^2) cost, and we plan to present

    our results on these computations in another place.


References

[1] Selim G. Akl, “The design and analysis of parallel algorithms,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1989.

[2] D. M. Champion and J. Rothstein, “Immediate parallel solution of the longest common subsequence problem,” Proc. IEEE Int. Conf. Parallel Processing, pp. 70-77, 1987.

[3] Gen-Huey Chen, Bing-Feng Wang, and Chi-Jen Lu, “On the parallel computation of the algebraic path problem,” IEEE Trans. Parallel and Distributed Systems, vol. 3, pp. 251-256, March 1992.

[4] K. M. Elleithy, M. A. Bayoumi, and K. P. Lee, “θ(lg N) architecture for RNS arithmetic decoding,” IEEE 9th Symposium on Computer Arithmetic, pp. 202-209, 1989.

[5] John L. Gustafson, “Computer-intensive processors and multicomputers,” in Parallel Processing for Supercomputers and Artificial Intelligence, Kai Hwang and Douglas Degroot, Eds., pp. 169-201, 1989.

[6] Kai Hwang, “Computer arithmetic: principle, architecture, and design,” John Wiley & Sons Pub., 1979.

[7] Kai Hwang and Faye A. Briggs, “Computer architecture and parallel processing,” McGraw-Hill Pub., 1984.

[8] S. Lakshmivarahan and Sudarshan K. Dhall, “Analysis and design of parallel algorithms,” McGraw-Hill Pub., 1990.

[9] Ming-Bo Lin and A. Yavuz Oruç, “The design of a network-based arithmetic processor,” UMIACS-TR-91-141, University of Maryland, College Park, MD, October 1991.

[10] Ming-Bo Lin, “Unified Algebraic Computations on Permutation Networks,” Ph.D. Dissertation, Electrical Engineering Department, University of Maryland, College Park, 1992.

[11] Massimo Maresca and Hungwen Li, “Connection autonomy in SIMD computers: A VLSI implementation,” Journal of Parallel and Distributed Computing, vol. 7, pp. 302-320, 1989.

[12] R. M. Merseran and D. E. Dudgeon, “Multidimensional signal processing,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1985.

[13] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. F. Stout, “Data movement operations and applications on reconfigurable VLSI arrays,” Proc. IEEE Int. Conf. Parallel Processing, pp. 205-208, 1988.

[14] A. V. Oppenheim and R. W. Schafer, “Digital signal processing,” Prentice-Hall Pub., Englewood Cliffs, N. J., 1975.

[15] A. Yavuz Oruç, Vinod G. J. Peris, and M. Yaman Oruç, “Parallel modular arithmetic on a permutation network,” Proceedings of 1991 International Conference on Parallel Processing, vol. 1, pp. 706-707, St. Charles, IL, Aug. 1991.

[16] Vinod G. J. Peris, “Fast modular arithmetic on permutation networks,” Master's Thesis, Electrical Engineering Department, University of Maryland, College Park, MD, 1992.

[17] Stanislaw J. Piestrak, “Design of residue generators and multi-operand modular adders using carry-save adders,” IEEE 10th Symposium on Computer Arithmetic, pp. 100-107, 1991.

[18] R. J. Schalkoff, “Digital image processing and computer vision,” John Wiley & Sons Pub., Inc., New York, 1989.

[19] W. Shen and A. Yavuz Oruç, “Mapping algebraic formulas onto mesh connected processor networks,” Information Sciences and Systems, Princeton University, N.J., pp. 535-538, 1986.

[20] S. P. Smith and H. C. Torng, “Design of a fast inner product processor,” Proc. IEEE 7th Symposium on Computer Arithmetic, pp. 38-43, 1985.

[21] E. E. Swartzlander, Jr., Barry K. Gilbert, and Irving S. Reed, “Inner product computers,” IEEE Trans. Computers, vol. C-27, pp. 21-31, January 1978.

[22] B. F. Wang, C. J. Lu, and G. H. Chen, “Constant time algorithms for the transitive closure problem and its applications,” Proc. IEEE Int. Conf. Parallel Processing, pp. 52-59, 1990.

[23] B. F. Wang, G. H. Chen, and F. C. Lin, “Constant time sorting on a processing array with a reconfigurable bus system,” Information Processing Letters, vol. 34, pp. 187-192, 1990.

[24] B. F. Wang and G. H. Chen, “Constant time algorithms for the transitive closure and some related graph problems on processor arrays with reconfigurable bus systems,” IEEE Trans. Parallel and Distributed Systems, vol. 1, pp. 500-507, October 1990.

[25] B. F. Wang and G. H. Chen, “Two-dimensional processor array with a reconfiguration bus system is at least as powerful as CRCW model,” Information Processing Letters, vol. 36, pp. 31-36, October 1990.

[26] B. F. Wang, G. H. Chen, and Hungwen Li, “Configurational computation: a new computation method on processor arrays with reconfigurable bus systems,” Proc. IEEE Int. Conf. Parallel Processing, vol. III, pp. 42-49, 1991.