
Mapping of Neural Networks onto the Memory-Processor Integrated Architecture*

Youngsik Kim ([email protected])
Mi-Jung Noh ([email protected])
Tack-Don Han ([email protected])
Shin-Dug Kim† ([email protected])

† Corresponding Author

Dept. of Computer Science, Yonsei University
134, Shinchon-Dong, Seodaemun-Ku, Seoul 120-749, Korea
Tel.: +82-2-361-2715  Fax: +82-2-365-2579

Submitted to Neural Networks in March 1997
Revised in May 1998

* A preliminary version of this paper appeared in Proc. Int'l Conf. Neural Networks '97.

* This study was supported by the academic research fund of the Ministry of Education, Republic of Korea, through the Inter-University Semiconductor Research Center (ISRC 97-E-2022) at Seoul National University.


ABSTRACT

In this paper an effective memory-processor integrated architecture, called memory-based processor array for artificial neural networks (MPAA), is proposed. The MPAA can be easily integrated into any host system via a memory interface. Specifically, the MPAA system provides an efficient mechanism for local memory accesses on both a row basis and a column basis, using hybrid row and column decoding; this is well suited to the computation model of ANNs, i.e., the accessing and alignment patterns of matrix-by-vector operations. Mapping algorithms that implement the multilayer perceptron with backpropagation learning on the MPAA system are also provided. The proposed algorithms support both neuron and layer level parallelism, which allows the MPAA system to operate the learning phase as well as the recall phase in a pipelined fashion. Performance is evaluated by detailed comparison in terms of two metrics: cost and number of computation steps. The results show that the proposed architecture and algorithms outperform previous approaches such as one-dimensional single instruction multiple data (SIMD) arrays, two-dimensional SIMD arrays, systolic ring structures, and hypercube machines.

Key words: parallel processing, memory-processor integration, multilayer perceptron, backpropagation learning, algorithmic mapping.


1 Introduction

Artificial neural networks (ANNs) have been widely used in various applications such as pattern classification, speech recognition, machine vision, optimization, matching, image restoration, and so forth. Many algorithmic mapping techniques that implement ANNs on available parallel architectures, exploiting the inherent parallelism of ANNs, have been reported (El-Amawy & Kulasinghe, 1997; Ghosh & Hwang, 1989; Kumar, Shekhar, & Amin, 1994b; Kung & Hwang, 1989; Lin, Prasanna, & Przytula, 1991; Malluhi, Bayoumi, & Rao, 1995; Nordstrom & Svensson, 1992; Singer, 1990; Svensson & Nordstrom, 1990; Wah & Chu, 1990).

A number of algorithms mapped onto various architectures were surveyed in (Nordstrom & Svensson, 1992). Examples of algorithmic mapping schemes are the implementation of ANNs on two-dimensional single instruction multiple data (SIMD) arrays (Lin, Prasanna, & Przytula, 1991; Singer, 1990), one-dimensional SIMD arrays (Svensson & Nordstrom, 1990), cascaded systolic ring arrays (Kung & Hwang, 1989), hypercube architectures (Kumar, Shekhar, & Amin, 1994b; Malluhi, Bayoumi, & Rao, 1995), multicomputers (Ghosh & Hwang, 1989; Wah & Chu, 1990), and multiple bus systems (El-Amawy & Kulasinghe, 1997). The mapping algorithms proposed in (Lin, Prasanna, & Przytula, 1991) are efficient for a network topology as long as the interconnections among neurons are sparse. However, this scheme needs a large number of processors ($O(N^2)$, where $N$ is the number of neurons at the largest layer). In order to improve the inefficient inter-processor communication of one-dimensional SIMD arrays, adder tree hardware was proposed in (Svensson & Nordstrom, 1990). A mapping technique on hypercube architectures (Malluhi, Bayoumi, & Rao, 1995) achieves the optimal number of computation steps ($O(\log_2 N)$) in spite of requiring a large number, $4N^2$, of processors. In (Kumar, Shekhar, & Amin, 1994b), a mapping technique called checkerboarding on hypercube and related architectures was proposed; checkerboarding can avoid the all-to-all broadcast operation. A mapping scheme implemented on multiple bus systems (El-Amawy & Kulasinghe, 1997) has a relative merit over the checkerboarding scheme. However, the processors in multiple bus systems must support the complex


communication of the dynamic interconnection structure. An analytical model assessing the performance of ANNs implemented on linear arrays was presented in (Naylor & Jones, 1994). This paper gives further consideration to four typical mapping schemes, on two-dimensional SIMD arrays (Singer, 1990), one-dimensional SIMD arrays (Svensson & Nordstrom, 1990), cascaded systolic ring arrays (Kung & Hwang, 1989), and hypercube architectures (Malluhi, Bayoumi, & Rao, 1995), in Section 4.1.

Because current memory technology can support gigabit DRAMs, a single memory chip will be able to cover the memory volume needed by future computer systems. A number of studies (Aimoto et al., 1996; Elliott, Snelgrove, & Stumm, 1992; Gokhale, Holmes, & Iobst, 1995; Inoue, Nakamura, & Kawai, 1995; Kogge, 1994; Shimizu et al., 1996; Yamashita et al., 1994) on memory-logic integration have utilized both the high internal memory bandwidth and the available chip density. For computer graphics, a large amount of DRAM and a small number of logic circuits are integrated into a 3-D DRAM chip (Inoue, Nakamura, & Kawai, 1995). A processor-memory integration on a chip (Shimizu et al., 1996) and multiple instruction stream multiple data stream (MIMD) multiprocessors with on-chip local memories (Kogge, 1994) were proposed in order to overcome the low bandwidth to local memory. Also, memory-processor integrated arrays that integrate SIMD processors and their local memories within a chip have been proposed, such as computational RAM (C-RAM) (Elliott, Snelgrove, & Stumm, 1992), the integrated memory array processor (IMAP) (Yamashita et al., 1994), processing in memory (PIM) (Gokhale, Holmes, & Iobst, 1995), and parallel image processing RAM (PIP-RAM) (Aimoto et al., 1996). However, an algorithmic study of ANNs has not been applied to the aforementioned memory-processor integrated architectures.

In this paper, a memory-processor integrated architecture that efficiently supports the computation model of ANNs, called memory-based processor array for ANNs (MPAA), is proposed. Parallel algorithms for the multilayer perceptron with backpropagation learning are also mapped onto the MPAA system. The proposed architecture and its associated algorithms show several advantages. First, previous architectures and algorithms providing synaptic weight level parallelism, e.g., two-dimensional SIMD arrays (Lin, Prasanna, & Przytula, 1991; Singer, 1990) and the hypercube MIMD architecture (Malluhi, Bayoumi, & Rao, 1995), often need a large number of processors, but the proposed architecture and algorithms, which practically support neuron and layer level parallelism, need only a moderate number of processors ($O(N)$, where $N$ is the number of neurons at the largest layer). Second, in the MPAA system, any interaction involving programs and data between the host processor and the MPAA system is resolved by means of simple memory reads and writes. Also, in any given execution cycle each processing unit (PU) of the MPAA system can execute a single instruction with an operand fetch in an overlapped fashion, making effective use of the high bandwidth provided by the local memory.

Third, the MPAA system provides an efficient mechanism for memory accesses on a row or column basis, which is suitable for the computation model of ANNs. Because the basic computation of ANNs can be represented by a series of matrix-by-vector and transposed matrix-by-vector multiplications, where the matrices contain the synaptic weights and the vectors contain activation values or error values, architectures for ANNs have to support the accessing and alignment patterns of matrix-by-vector operations (Nordstrom & Svensson, 1992). The memory-processor integrated arrays in (Aimoto et al., 1996; Elliott, Snelgrove, & Stumm, 1992; Gokhale, Holmes, & Iobst, 1995; Yamashita et al., 1994) can access memory only on a row basis, but the MPAA system can access memory on a row and/or column basis by using hybrid row and column decoding. This capability can replace some patterns of inter-PU communication with simple memory reads and writes. Therefore, the MPAA system efficiently supports matrix-by-vector multiplications without any inter-PU communication, and provides a new computation method for various linear algebra applications as well as ANN computations.
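To state these two kernels concretely, the fragment below is a minimal NumPy illustration, not part of the MPAA design itself; on the MPAA, the transposed product is realized by column-basis reads of the same stored matrix rather than by an explicit transposition.

import numpy as np

# Illustrative only: the two kernels named above, as NumPy products.
# On the MPAA, W @ x is computed with row-basis reads and W.T @ d with
# column-basis reads of the same stored W, so no transpose is formed.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # synaptic weights w_ij[l]
x = rng.standard_normal(4)        # activation values x_j[l-1]
d = rng.standard_normal(3)        # error values delta_i[l]

h = W @ x                         # matrix-by-vector (recall phase)
e = W.T @ d                       # transposed matrix-by-vector (learning phase)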

Fourth, the proposed algorithms can exploit both neuron and layer level parallelism by using the architectural features of the MPAA. In the MPAA, pattern level pipelining can be applied to the learning phase as well as the recall phase. Thus the number of computation steps for ANNs on the MPAA system is small compared with that of other approaches, except for hypercube architectures. In general, the number of computation steps alone does not provide meaningful and fair information for comparing various architectures with their corresponding algorithms.


Figure 1: The MPAA system architecture. (a) The conceptual memory structure: the HP with its HM, and the MPAA (an IU plus an array of PUs with their PMs) on the system bus, sharing a single address space; (b) the interface unit (PUAB: PU address bus, PUCB: PU command bus).

In this paper, in order to perform a fair comparison, a cost function is used, which weights the number of computation steps by the number of processors. The performance of the ANN algorithms mapped onto the MPAA is thus compared in detail with that of four typical schemes (Kung & Hwang, 1989; Malluhi, Bayoumi, & Rao, 1995; Singer, 1990; Svensson & Nordstrom, 1990) in terms of cost as well as the number of computation steps. The MPAA system reduces the cost of the other architectures and their corresponding algorithms (Kung & Hwang, 1989; Malluhi, Bayoumi, & Rao, 1995; Singer, 1990; Svensson & Nordstrom, 1990) by about 24.81%-98.49%.

In the following section, the MPAA system is described. In Section 3, the algorithms for the multilayer perceptron with backpropagation learning mapped onto the MPAA system are proposed. In Section 4, the MPAA system with the proposed algorithms is compared with previous architectures and algorithms. Finally, Section 5 provides a conclusion.

2 The MPAA System Architecture

This section describes the architectural features of the MPAA system, together with the design issues involved in constructing it. An effective interfacing mechanism with a host system is designed as the basic building block of a complete system construction. Also, the structure of the memory decoding logic configured over the PUs is designed.


2.1 Overview of the MPAA System Architecture

The MPAA system is designed to overcome some inefficient mechanisms of conventional SIMD machines in performing ANN applications. Specifically, the design objectives of the MPAA system are: 1) it should be easily integrated into any host system, from small personal computers to multiprocessors; 2) it should incur minimum interaction overhead between the MPAA and the host; 3) it should present a transparent structure to programmers under the conventional programming model; 4) it should allow multiple programs to be run in a time-multiplexed fashion without any program or data reloading; and 5) it should be constructed to utilize the inherent bandwidth of the memory structure, eventually as a memory-based processor array suitable for the computation model of ANNs.

The overall system structure consists of a host processor (HP), the HP memory module (HM), a system bus, and an MPAA system, as shown in Figure 1 (a). Thus, the MPAA system can be interfaced to any host system via its system bus. The MPAA system is constructed as an interface unit (IU) and an array of processing units (PUs) with their associated PU memory modules (PMs), as in Figure 1 (a). The IU consists of interface logic (IL), an interface processor (IP), and an IP memory module (IM) as its associated memory, as shown in Figure 1 (b). In this system approach, system memory is physically divided into two modules, i.e., the HM and the shared memory (SM), as shown in Figure 1 (a). Here, the HM is the main memory dedicated to the HP, and the SM is the set of PMs and the IM shared between the HP and the PUs. In other words, the SM can be accessed by either the HP or the PUs, exclusively; it is therefore constructed as a dual-ported memory structure. From the viewpoint of the HP, the SM is a portion of the HP's single linear address space. From the viewpoint of each PU, however, the SM is divided into independent PMs associated with each PU and an IM associated with the IP. Thus, each PU can access its own PM and the IP can access its own IM.

The IP, as the control unit of the MPAA system, controls the operation of every PU and interacts with the HP. The IL coordinates SM accesses between the HP and the PUs by using an enable signal. Thus, the MPAA system can be configured in two different operational modes, i.e., simply as memory or as a SIMD array. First, the MPAA system can be configured


as a portion of the HP's memory. The HP inputs and outputs data to and from the MPAA system in the form of memory reads and writes under the arbitration of the IL. Therefore, the MPAA system is seen as part of the contiguous host memory address space. Second, the MPAA system performs any data parallel operation as a SIMD array. Every PU in the MPAA system can access its associated PM and can thus execute a broadcast SIMD instruction on its own data. Data parallel code blocks, including any data required, can be stored in the IM and PMs (i.e., the SM) at program loading time.

2.2 System Operation

In this system approach, there exist three different types of processors: the HP, the IP, and the PUs. Application programs can be classified into two major kinds of code blocks, i.e., sequential code blocks performed by the HP and data parallel code blocks processed by the PUs. As the interaction mechanism, control transfer between the HP and the MPAA system is performed via the conventional subroutine calling mechanism, making the MPAA system transparent to programmers. This type of control transfer is called an MPAA subroutine call, to differentiate it from a conventional subroutine call. The overall execution flow of the MPAA system can be described by the following steps.

First, the HP compiles an application program and stores it on secondary storage. When the program is executed, the HP loads it into memory, i.e., the HM and SM. When the program is loaded, the sequential code blocks are loaded into the HM, and the parallel code and data blocks are loaded into the IM and PMs of the SM, respectively. The single address space viewed by the HP can be mapped onto the set of PMs by locating each word linearly across the PMs. Parallel code blocks are formed as a set of MPAA subroutines. The HP then starts executing the sequential code blocks. When the HP encounters a calling instruction that initiates the MPAA, control is transferred to the IP. The HP suspends its operation until the MPAA completes the execution of that subroutine. When an MPAA subroutine call is invoked, the branch target is the memory address in the IM corresponding to that subroutine, and this address is transferred to the IP. The IP then starts executing instructions in


the IM. Here, the IP sequentially broadcasts parallel instructions to the PUs as needed. The PUs then execute the broadcast instructions on their own data in the PMs. When the IP completes the MPAA subroutine called by the HP, control is transferred back to the HP. In the following subsection, the internal structure of the MPAA system is introduced.
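As an illustration of this control-transfer protocol, the following Python sketch models an MPAA subroutine call in software; all class and method names are hypothetical, not part of the paper.

# Illustrative software model of the MPAA subroutine call described above;
# names are hypothetical.
class MPAAModel:
    def __init__(self):
        self.im = {}                      # IM: MPAA subroutines keyed by address

    def load(self, addr, instructions):   # done at program loading time
        self.im[addr] = instructions

    def subroutine_call(self, addr):
        # Control transfers from the HP to the IP; the HP stays suspended
        # until every instruction of the subroutine has been broadcast.
        for instr in self.im[addr]:
            self.broadcast(instr)         # IP -> all PUs, SIMD style
        # returning here models control transferring back to the HP

    def broadcast(self, instr):
        print("PUs execute", instr, "on their own PM data")

mpaa = MPAAModel()
mpaa.load(0x1000, ["load w", "multiply", "accumulate"])
mpaa.subroutine_call(0x1000)              # invoked by the HP like a subroutine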

2.3 Structure of the MPAA System

The memory-processor integrated arrays in (Aimoto et al., 1996; Elliott, Snelgrove, & Stumm, 1992; Gokhale, Holmes, & Iobst, 1995; Yamashita et al., 1994) can access their local memories only through row-by-row decoding in the SIMD execution mode. For any selected row, every processor located at each memory column can perform the same operation in parallel. However, the memory structure of the MPAA system is constructed as a two-dimensional arrangement of memory blocks divided by the number of PUs, as shown in Figure 2 (a). Each memory block consists of a certain number of memory cells. For any given memory block row or column address, every PU attached to a memory block column (row) can access, in parallel, the memory location specified by the memory block row (column) address. Thus, every PU can access memory either by the memory block row address or by the memory block column address, selectively, by using the multiplexer and demultiplexer. This type of memory accessing pattern is supported by constructing the decoding logic for row and column basis access as shown in Figure 2 (a).

In the MPAA system, a group of PUs with their associated memories can be integrated into a single chip; this group is called a memory-based processor array block (MPAB). Figure 2 (b) shows an example with MPABs each consisting of four PUs, for simplicity. Each PU is constructed as an ALU including a multiplier and an adder, a set of registers, and two inter-layer connection ports that connect to PUs in neighboring MPABs, as shown in Figure 2 (b). An MPAB can perform the processing given to a single layer of a multilayer perceptron with backpropagation learning, and each PU can perform the operations assigned to one neuron for neuron level parallelism.


Figure 2: The MPAA system. (a) A memory-based processor array block (MPAB): an R x C array of memory blocks of r x c cells each, with sense amplifiers and mux/demux logic per block, top row and column decoders, and one PU per block column; (b) a configuration of the multi-MPAB system (MPAB[l-1], MPAB[l], MPAB[l+1]).

Thus, the MPAA system can be constructed by using the same number of MPABs as the number of layers required for a given problem. Each PU in an MPAB is connected to PUs in neighboring MPABs by a bi-directional communication path, as shown in Figure 2 (b).

For a given multilayer perceptron algorithm, the following variables are defined to explain the configuration of the multi-MPAB system.

• $L$: the number of layers of a given artificial neural network. The input layer is labeled 0 and is not counted in $L$. The output layer is labeled $L$, and layers 1 to $L-1$ are the hidden layers.

• $N_l$: the number of neurons at the $l$-th layer ($0 \le l \le L$). The neurons between adjacent layers are assumed to be fully connected.

• $R \times C$: the number of memory blocks of MPAB[$l$] at the $l$-th layer, where $R$ and $C$ are the numbers of memory block rows and columns, respectively. The MPAB of Figure 2 (b) has $4 \times 4$ memory blocks, represented by the large boxes.

• $r \times c$: the number of memory cells in a memory block, where $r$ and $c$ are the numbers of memory cell rows and columns, respectively. A memory block of Figure 2 (b) has $2 \times 2$ memory cells, represented by the small dashed boxes.

Figure 3: Address format: fields $S$, $A_R$, $A_C$, and $A_r$, of widths 1, $\log_2 R$, $\log_2 C$, and $\log_2 r$ bits, respectively.

By the above definitions, the minimum number of MPABs required is $L$; the minimum number of PUs required in MPAB[$l$] is $\max(N_{l-1}, N_l)$; MPAB[$l$] at the $l$-th layer should have $rR \times cC$ memory cells in total; the number of memory cell columns of a memory block should match the width of a PU data path; and $R \times C$ should be no smaller than $N_l \times N_{l-1}$.
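These sizing rules can be checked mechanically. The following Python sketch computes the minimum resources implied by the rules above for a given network; the function and field names are illustrative only.

def min_mpaa_resources(layer_sizes, r, c):
    """Minimum MPAA resources implied by the rules above.
    layer_sizes = [N_0, ..., N_L]; r x c memory cells per block.
    (Function and field names are illustrative.)"""
    L = len(layer_sizes) - 1                    # the input layer is not counted in L
    configs = []
    for l in range(1, L + 1):
        n_prev, n_cur = layer_sizes[l - 1], layer_sizes[l]
        R, C = n_cur, n_prev                    # block (i, j) holds w_ij[l], so R*C >= N_l * N_{l-1}
        configs.append({"MPAB": l,
                        "min_PUs": max(n_prev, n_cur),   # max(N_{l-1}, N_l)
                        "blocks": (R, C),
                        "cells": (r * R, c * C)})        # the rR x cC cells of MPAB[l]
    return L, configs                           # at least L MPABs are required

# The three-layer network of Figure 4 (N_0..N_3 = 3, 4, 3, 2) with 2 x 2 blocks:
print(min_mpaa_resources([3, 4, 3, 2], r=2, c=2))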

2.4 Addressing Mechanism

When the MPAA system is used simply as memory, both the row decoder and the column decoder in Figure 2 (a) operate as a conventional memory decoder. However, when the MPAA system is used as a SIMD array, every PU can access the data specified by a memory block row or a memory block column and can process a SIMD operation. For the SIMD mode, an MPAA address is generated from an address format consisting of four fields, $S$, $A_R$ (the address of a memory block row), $A_C$ (the address of a memory block column), and $A_r$ (the address of a memory cell row within a given memory block), as shown in Figure 3. $S$ is a one-bit field which selects either a memory block row or a memory block column. If $S$ is zero, the memory block column addressed by $A_C$ is selected and $A_R$ is ignored. Otherwise, the memory block row addressed by $A_R$ is selected and $A_C$ is ignored. The fourth field, $A_r$, selects a memory cell row within each memory block addressed by one of the two fields $A_R$ and $A_C$.

Actual MPAA addresses, $A_{MPAA}$, in row-major order for the SIMD mode are obtained as

$$A_{MPAA} = S \cdot (A_R \cdot r c + c \cdot i) + (1-S) \cdot (A_C \cdot c + r c C \cdot j) + A_r \cdot c C + k, \qquad (1)$$

where $i = 0, 1, \ldots, C-1$, $j = 0, 1, \ldots, R-1$, and $k = 0, 1, \ldots, c-1$.


Figure 4: Multilayer perceptron with backpropagation learning: a three-layer example with layer sizes $N_0 = 3$, $N_1 = 4$, $N_2 = 3$, $N_3 = 2$ and weight matrices $W[1]$ ($4 \times 3$), $W[2]$ ($3 \times 4$), and $W[3]$ ($2 \times 3$); e.g., $x_3[2] = f(\sum_j w_{3j}[2]\, x_j[1])$.

In the example of Figure 2 (b), if the four fields of the address format in the SIMD mode are (1, 2, x, 0), each PU accesses the memory cell located at the first memory cell row within each memory block of the third memory block row. Likewise, if the four fields are (0, x, 1, 1), each PU accesses the memory cell located at the second memory cell row within each memory block of the second memory block column.
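The address computation of Equation (1) and the two example accesses can be expressed directly; the following Python sketch is a transcription for illustration (the function name is not from the paper).

def mpaa_address(S, AR, AC, Ar, R, C, r, c, i, j, k):
    """Equation (1): actual MPAA address in row-major order for SIMD mode.
    i in [0, C), j in [0, R) (R only bounds j), k in [0, c)."""
    return (S * (AR * r * c + c * i)
            + (1 - S) * (AC * c + r * c * C * j)
            + Ar * c * C + k)

# The (1, 2, x, 0) example: S = 1 selects memory block row AR = 2 and
# cell row Ar = 0; each PU i (one per block column) gets one address.
row_access = [mpaa_address(1, 2, 0, 0, R=4, C=4, r=2, c=2, i=i, j=0, k=0)
              for i in range(4)]
# The (0, x, 1, 1) example: S = 0 selects block column AC = 1, cell row Ar = 1.
col_access = [mpaa_address(0, 0, 1, 1, R=4, C=4, r=2, c=2, i=0, j=j, k=0)
              for j in range(4)]
print(row_access, col_access)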

3 Mapping Algorithms to MPAA

In this section, the general ANN model is presented and classified into two major phases. Effective mapping algorithms are then designed and applied to the MPAA system. These algorithms are based on both neuron and layer level parallelism and exploit layer level pipelined operation.

3.1 ANN Model

The ANN computation can be classified into two phases: the recall phase and the learning phase. The recall phase updates the activation values of the neurons at each layer according to the network topology; this is the forward procedure. An example of a three-layer perceptron with backpropagation learning is shown in Figure 4. Each neuron, say neuron $i$, at every layer, say


layer $l$, has an activation value $x_i[l]$. The activation value vector $X[l]$ for layer $l$ consists of the elements $x_i[l]$ for $1 \le i \le N_l$. Associated with each connection from neuron $j$ at layer $l-1$ to neuron $i$ at layer $l$ is a synaptic weight $w_{ij}[l]$. The weight matrix $W[l]$ for layer $l$ consists of the elements $w_{ij}[l]$ for $1 \le i \le N_l$ and $1 \le j \le N_{l-1}$. The recall phase can be formally described as

$$x_i[l] = f(h_i[l]) = f\!\left(\sum_{j=1}^{N_{l-1}} w_{ij}[l]\, x_j[l-1]\right), \qquad (2)$$

where $l = 1, 2, \ldots, L$ and $1 \le i \le N_l$; $X[0]$ stands for an input pattern; and $f$ is an activation function, usually the nonlinear sigmoid $f(x) = 1/(1+e^{-x})$, whose derivative is $f' = f(1-f)$.

The learning phase establishes the values of the synaptic weights. The two basic procedures of the learning phase are the forward procedure, identical to the recall phase, and the backward procedure, in which the produced output $x_i[L]$ is compared to the target output $t_i$ and an error value $\delta_i[L]$ is propagated backward to update the weight values. The backward procedure is given by Equations (3) and (4):

$$\delta_j[l-1] = f'(h_j[l-1])\, d_j[l] = f'(h_j[l-1]) \sum_{i=1}^{N_l} w_{ij}[l]\, \delta_i[l], \qquad (3)$$

where $l = L, L-1, \ldots, 2$, $1 \le j \le N_{l-1}$, and $\delta_i[L] = f'(h_i[L])(t_i - x_i[L])$;

$$w_{ij}[l] = w_{ij}[l] + \Delta w_{ij}[l] = w_{ij}[l] + \eta\, \delta_i[l]\, x_j[l-1], \qquad (4)$$

where $l = L, L-1, \ldots, 1$, $1 \le i \le N_l$, and $1 \le j \le N_{l-1}$.
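For reference, Equations (2)-(4) can be rendered compactly in NumPy; the sketch below computes the same quantities that the MPAA algorithms of Section 3.2 distribute across PUs, using per-pattern (stochastic gradient) weight updates.

import numpy as np

f = lambda v: 1.0 / (1.0 + np.exp(-v))            # sigmoid of Eq. (2)
fprime = lambda h: f(h) * (1.0 - f(h))            # f' = f(1 - f)

def recall(W, x0):
    """Forward procedure, Eq. (2); W[l-1] holds the (N_l x N_{l-1}) matrix W[l]."""
    xs, hs = [x0], [None]                         # xs[l] = X[l], hs[l] = h[l]
    for Wl in W:
        hs.append(Wl @ xs[-1])
        xs.append(f(hs[-1]))
    return xs, hs

def learn(W, x0, t, eta=0.1):
    """One backward sweep, Eqs. (3) and (4), after a forward pass."""
    xs, hs = recall(W, x0)
    L = len(W)
    delta = fprime(hs[L]) * (t - xs[L])           # delta_i[L]
    for l in range(L, 0, -1):
        d = W[l - 1].T @ delta                    # d_j[l] = sum_i w_ij[l] delta_i[l]
        W[l - 1] += eta * np.outer(delta, xs[l - 1])   # Eq. (4)
        if l > 1:
            delta = fprime(hs[l - 1]) * d         # Eq. (3)

# Tiny run with the layer sizes of Figure 4 (N_0..N_3 = 3, 4, 3, 2)
rng = np.random.default_rng(0)
W = [rng.standard_normal((4, 3)), rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
learn(W, rng.standard_normal(3), np.array([0.0, 1.0]))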

3.2 Mapping Algorithms

Consider an $L$-layer perceptron with backpropagation learning consisting of $N_l$ neurons at the $l$-th layer ($0 \le l \le L$). For the mapping process, some assumptions about the MPAA system and the ANN applications are stated.

First, each memory block with coordinates $(i, j)$ in MPAB[$l$] holds $w_{ij}[l]$. Second, each PU$_j$[$l$] ($1 \le j \le N_{l-1}$) in MPAB[$l$] uses registers to store $x_j[l-1]$ and $d_j[l]$, and each PU$_i$[$l$] ($1 \le i \le N_l$) in MPAB[$l$] assigns registers to store $\delta_i[l]$ and $h_i[l]$. Third, each PU can perform an operand fetch and a computation in a single cycle. Fourth, two different strategies are in common use for updating the weights in the network. In the first approach, the weights are updated once per presentation of the entire set of training patterns. In the second approach, the network weights are updated continuously, after each training pattern is presented. This method might become trapped by a few atypical training patterns, but its advantage is that it does not need to accumulate the error values over many presented patterns, and it allows a network to learn a given task more quickly if there is a lot of redundant information in the training patterns; a disadvantage is that it requires more steps to update the weights. The first approach is called the true gradient method and the second the stochastic gradient method (Petrowski et al., 1989). The second approach is chosen in this work. Finally, $N$ is taken to be the number of neurons at the largest layer.

To perform the recall phase on the MPAA system, the forward procedure FW-MPAA($l$), given in pseudocode, is called iteratively for $l = 1, 2, \ldots, L$ at lines (1-3) of Algorithm 1 in Figure 5. To process FW-MPAA($l$) in MPAB[$l$], every PU$_j$[$l$], for all $1 \le j \le N_{l-1}$, computes weight-by-activation products in parallel by accessing the memory block row iteratively $N_l$ times, as represented by lines (5-12); these steps are illustrated in Figure 6 (a),(b). Then every PU$_i$[$l$], for all $1 \le i \le N_l$, performs sum-of-products in parallel by accessing the memory block column iteratively $N_{l-1}$ times, as shown by lines (13-19); this action is illustrated in Figure 6 (c). Finally, each PU$_i$[$l$], for all $1 \le i \le N_l$, computes its new activation value and sends it to PU$_i$[$l+1$], also sending $h_i[l]$ in parallel if the learning phase is assumed, as shown by lines (20-26).

Through these steps, the MPAA system does not require any inter-PU communication, such as broadcast or shift operations, to perform the sum-of-products, and in turn eliminates the transposition of the weight matrix. This efficient mechanism applies to the learning phase as well. The number of computation steps required to recall a single input pattern on the MPAA system can therefore be obtained as

$$\sum_{l=1}^{L} \left\{ N_l \Big( \overbrace{1}^{\text{multiply}} + \overbrace{1}^{\text{memory}} \Big) + \overbrace{N_{l-1}}^{\text{add}} + \overbrace{1}^{\text{sigmoid}} + \overbrace{1}^{\text{comm.}} \right\} \le L \Big( 3 \max_{l=0}^{L}(N_l) + 2 \Big) = L(3N+2) = O(N). \qquad (5)$$


Algorithm 1. RECALL PHASE ON MPAA

{ call procedure FW-MPAA iteratively }
1  for l = 1 to L do
2    FW-MPAA(l);
3  endfor  {line 1 for}

{ forward procedure on the MPAA }
4  procedure FW-MPAA(l)
5    for i = 1 to N_l do  {weight-by-activation product}
6      for all 1 ≤ j ≤ N_{l-1} do
7        parbegin
8          • PU_j[l] reads w_ij[l] by the row,
9            PU_j[l] computes w_ij[l] x_j[l-1];
10         • PU_j[l] writes w_ij[l] x_j[l-1] by the row;
11       parend  {line 7 parbegin}
12     endfor  {line 5 for}
13   for j = 1 to N_{l-1} do  {sum-of-products}
14     for all 1 ≤ i ≤ N_l do
15       parbegin
16         • PU_i[l] reads w_ij[l] x_j[l-1] by the column;
17           PU_i[l] computes h_i[l] += w_ij[l] x_j[l-1];
18       parend  {line 15 parbegin}
19     endfor  {line 13 for}
20   for all 1 ≤ i ≤ N_l do  {new activation values}
21     parbegin
22       • PU_i[l] computes x_i[l] = f(h_i[l]);
23       • PU_i[l] sends x_i[l] to PU_i[l+1] in the MPAB[l+1];
24       • if learning phase then
25           PU_i[l] sends h_i[l] to PU_i[l+1] in the MPAB[l+1];
26     parend  {line 21 parbegin}
27 endprocedure  {line 4 procedure}

Algorithm 2. LEARNING PHASE ON MPAA

{ call procedure FW-MPAA iteratively }
1  for l = 1 to L do
2    FW-MPAA(l);
3  endfor  {line 1 for}

{ calculate error vector δ[L] at the output layer }
4  for all 1 ≤ i ≤ N_L do  {for BW}
5    parbegin
6      • PU_i[L] computes (t_i - x_i[L]);
7      • PU_i[L] computes f'(h_i[L]);
8      • PU_i[L] computes δ_i[L] = f'(h_i[L])(t_i - x_i[L]);
9    parend  {line 5 parbegin}

{ call procedure BW-MPAA iteratively }
10  for l = L to 1 do
11    BW-MPAA(l);
12  endfor  {line 10 for}

{ backward procedure on the MPAA }
13  procedure BW-MPAA(l)
14    for j = 1 to N_{l-1} do  {weight-by-error product}
15      for all 1 ≤ i ≤ N_l do
16        parbegin
17          • PU_i[l] reads w_ij[l] by the column,  {for BW}
18            PU_i[l] computes w_ij[l] δ_i[l];  {for BW}
19          • PU_i[l] writes w_ij[l] δ_i[l] by the column;  {for BW}
20          • PU_i[l] writes δ_i[l];  {for updates}
21        parend  {line 16 parbegin}
22      endfor  {line 14 for}
23    for i = 1 to N_l do
24      for all 1 ≤ j ≤ N_{l-1} do
25        parbegin
26          • PU_j[l] reads w_ij[l] δ_i[l] by the row,  {for BW}
27            PU_j[l] computes d_j[l] += w_ij[l] δ_i[l];
28          • PU_j[l] reads δ_i[l];  {for updates}
29          • PU_j[l] computes η δ_i[l];  {for updates}
30          • PU_j[l] computes Δw_ij[l] = η δ_i[l] x_j[l-1];
31          • PU_j[l] reads w_ij[l] by the row,  {for updates}
32            PU_j[l] computes w_ij[l] += Δw_ij[l];
33          • PU_j[l] writes w_ij[l] by the row;  {for updates}
34        parend  {line 25 parbegin}
35      endfor  {line 23 for}
36    for all 1 ≤ j ≤ N_{l-1} do  {error vector at layer l-1}
37      parbegin
38        • PU_j[l] computes f'(h_j[l-1]);
39        • PU_j[l] computes δ_j[l-1] = f'(h_j[l-1]) d_j[l];
40        • PU_j[l] sends δ_j[l-1] to PU_j[l-1] in the MPAB[l-1];
41      parend  {line 37 parbegin}
42  endprocedure  {line 13 procedure}

Figure 5: Algorithms for the recall and the learning phases on the MPAA system.
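To make the access pattern of FW-MPAA concrete, the following sketch simulates one layer of Algorithm 1 in software; the sequential outer loops correspond to the iterated row and column accesses, while the inner loops model what the PUs do in parallel. All names are illustrative.

import numpy as np

def fw_mpaa_layer(W, x_prev, f):
    """Software model (illustrative) of FW-MPAA(l). W has shape (N_l, N_{l-1});
    memory block (i, j) of MPAB[l] holds w_ij[l] and, after the first loop,
    also the product w_ij[l] * x_j[l-1]."""
    N_l, N_prev = W.shape
    P = np.empty_like(W)
    # Lines 5-12: row-basis access; the inner "for all j" loop is what the
    # PUs do in parallel, one PU per memory block column.
    for i in range(N_l):
        for j in range(N_prev):
            P[i, j] = W[i, j] * x_prev[j]
    # Lines 13-19: column-basis access; the inner "for all i" loop is what
    # the PUs do in parallel, one PU per memory block row. No broadcast or
    # shift between PUs is needed to form the sums.
    h = np.zeros(N_l)
    for j in range(N_prev):
        for i in range(N_l):
            h[i] += P[i, j]
    return f(h)                           # lines 20-26: x_i[l] = f(h_i[l])

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
x2 = fw_mpaa_layer(np.ones((3, 4)), np.arange(4.0), sigmoid)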


Figure 6: Computation steps in MPAB[2] for the simple network of Figure 4, showing the data layout in the memory blocks and the operations of PU_1[2] through PU_4[2]: (a) reads w_1j[2] and computes w_1j[2] x_j[1]; (b) writes w_1j[2] x_j[1]; (c) reads w_i1[2] x_1[1] and computes h_i[2] += w_i1[2] x_1[1]; (d) reads w_i1[2] and computes w_i1[2] δ_i[2]; (e) writes w_i1[2] δ_i[2]; (f) writes δ_i[2]; (g) reads w_1j[2] δ_1[2] and computes d_j[2] += w_1j[2] δ_1[2]; (h) reads δ_1[2] and computes Δw_1j[2] = η x_j[1] δ_1[2]; (i) reads w_1j[2] and computes w_1j[2] += Δw_1j[2]; (j) writes w_1j[2].



Algorithm 2 of Figure 5, the learning phase on the MPAA system, consists of three major operations: calling the forward procedure FW-MPAA($l$) of Algorithm 1 in Figure 5 for $l = 1, 2, \ldots, L$ (lines 1-3), finding the error values at the output layer (lines 4-9), and calling the backward procedure BW-MPAA($l$) for $l = L, L-1, \ldots, 1$ (lines 10-12). To process BW-MPAA($l$) in MPAB[$l$], every PU$_i$[$l$], for all $1 \le i \le N_l$, computes weight-by-error products and writes the error values for the weight update process in parallel by accessing the memory block column iteratively $N_{l-1}$ times (lines 14-22); these steps are illustrated in Figure 6 (d-f). Then every PU$_j$[$l$], for all $1 \le j \le N_{l-1}$, performs sum-of-products and updates the weight values in parallel by accessing the memory block row iteratively $N_l$ times, as illustrated in Figure 6 (g-j). Finally, every PU$_j$[$l$], for all $1 \le j \le N_{l-1}$, finds the error values of the lower layer and sends them to PU$_j$[$l-1$] in parallel (lines 36-41).

According to Algorithm 2 of Figure 5, the number of computation steps required to learn a single pattern on the MPAA system can be obtained as

$$\overbrace{\sum_{l=1}^{L}(2N_l + N_{l-1} + 3)}^{\text{forward procedure}} + \overbrace{3}^{\text{calculate }\delta_i[L]} + \overbrace{\sum_{l=1}^{L}\left\{ N_{l-1}\Big(\overbrace{1}^{\text{multiply}} + \overbrace{2}^{\text{memory}}\Big) + N_l\Big(\overbrace{2}^{\text{add}} + \overbrace{2}^{\text{multiply}} + \overbrace{2}^{\text{memory}}\Big) + \overbrace{1}^{\text{sigmoid}} + \overbrace{1}^{\text{multiply}} + \overbrace{1}^{\text{comm.}}\right\}}^{\text{backward procedure}} \le 12LN + 6L + 3 = O(N). \qquad (6)$$

The MPAA system supports layer level parallelism and can perform both the recall and the learning phases in a pipelined fashion. Due to the layer level pipelining, the number of pipeline stages for the recall phase is $L$.


Algorithm 3. PIPELINED RECALL PHASE ON MPAA

{ call procedure FW-MPAA in parallel }
1  for all 1 ≤ l ≤ L do
2    parbegin
3      • FW-MPAA(l);
4    parend  {line 2 parbegin}

Algorithm 4. PIPELINED LEARNING PHASE ON MPAA

{ calculate error vector δ[L] at the output layer }
1  for all 1 ≤ i ≤ N_L do  {for BW}
2    parbegin
3      • PU_i[L] computes (t_i - x_i[L]);
4      • PU_i[L] computes f'(h_i[L]);
5      • PU_i[L] computes δ_i[L] = f'(h_i[L])(t_i - x_i[L]);
6    parend  {line 2 parbegin}

{ call procedure PIPELINED-FW-AND-BW-MPAA in parallel }
7  for all 1 ≤ l ≤ L do
8    parbegin
9      • PIPELINED-FW-AND-BW-MPAA(l);
10   parend  {line 8 parbegin}

{ PIPELINED-FW-AND-BW-MPAA procedure on the MPAA }
11  procedure PIPELINED-FW-AND-BW-MPAA(l)
12    • BW-MPAA(l);
13    • FW-MPAA(l);
14  endprocedure  {line 11 procedure}

Figure 7: Pipelined algorithms for the recall and the learning phases on the MPAA system.

To perform the pipelined recall phase on the MPAA system, as shown in Algorithm 3 of Figure 7, the forward procedure FW-MPAA($l$) is simply called in parallel for all $1 \le l \le L$. However, $(L-1)$ stages of the pipeline must initially be filled. According to Algorithm 3 of Figure 7, the number of computation steps required to recall $p$ patterns can be obtained as

$$\Big(\overbrace{L-1}^{\text{fill the pipeline}} + \overbrace{p}^{p\text{ patterns}}\Big)\Big(\overbrace{3N+2}^{\text{steps at each layer}}\Big) = O(pN). \qquad (7)$$

Because the learning phase consists of the forward and the backward procedures, the number of pipeline stages for the learning phase of the MPAA system supporting layer level parallelism is $2L$. Algorithm 4 of Figure 7, the pipelined learning phase algorithm on the MPAA system, performs the three major operations of the learning phase in parallel at the layer level. However, $(2L-1)$ stages of the pipeline must initially be filled. According to Algorithm 4 of Figure 7, the number of computation steps required to learn $p$ patterns can be obtained as

$$\Big(\overbrace{2L-1}^{\text{fill the pipeline}} + \overbrace{p}^{p\text{ patterns}}\Big)\Big(\overbrace{12N+9}^{\text{steps at each layer}}\Big) = O(pN). \qquad (8)$$
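The step counts of Equations (5)-(8) are easily tabulated; the following Python transcription (function names illustrative) reproduces them for given N, L, and p.

# Step counts of Equations (5)-(8), transcribed directly.
def recall_steps_one(N, L):      return L * (3 * N + 2)                 # Eq. (5)
def learn_steps_one(N, L):       return 12 * L * N + 6 * L + 3          # Eq. (6)
def recall_steps_pipe(N, L, p):  return (L - 1 + p) * (3 * N + 2)       # Eq. (7)
def learn_steps_pipe(N, L, p):   return (2 * L - 1 + p) * (12 * N + 9)  # Eq. (8)

# With the parameter values used in Section 4.2 (N = 64, L = 3, p = 1000):
print(recall_steps_one(64, 3), learn_steps_one(64, 3),
      recall_steps_pipe(64, 3, 1000), learn_steps_pipe(64, 3, 1000))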


4 Comparison with Various Architectures and Mapping Algorithms

In this section, mapping algorithms applied to various parallel machines, including one-dimensional SIMD arrays, two-dimensional SIMD arrays, systolic ring structures, and hypercube systems, are provided and compared in terms of the number of computation steps and the cost. In order to compare the performance of the proposed schemes fairly with that of previous works, several algorithms proposed in (Kung & Hwang, 1989; Singer, 1990; Svensson & Nordstrom, 1990) are rewritten in a manner similar to the algorithms on the MPAA system.

4.1 Previous Works

Singer (1990) presented five algorithms exhibiting weight level parallelism on the Connection Machine. The first method, called grid-based implementation, is considered in this work. The leftmost column of PUs contains the activation value of each input unit, and the topmost row of PUs contains the activation values of the hidden layer. The weight matrix is distributed over the rest of the PUs, so the total number of PUs required is $(N+1)^2$. The forward procedure requires horizontal broadcast and vertical summation, whereas the backward procedure requires vertical broadcast and horizontal summation. This paper rewrites the mapping algorithms of the forward and the backward procedures on the two-dimensional SIMD array (Singer, 1990) in a manner similar to the algorithms on the MPAA system, as shown in Figure 8. Each PU in Algorithms 5 and 6 is labeled from PU_{0,0} to PU_{N,N}.

An algorithm called communication by broadcast (Svensson & Nordstrom, 1990), supporting neuron level parallelism and mapped onto a one-dimensional SIMD array with $N$ PUs, is considered in this work. In the forward procedure, each PU broadcasts its own activation value to all other PUs and then multiplies the broadcast value by a weight value in its local memory, whereas in the backward procedure it multiplies by its own error value and adds the result to a running sum maintained across the PUs instead of within each PU. In order to improve the inefficient inter-PU communication of the backward procedure, adder tree hardware was proposed. Algorithms 7 and 8 of Figure 9 show the rewritten forward and backward procedures on the one-dimensional SIMD array.


Algorithm 5. FORWARD PROCEDURE ON 2-D SIMD

{ forward procedure on the 2-D SIMD }
1  procedure FW-2D-SIMD(l)
2    if l = odd then
3      begin
4        for i = 1 to N_l do
5          for all 1 ≤ j ≤ N_{l-1} do
6            parbegin
7              • PU_{i-1,j} sends down x_j[l-1] to PU_{i,j};
8            parend  {line 6 parbegin}
9          endfor  {line 4 for}
10       for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
11         parbegin
12           • PU_{i,j} reads w_ij[l];
13           • PU_{i,j} computes w_ij[l] x_j[l-1];
14           • if learning phase then
15               PU_{i,j} writes x_j[l-1];
16         parend  {line 11 parbegin}
17       for k = 1 to N_{l-1} do
18         for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
19           parbegin
20             • if j ≤ (N_{l-1} - k + 1) then
21                 PU_{i,j} sends left w_ij[l] x_j[l-1] to PU_{i,j-1};
22             • PU_{i,0} computes h_i[l] += w_ik[l] x_k[l-1];
23           parend  {line 19 parbegin}
24         endfor  {line 17 for}
25       for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
26         parbegin
27           • PU_{i,0} computes x_i[l] = f(h_i[l]);
28           if learning phase then
29             begin
30               • PU_{i,0} writes h_i[l];
31               • PU_{0,j} writes x_j[l-1];
32             endif  {line 29 begin}
33         parend  {line 26 parbegin}
34     endif  {line 3 begin}
35   else  { l = even }
36     OMITTED because of similar code for the case l = odd
37 endprocedure  {line 1 procedure}

Algorithm 6. BACKWARD PROCEDURE ON 2-D SIMD

{ backward procedure on the 2-D SIMD }
1  procedure BW-2D-SIMD(l)
2    if l = odd then
3      begin
4        for j = 1 to N_{l-1} do  {for BW}
5          for all 1 ≤ i ≤ N_l do
6            parbegin
7              • PU_{i,j-1} sends right δ_i[l] to PU_{i,j};
8            parend  {line 6 parbegin}
9          endfor  {line 4 for}
10       for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do  {for BW}
11         parbegin
12           • PU_{i,j} reads w_ij[l];
13           • PU_{i,j} computes w_ij[l] δ_i[l];
14         parend  {line 11 parbegin}
15       for k = 1 to N_l do  {for BW}
16         for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
17           parbegin
18             • if i ≤ (N_l - k + 1) then
19                 PU_{i,j} sends up w_ij[l] δ_i[l] to PU_{i-1,j};
20             • PU_{0,j} computes d_j[l] += w_kj[l] δ_k[l];
21           parend  {line 17 parbegin}
22         endfor  {line 15 for}
23       for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do  {for updates}
24         parbegin
25           • PU_{i,j} computes η δ_i[l];
26           • PU_{i,j} reads x_j[l-1];
27           • PU_{i,j} computes Δw_ij[l] = η δ_i[l] x_j[l-1];
28           • PU_{i,j} reads w_ij[l];
29           • PU_{i,j} computes w_ij[l] += Δw_ij[l];
30           • PU_{i,j} writes w_ij[l];
31         parend  {line 24 parbegin}
32       for all 1 ≤ j ≤ N_{l-1} do  {for BW}
33         parbegin
34           • PU_{0,j} reads h_j[l-1];
35           • PU_{0,j} computes f'(h_j[l-1]);
36           • PU_{0,j} computes δ_j[l-1] = f'(h_j[l-1]) d_j[l];
37         parend  {line 33 parbegin}
38     endif  {line 3 begin}
39   else  { l = even }
40     OMITTED because of similar code for the case l = odd
41 endprocedure  {line 1 procedure}

Figure 8: The forward and the backward procedures on the 2-D SIMD system (Singer, 1990).


Algorithm 7. FORWARD PROCEDURE ON 1-D SIMD

{ forward procedure on the 1-D SIMD }
1  procedure FW-1D-SIMD(l)
2    for j = 1 to N_{l-1} do
3      • PU_j sends x_j[l-1] to Control Unit;
4      • Control Unit broadcasts x_j[l-1] to PU_i (1 ≤ i ≤ N_l);
5      for all 1 ≤ i ≤ N_l do
6        parbegin
7          • PU_i reads w_ij[l];
8          • PU_i computes w_ij[l] x_j[l-1];
9          • PU_i computes h_i[l] += w_ij[l] x_j[l-1];
10       parend  {line 6 parbegin}
11     endfor  {line 2 for}
12   for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
13     parbegin
14       • PU_i computes x_i[l] = f(h_i[l]);
15       if learning phase then
16         begin
17           • PU_i writes h_i[l];
18           • PU_j writes x_j[l-1];
19         endif  {line 16 begin}
20     parend  {line 13 parbegin}
21 endprocedure  {line 1 procedure}

Algorithm 8. BACKWARD PROCEDURE ON 1-D SIMD

{ backward procedure on the 1-D SIMD }
1  procedure BW-1D-SIMD(l)
2    for j = 1 to N_{l-1} do  {for BW}
3      for all 1 ≤ i ≤ N_l do
4        parbegin
5          • PU_i reads w_ij[l];
6          • PU_i computes w_ij[l] δ_i[l];
7          • Adder-Tree hardware computes d_j[l] += w_ij[l] δ_i[l] in log2 N_l steps;
8        parend  {line 4 parbegin}
9    for i = 1 to N_l do  {for updates}
10     • PU_i sends δ_i[l] to Control Unit;
11     • Control Unit broadcasts δ_i[l] to PU_j (1 ≤ j ≤ N_{l-1});
12     for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
13       parbegin
14         • PU_j computes η δ_i[l];
15         • PU_i reads x_j[l-1];
16         • PU_j computes Δw_ij[l] = η δ_i[l] x_j[l-1];
17         • PU_i reads w_ij[l];
18         • PU_i computes w_ij[l] += Δw_ij[l];
19         • PU_i writes w_ij[l];
20       parend  {line 13 parbegin}
21     endfor  {line 9 for}
22   for all 1 ≤ j ≤ N_{l-1} do  {for BW}
23     parbegin
24       • PU_j reads h_j[l-1];
25       • PU_j computes f'(h_j[l-1]);
26       • PU_j computes δ_j[l-1] = f'(h_j[l-1]) d_j[l];
27     parend  {line 23 parbegin}
28 endprocedure  {line 1 procedure}

Figure 9: The forward and the backward procedures on the 1-D SIMD system (Svensson & Nordstrom, 1990).

Kung and Hwang (1989) proposed an algorithm exploiting both neuron and layer level parallelism, mapped onto the cascaded systolic ring structure, which is constructed from as many systolic rings of $N$ PUs as there are layers. The total number of PUs required is therefore $LN$. Layer level pipelined operation over patterns is possible for the recall phase on the cascaded systolic ring structure, but not for the learning phase. The forward procedure requires left shifting of each PU's activation value, and the backward procedure requires left shifting of the accumulated sum. Algorithms 9 and 10 of Figure 10 show in detail the forward and the backward procedures on the cascaded systolic ring structure, in a manner similar to the algorithms for the MPAA system.


Algorithm 9. FORWARD PROCEDURE ON SYSTOLIC

{ define macro for index }
1  #define m(x) (((x + k - 1) mod N_{l-1}) + 1)

{ forward procedure on the cascaded systolic ring }
2  procedure FW-SYSTOLIC-RING(l)
3    for k = 0 to N_{l-1} - 1 do
4      for all 1 ≤ i ≤ N_l do
5        parbegin
6          • PU_i[l] reads w_{i m(i)}[l];
7          • PU_i[l] computes x_{m(i)}[l-1] w_{i m(i)}[l];
8          • PU_i[l] computes h_i[l] += w_{i m(i)}[l] x_{m(i)}[l-1];
9          • PU_j[l] sends left x_{m(j)}[l-1] to PU_{((j-2) mod N_{l-1})+1}[l];
10       parend  {line 5 parbegin}
11     endfor  {line 3 for}
12   for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
13     parbegin
14       • PU_i[l] computes x_i[l] = f(h_i[l]);
15       • PU_i[l] sends x_i[l] to PU_i[l+1];
16       • if learning phase then
17           PU_i[l] sends h_i[l] to PU_i[l+1];
18     parend  {line 13 parbegin}
19 endprocedure  {line 2 procedure}

Algorithm 10. BACKWARD PROCEDURE ON SYSTOLIC

{ define macro for index }
1  #define m(x) (((x + k - 1) mod N_{l-1}) + 1)

{ backward procedure on the cascaded systolic ring }
2  procedure BW-SYSTOLIC-RING(l)
3    for k = 0 to N_{l-1} - 1 do  {for BW}
4      for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
5        parbegin
6          • PU_i[l] reads w_{i m(i)}[l];
7          • PU_i[l] computes δ_i[l] w_{i m(i)}[l];
8          • PU_j[l] computes d_{m(i)}[l] += δ_i[l] w_{i m(i)}[l];
9          • PU_j[l] sends left d_{m(j)}[l] to PU_{((j-2) mod N_{l-1})+1}[l];
10       parend  {line 5 parbegin}
11     endfor  {line 3 for}
12   for k = 0 to N_{l-1} - 1 do  {for updates}
13     for all 1 ≤ i ≤ N_l, 1 ≤ j ≤ N_{l-1} do
14       parbegin
15         • PU_i[l] computes η δ_i[l];
16         • PU_i[l] computes Δw_{i m(i)}[l] = η δ_i[l] x_{m(i)}[l-1];
17         • PU_j[l] sends left x_{m(j)}[l-1] to PU_{((j-2) mod N_{l-1})+1}[l];
18         • PU_i[l] reads w_{i m(i)}[l];
19         • PU_i[l] computes w_{i m(i)}[l] += Δw_{i m(i)}[l];
20         • PU_i[l] writes w_{i m(i)}[l];
21       parend  {line 14 parbegin}
22     endfor  {line 12 for}
23   for all 1 ≤ j ≤ N_{l-1} do  {for BW}
24     parbegin
25       • PU_j[l] computes f'(h_j[l-1]);
26       • PU_j[l] computes δ_j[l-1] = f'(h_j[l-1]) d_j[l];
27       • PU_j[l] sends δ_j[l-1] to PU_j[l-1];
28     parend  {line 24 parbegin}
29 endprocedure  {line 2 procedure}

Figure 10: The forward and the backward procedures on the cascaded systolic ring structure (Kung & Hwang, 1989).


Table 1: Comparison with other mapping schemes.

| architecture & algorithms | no. of processors | level of parallelism | comp. steps, one-pattern recall | comp. steps, p-pattern recall (pipelined?) | comp. steps, one-pattern learning | comp. steps, p-pattern learning (pipelined?) |
|---|---|---|---|---|---|---|
| 2-D SIMD | (N+1)^2 = O(N^2) | weight | L(3N+3) = O(N) | NO: pL(3N+3) = O(pN) | 6LN + 17L + 3 = O(N) | NO: p(6LN + 17L + 3) = O(pN) |
| 1-D SIMD | N = O(N) | neuron | L(5N+1) = O(N) | NO: pL(5N+1) = O(pN) | LN log2 N + 15LN + 6L + 3 = O(N log2 N) | NO: p(LN log2 N + 15LN + 6L + 3) = O(pN log2 N) |
| Systolic Ring | LN = O(N) | neuron, layer | L(4N+2) = O(N) | YES: p(4N+2) + (L-1)(4N+2) = O(pN) | 14LN + 6L + 3 = O(N) | NO: p(14LN + 6L + 3) = O(pN) |
| Hypercube MIMD | 4N^2 = O(N^2) | weight, layer | 2L(log2 N + 2) = O(log2 N) | YES: (2L(log2 N + 2) + 1 + 2 log2 N) floor(p / (2 log2 N + 2)) + 2L(log2 N + 2) + (p mod (2 log2 N + 2)) - 1 = O(p log2 N) | 4L(log2 N + 3) = O(log2 N) | NO: 4pL(log2 N + 3) = O(p log2 N) |
| MPAA | LN = O(N) | neuron, layer | L(3N+2) = O(N) | YES: p(3N+2) + (L-1)(3N+2) = O(pN) | 12LN + 6L + 3 = O(N) | YES: p(12N+9) + (2L-1)(12N+9) = O(pN) |

Malluhi, Bayoumi, and Rao (1995) proposed an algorithmic mapping technique that implements the multilayer perceptron with backpropagation learning and the Hopfield ANN model on the mesh-appendixed tree (MAT) structure, which is then embedded into the hypercube MIMD architecture, considering both weight and layer level parallelism. It achieves the optimal number of computation steps, $O(\log_2 N)$, for both the recall and the learning phases. However, those step counts are obtained at the expense of $4N^2$ MIMD processors. Because the maximum number of patterns that can be concurrently placed in the pipeline is $2\log_2 N + 2$, pooled weight updates are required in the pipelined mode for consecutive patterns. Also, it supports pattern pipelining only for the recall phase.

The numbers of computation steps of the above architectures with their corresponding mapping algorithms can be obtained in a manner similar to that for the mapping algorithms on the MPAA system, and are shown in Table 1.

4.2 Performance Comparison

Various architectures for ANNs with their corresponding algorithms, including the MPAA system, are compared in terms of the cost as well as the number of computation steps. The cost of solving an ANN problem on a specific parallel machine is defined as the product of the number of computation steps and the number of processors used (Kumar et al., 1994). The cost reflects the sum of the numbers of computation steps over all processors, where each number stands for


computation steps that each processor spends solving the problem. Table 1 shows the number

of computation steps needed by the recall phase to process a single pattern (the one-pattern recall phase) and, analogously, by the p-pattern recall phase, the one-pattern learning phase, and the p-pattern learning phase, for the various architectures and algorithms described in Sections 3.2 and 4.1. The parameters are given representative values of N = 64, L = 3, and t = 1000; varying them changes the number of processors, the number of layers, and the number of training patterns accordingly. Figures 11 and 12 compare the number of computation steps and the cost, respectively, of the various architectures and algorithms according to Table 1.
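To make the comparison reproducible, the Table 1 entries can be evaluated directly. The following is a minimal sketch in Python; the function names are ours, and the encoding of the hypercube's pipelined-recall expression is our best reading of the corresponding table entry.

import math

def processors(arch, N, L):
    """Number of processors used by each scheme (Table 1)."""
    return {"2d_simd": (N + 1) ** 2, "1d_simd": N, "systolic": L * N,
            "hypercube": 4 * N ** 2, "mpaa": L * N}[arch]

def steps_recall(arch, N, L, p):
    """Computation steps for the p-pattern recall phase (Table 1)."""
    if arch == "2d_simd":
        return p * L * (3 * N + 3)
    if arch == "1d_simd":
        return p * L * (5 * N + 1)
    if arch == "systolic":                 # layer level pipelining
        return (p + L - 1) * (4 * N + 2)
    if arch == "hypercube":                # pipeline depth 2*log2(N) + 2
        n, depth = math.log2(N), 2 * math.log2(N) + 2
        return ((2 * L * (n + 2) + 1 + 2 * n) * (p // depth)
                + 2 * L * (n + 2) + p % depth - 1)
    if arch == "mpaa":                     # layer level pipelining
        return (p + L - 1) * (3 * N + 2)
    raise ValueError(arch)

def steps_learning(arch, N, L, p):
    """Computation steps for the p-pattern learning phase (Table 1)."""
    if arch == "2d_simd":
        return p * (6 * L * N + 17 * L + 3)
    if arch == "1d_simd":
        return p * (L * N * math.log2(N) + 15 * L * N + 6 * L + 3)
    if arch == "systolic":
        return p * (14 * L * N + 6 * L + 3)
    if arch == "hypercube":
        return 4 * p * L * (math.log2(N) + 3)
    if arch == "mpaa":                     # the only scheme pipelined here
        return (p + 2 * L - 1) * (12 * N + 9)
    raise ValueError(arch)

def cost(steps_fn, arch, N, L, p):
    # Cost = computation steps x processors used (Kumar et al., 1994a).
    return steps_fn(arch, N, L, p) * processors(arch, N, L)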

As shown in Figure 11, the hypercube (Malluhi, Bayoumi, & Rao, 1995) clearly outperforms the other schemes in terms of the number of computation steps. However, the same is not true of the cost, as shown in Figure 12; the hypercube scheme is therefore not cost–effective. The number of computation steps of the one–dimensional SIMD array (Svensson & Nordstrom, 1990) is larger than that of the other schemes, while the MPAA system needs the fewest computation steps among the SIMD configurations. In terms of both cost and performance, the MPAA system is the best among all the system types illustrated.

Figures 11 (b) and (e) show that the number of computation steps of the MPAA system hardly increases as the number of layers increases. This is because the MPAA system supports layer level pipelined parallelism for both the recall and the learning phases. Because the mapping onto the cascaded systolic ring structure (Kung & Hwang, 1989) supports layer level pattern pipelining only for the recall phase, it provides good performance only for that phase, as shown in Figure 11 (b).

As shown in Figure 12, although the number of computation steps of the one–dimensional SIMD array (Svensson & Nordstrom, 1990) is larger than that of the other schemes, its cost is the third best for the recall phase and the second best for the learning phase; it is thus more cost–effective than the two–dimensional SIMD array (Singer, 1990) and the hypercube (Malluhi, Bayoumi, & Rao, 1995). The two–dimensional SIMD array (Singer, 1990), in contrast, is inferior to the other schemes in terms of cost–effectiveness. Also, although the cost of the MPAA system for the recall phase is similar to those of the other schemes (Malluhi, Bayoumi, & Rao, 1995; Kung & Hwang, 1989; Svensson & Nordstrom, 1990) excluding the two–dimensional SIMD array (Singer, 1990), its cost for the learning phase is much smaller than those of the others, because the other schemes cannot support any layer level pipelined parallelism for the learning phase.

Consequently, the cost of the MPAA is superior to those of the others, owing to the novel architecture based on memory and processor integration, which eliminates inter–PU communications and matrix transpositions. Further gains come from efficient algorithms supporting the layer level pipelining for both the recall and the learning phases. For an example ANN, a three–layer perceptron with 64 neurons at the largest layer and one thousand training patterns, the MPAA system reduces the cost achieved by any of the other architectures with their corresponding algorithms by 24.81% to 98.49%.
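As a sanity check on the quoted range, evaluating the cost sketch above at N = 64, L = 3, p = 1000 (the parameter values used in the figures) reproduces the two extremes: the smallest saving is against the systolic ring in the recall phase and the largest against the two–dimensional SIMD array, also in the recall phase. The snippet below reuses the functions defined earlier.

# Hypothetical check of the quoted 24.81%-98.49% range, reusing the
# cost() sketch above with N = 64, L = 3, p = 1000.
N, L, p = 64, 3, 1000
mpaa_r = cost(steps_recall, "mpaa", N, L, p)
mpaa_l = cost(steps_learning, "mpaa", N, L, p)
for a in ("2d_simd", "1d_simd", "systolic", "hypercube"):
    r = 100 * (1 - mpaa_r / cost(steps_recall, a, N, L, p))
    s = 100 * (1 - mpaa_l / cost(steps_learning, a, N, L, p))
    print(f"{a:10s} recall saving {r:6.2f}%   learning saving {s:6.2f}%")
# systolic recall -> 24.81%; 2d_simd recall -> 98.49%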

5 Conclusions

An effective architecture, the MPAA system, and its associated mapping algorithms for artificial neural networks have been presented in this work. The proposed MPAA system provides an efficient mechanism for matrix–by–vector operations without inter–PU communications or matrix transpositions. The proposed algorithms exploit both neuron and layer level parallelisms, and allow the layer level pipelining for both the recall and the learning phases. The asymptotic time complexities of the proposed algorithms are evaluated to verify the effectiveness of the MPAA system. The performance of the hypercube scheme is shown to be outstanding, but it is not cost–effective. The cost graphs show the MPAA system to be superior in each phase, even though the margin is not always dramatic. Consequently, it is verified that the proposed scheme achieves a relative cost improvement over the four typical parallel architectures with their corresponding algorithms.

REFERENCES


Aimoto, Y., et al. (1996). A 7.68GIPS 3.84GB/s 1W parallel image processing RAM integrating a 16Mb DRAM and 128 processors. Dig. Tech. Papers, 1996 IEEE Int'l Solid–State Circuits Conf., pp. 372–373.

El–Amawy, A. & Kulasinghe, P. (1997). Algorithmic mapping of feedforward neural networks onto multiple bus systems. IEEE Trans. Parallel and Distributed Systems, 8(2), 130–136.

Elliott, D., Snelgrove, M., & Stumm, M. (1992). Computational RAM: a memory–SIMD hybrid and its application to DSP. IEEE 1992 Custom Integrated Circuits Conf., pp. 30.6.1–30.6.4.

Ghosh, J. & Hwang, K. (1989). Mapping neural networks onto message–passing multicomputers. J. Parallel and Distributed Computing, 6, 291–330.

Gokhale, M., Holmes, B., & Iobst, K. (1995). Processing in memory: the Terasys massively parallel PIM array. IEEE Computer, 28(4), 23–31.

Inoue, K., Nakamura, H., & Kawai, H. (1995). A 10Mb frame buffer memory with Z–compare and A–blend units. IEEE J. of Solid–State Circuits, 30(12), 1563–1568.

Kogge, P.M. (1994). EXECUBE – a new architecture for scalable MPPs. Proc. Int'l Conf. Parallel Processing, vol. I, pp. 77–84.

Kumar, V., Grama, A., Gupta, A., & Karypis, G. (1994a). Introduction to Parallel Computing: Design and Analysis of Algorithms. The Benjamin/Cummings Publishing Company, Inc.

Kumar, V., Shekhar, S., & Amin, M.B. (1994b). A scalable parallel formulation of the backpropagation algorithm for hypercubes and related architectures. IEEE Trans. Parallel and Distributed Systems, 5(10), 1073–1090.

Kung, S.Y. & Hwang, J.N. (1989). A unified systolic architecture for artificial neural networks. J. Parallel and Distributed Computing, 6, 358–387.

Lin, W., Prasanna, V.K., & Przytula, K.W. (1991). Algorithmic mapping of neural network models onto parallel SIMD machines. IEEE Trans. Computers, 40(12), 1390–1401.

Malluhi, Q.M., Bayoumi, M.A., & Rao, T.R.N. (1995). Efficient mapping of ANNs on hypercube massively parallel machines. IEEE Trans. Computers, 44(6), 769–779.

Naylor, S. & Jones, S. (1994). A performance model for multilayer neural networks in linear arrays. IEEE Trans. Parallel and Distributed Systems, 5(12), 1322–1328.

Nordstrom, T. & Svensson, B. (1992). Using and designing massively parallel computers for artificial neural networks. J. Parallel and Distributed Computing, 14(3), 260–285.

Petrowski, A., Personnaz, L., Dreyfus, G., & Girault, C. (1989). Parallel implementations of neural network simulations. Hypercube and Distributed Computers, North–Holland, New York, pp. 205–218.

Shimizu, T., et al. (1996). A multimedia 32b RISC microprocessor with 16Mb DRAM. Dig. Tech. Papers, 1996 IEEE Int'l Solid–State Circuits Conf., pp. 216–217.

Singer, A. (1990). Implementations of artificial neural networks on the Connection Machine. Parallel Computing, 14(3), 305–316.

Svensson, B. & Nordstrom, T. (1990). Execution of neural network algorithms on an array of bit–serial processors. 10th Int'l Conf. Pattern Recognition, Comp. Arch. for Vision and Pattern Recognition, vol. II, Atlantic City, NJ, pp. 501–505.

Wah, B.W. & Chu, L. (1990). Efficient mapping of neural networks on multicomputers. Proc. Int'l Conf. Parallel Processing, vol. I, pp. 234–241.

Yamashita, N., et al. (1994). A 3.84 GIPS integrated memory array processor with 64 processing elements and a 2-Mb SRAM. IEEE J. of Solid-State Circuits, 29(11), 1336–1342.


[Figure 11 comprises six log-scale plots of the number of computation steps: for the recall phase, (a) versus the number of neurons at the largest layer N (L = 3, t = 1000), (b) versus the number of layers L (N = 64, t = 1000), and (c) versus the number of patterns p (N = 64, L = 3); panels (d)–(f) show the learning phase with the same parameter settings. Each panel compares the 2–D SIMD, 1–D SIMD, systolic ring, hypercube, and MPAA schemes.]

Figure 11: Comparison with other mapping schemes in terms of the number of computation steps.


[Figure 12 shows the same six panels as Figure 11 with the cost on the log-scale vertical axis: the recall phase in (a)–(c) and the learning phase in (d)–(f), varying N, L, and p respectively, for the 2–D SIMD, 1–D SIMD, systolic ring, hypercube, and MPAA schemes.]

Figure 12: Comparison with other mapping schemes in terms of the cost.