
ELSEVIER Microprocessors and Microsystems 20 (1996) 267-276


Mapping artificial neural networks to a tree shape neurocomputer

Harri Klapuri a, Timo Hämäläinen a, Jukka Saarinen b, Kimmo Kaski a

a Tampere University of Technology, Electronics Laboratory, P.O. Box 692, FIN-33101 Tampere, Finland

b Tampere University of Technology, Signal Processing Laboratory, P.O. Box 553, FIN-33101 Tampere, Finland

Received 5 January 1996; revised 20 April 1996; accepted 23 April 1996

Abstract

We present a parallel algorithm which trains artificial neural networks on the TUTNC (Tampere University of Technology NeuroComputer) parallel tree shape neurocomputer. The neurocomputer is designed for parallel computation and it is suitable for several artificial neural network models. Detailed mathematical notations are used to describe the parallel back-propagation algorithm. Performance analyses show that we can effectively utilize both broadcasting and global adding between processing units in the presented parallel implementation. By using the given algorithm it is possible to add more processing units to the system without any change in the number of communication transactions required.

Keywords: Neurocomputers; Parallel algorithm; Back-propagation

1. Background

Research groups working in the area of neural network computation are witnessing growing interest from numerous fields of industry. One of the principal reasons for this new appearance of neural systems since the 1980s is technological. Significant developments in computer technology have made neural computation available to an ever-increasing number of applications. Even today, neural networks are most often implemented on conventional computers which process data sequentially.

Neural networks can be characterized as inherently parallel computing models and for that reason they can be implemented using parallel processing techniques. A typical parallel computer architecture dedicated to neural computation consists of several processing elements, an interconnecting routing network between the elements, and a mechanism to support and control all the elements. Another interesting approach to neural computation is the use of special very large scale integration (VLSI) chips. These neural chips can be designed to perform a particular algorithm or some time-critical function of the algorithm. Most neural chips have been developed in university research projects (e.g. TInMANN

a Tel: +358-31-316 3399, Fax: +358-31-316 2620, Email: harrik@ele.tut.fi.

0141-9331/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved PII S0141-9331(96)01091-5

[1], TUTHOP [2]), but commercial neural chips are also available (e.g. ETANN [3], Ni1000 [4]). In addition, it is possible to perform neural computation by taking advantage of add-in accelerator cards hosted by personal computers or workstations. In some systems accelerator cards are used together with neural chips which perform the time-critical parts of computation. On-board memory and fast data buffering between the card and the host computer are usually necessary. Some commercial cards are available, such as ANZA Plus [5], SIGMA-1 [6], ODYSSEY [7] and MARK IV [7].

Making use of neural chips or accelerator cards in neural network computations imposes some limitations on the achievable degree of complexity and performance. Large networks with numerous connections consume a lot of memory. Moreover, the training phase of a neural network algorithm can require substantial computational capacity and take a great deal of time if efficient hardware is not available. To overcome this, specialized neural computer systems, also referred to as neurocomputers, have been introduced. A massively or highly parallel architecture is a common characteristic of them.

Neurocomputers can be classified into two categories according to their intended use. Special-purpose neurocomputers are aimed at a more precisely specified application than their opposites, which are called general-purpose neurocomputers. Special-purpose computers are not designed with versatility in mind, so they



usually have hardware-level support for only one or a small number of related neural network algorithms. This results in a computing system which is not easily software configurable when necessary. General-purpose neurocomputers are fully programmable and thus well suited for a large number of neural network algorithms.

A variety of neurocomputer systems have been introduced, most of them created in university research projects (e.g. BSP400 [8], MUSIC [9]). Recently, field programmable gate array (FPGA) technology has been applied widely to design general-purpose neurocomputers, such as REMAP3 [10]. As commercial neural computing applications have increased along with theoretical foundations, some companies have also produced powerful neurocomputers (e.g. CNAPS [11], SYNAPSE-1 [12]).

In this paper we discuss parallel realizations of the multilayer perceptron model. Multilayer perceptrons consist of artificial neurons which are simple information processing units. Every neuron is connected to a set of input signals and each connection is characterized by a weight. To obtain the output of a neuron, its input signals are multiplied by the corresponding weights, after which these weighted input signals are summed up in order to get the activation signal. Thereafter an activation function is applied to the activation signal. Thus, the artificial neuron is expressed by the equation

$$o = g\left(\sum_{i} w_i x_i\right) \qquad (1)$$

Nonlinear and differentiable activation functions are commonly used. The sigmoidal nonlinearity is defined by

$$g(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$
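As a minimal illustration of Eqs. (1) and (2), the artificial neuron can be written in a few lines of C; this is an explanatory sketch, not code from the TUTNC implementation.

```c
#include <math.h>

/* Sigmoidal activation function of Eq. (2). */
static double sigmoid(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

/* Output of a single artificial neuron, Eq. (1): the n input signals are
 * weighted, summed into the activation signal and passed through the
 * activation function. */
static double neuron_output(const double *w, const double *x, int n)
{
    double net = 0.0;
    for (int i = 0; i < n; i++)
        net += w[i] * x[i];
    return sigmoid(net);
}
```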

A multilayer perceptron is constructed by forming a layered network structure as shown in Fig. 1. A perceptron


Fig. 1. Multilayer perceptrons usually consist of a hidden layer and an output layer. There are no connections between neurons belonging to the same layer but adjacent layers are fully connected. After feeding an input pattern to the input nodes, information propagates through the perceptron and finally a result can be obtained from the output signals.


must be trained by appropriately adjusting its weights before it can be utilized. Multilayer perceptrons can be trained to form an arbitrarily close approximation to any nonlinear function or decision boundary which makes them suitable for a variety of applications [13].

We use back-propagation to train multilayer perceptrons on a general-purpose neurocomputer. The hardware configuration used is introduced prior to presenting methods which parallelize multilayer perceptrons for execution on the parallel neurocomputer. Moreover, we analyse some performance measurements in order to determine how well the neurocomputer lends itself to neural network simulations. The performance tests are followed by conclusions.

2. TUTNC neurocomputer

TUTNC (Tampere University of Technology NeuroComputer) is a general-purpose neurocomputer developed in the Electronics Laboratory at Tampere University of Technology. The motivation for the designers of the neurocomputer was to construct a flexible parallel computer system with a good cost/performance ratio. This was achieved by using low-cost, yet efficient, commercially available components [14].

The architecture of the system consists of processing units interconnected by a tree shape network of communication units and a single interface unit as shown in Fig. 2. To guarantee simple implementation all the processing units are identical. The same applies to the communication units which are the interconnecting nodes in the trunk of the communication network. The current prototype has four processing units and three communication units. More information on the neurocomputer is given in Appendix A.


Fig. 2. The TUTNC neurocomputer is based on a tree shape architecture. An interface unit (IFC) connects the neurocomputer to a host computer. The system can be expanded by adding more processing units (PU) and communication units (CU) to the communication network. The only limitation is that the number of processing units must be a power of two.



2.1. Parallel programming

The comparison of distributed and shared memory computers has been one of the most interesting fields of research in massively parallel processing in the past years. Although shared memory computers are usually easy to program and compiler support is extensive, their performance can only be increased to a certain limit by adding more processors. On the other hand, a distributed memory machine is scalable to a very large number of processors since the memory bandwidth of the system can be improved by expanding the global communication network. From the point of view of the programmer, however, distributed memory computers can be difficult to use if code and data must be explicitly distributed among the processors under the programmer's control.

The message passing programming model has been widely used on certain classes of parallel computers, especially those with distributed memory [15]. However, we consider data parallelism as the basic paradigm for software development on our neurocomputer. In the data parallel programming model a single program runs on all processors. Parallel data structures produced by the processors are automatically communicated since computation and communication alternate with implicit synchronization.

A neural network algorithm implementation for our neurocomputer consists of two components: the actual algorithm software running on all the processing units and the control software running on the host computer. Neural algorithms were developed with Texas Instruments' TMS320C2X/C5X Optimizing C Compiler [16]. Moreover, a testing environment and a graphical user interface were developed on top of the Windows operating system. The user interface has a significant role in the overall system; it provides the users with an easy-to-use and consistent graphical interface for network initialization and visualization of results. Host software development was carried out using Borland C++ v4 [17].

3. Parallel back-propagation considerations for neurocomputer training

The back-propagation algorithm is a highly popular algorithm for training multilayer perceptrons. The algorithm was presented by Werbos in 1974 [18], but it was not widely known before the publication of the book by Rumelhart and McClelland in 1986 [19]. Back-propagation is an iterative gradient descent algorithm based on the least squares method [20,21]. The training algorithm consists of two passes through the layers of the network. During the forward pass an input pattern is applied to the input nodes of the network and propagated layer by layer, producing an output

pattern as a response of the network. In the backward pass the obtained pattern is subtracted from the corresponding desired output pattern (or target) to produce an error term. This error term is then propagated backwards through the network and the weights are adjusted in order to move the output pattern closer to the target. The back-propagation (or sequential back-propagation, as we refer to it in this paper) algorithm is further illustrated in Fig. 3. See Appendix B for a description of the notations used in the algorithm.
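The following is a compact C sketch of the sequential algorithm of Fig. 3 for a 1-hidden-layer perceptron with sigmoidal activations. The network dimensions chosen with the #define constants, the absence of bias terms and the pattern-mode update are assumptions made for illustration; this is not the authors' DSP code.

```c
#include <math.h>

#define X_IN  2    /* number of input nodes      (illustrative) */
#define H_HID 4    /* number of hidden neurons   (illustrative) */
#define Y_OUT 1    /* number of output neurons   (illustrative) */

static double sigmoid(double v) { return 1.0 / (1.0 + exp(-v)); }

/* Weights and previous weight changes (momentum terms). */
static double w1[H_HID][X_IN], dw1[H_HID][X_IN];   /* hidden layer */
static double w2[Y_OUT][H_HID], dw2[Y_OUT][H_HID]; /* output layer */

/* One sequential back-propagation cycle in pattern mode:
 * forward pass, error terms, weight update with momentum. */
static double train_sample(const double x[X_IN], const double t[Y_OUT],
                           double eta, double alpha)
{
    double h[H_HID], y[Y_OUT], e2[Y_OUT], e1[H_HID], E = 0.0;

    /* Forward pass: hidden and output layer outputs. */
    for (int j = 0; j < H_HID; j++) {
        double net = 0.0;
        for (int i = 0; i < X_IN; i++) net += w1[j][i] * x[i];
        h[j] = sigmoid(net);
    }
    for (int j = 0; j < Y_OUT; j++) {
        double net = 0.0;
        for (int i = 0; i < H_HID; i++) net += w2[j][i] * h[i];
        y[j] = sigmoid(net);
        e2[j] = (t[j] - y[j]) * y[j] * (1.0 - y[j]);  /* output error term */
        E += (t[j] - y[j]) * (t[j] - y[j]);
    }

    /* Error terms of the hidden neurons. */
    for (int j = 0; j < H_HID; j++) {
        double s = 0.0;
        for (int i = 0; i < Y_OUT; i++) s += e2[i] * w2[i][j];
        e1[j] = h[j] * (1.0 - h[j]) * s;
    }

    /* Weight updates with a momentum parameter. */
    for (int j = 0; j < Y_OUT; j++)
        for (int i = 0; i < H_HID; i++) {
            dw2[j][i] = alpha * dw2[j][i] + eta * e2[j] * h[i];
            w2[j][i] += dw2[j][i];
        }
    for (int j = 0; j < H_HID; j++)
        for (int i = 0; i < X_IN; i++) {
            dw1[j][i] = alpha * dw1[j][i] + eta * e1[j] * x[i];
            w1[j][i] += dw1[j][i];
        }
    return E;   /* sum of squared errors for this training sample */
}
```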

Our neurocomputer architecture supports several ways of implementing the back-propagation training algorithm in a parallel fashion. The constraints imposed by this particular algorithm must be evaluated with care to fully utilize available hardware resources [22]. For that reason the levels of parallelism applicable to the back-propagation algorithm are discussed here.

It is evident after examining the artificial neuron model that neural algorithms are founded on relatively simple subcomputations, such as additions and multiplications. Unfolding the computations involved in a neural algorithm reveals the different ways of parallelism. The structure of the back-propagation algorithm can be reviewed using a loop-nest as follows:

For each training sample of the training set,
    for each layer of the network,
        for each neuron in the layer,
            for each weight of the neuron,
                for each bit of the weight,
                    perform subcomputations.

This approach leads to the conclusion that it is possible to implement algorithm parallelism at five different levels: training sample parallelism, layer parallelism, neuron parallelism, weight parallelism and bit parallelism [10].

3.1. Layer parallelism

In the back-propagation algorithm a training sample passes the network layer by layer, first in the forward pass and then again in the backward pass. Layer parallelism is achieved if every processing unit assumes responsibility for the computations needed to process one layer of the network. This organization makes it possible to pipeline the layers for several training samples to pass through the network at the same time. There is, however, a major drawback involved in the described method since it implies that weight updating is performed after presenting all the training samples. This is called the batch mode of back-propagation learning [20,21].

In the pattern mode of learning weight updating is performed after the presentation of each training sample. Even though the back-propagation algorithm is theoretically based on batch learning, pattern mode usually



1. Pick an input vector x and its target t from the training set.

2. Compute the outputs of the neurons, starting from the first hidden layer. The output of the jth neuron in layer l is given by $net_{lj} = \sum_{i=1}^{N_{l-1}} w_{lji} o_{l-1,i}$ and $o_{lj} = g(net_{lj})$.

3. Compute the error terms for the output neurons: $e_{Lj} = (t_j - o_{Lj}) g'(net_{Lj})$. If the sigmoidal nonlinearity is used we obtain $e_{Lj} = (t_j - o_{Lj}) o_{Lj}(1 - o_{Lj})$.

4. Compute the error terms for the hidden neurons, starting from the last hidden layer. If the sigmoidal nonlinearity is used the error term for the jth hidden neuron in layer l is $e_{lj} = o_{lj}(1 - o_{lj}) \sum_{i=1}^{N_{l+1}} e_{l+1,i} w_{l+1,ij}$.

5. Update the connection weights, starting from the output layer. A momentum parameter is used to speed up learning. The ith weight of the jth neuron in layer l is updated as shown by $\Delta w_{lji} = \alpha \Delta_p w_{lji} + \eta e_{lj} o_{l-1,i}$.

Fig. 3. The sequential back-propagation training algorithm with a momentum parameter [20] is shown. When computing the weight-change values a fraction of the previous change is added to keep the weight changes going in the same direction. This prevents the learning process from terminating in a local minimum on the error surface.


converges noticeably faster. Insisting on the pattern mode effectively rejects the use of layer parallelism because a training sample cannot be fed to the network before the previous sample has completed its backward pass.

3.2. Neuron parallelism

Neuron parallelism is the plainest way to implement the back-propagation algorithm on parallel hardware. Unlike layer parallelism which attempts to parallelize the algorithm as a whole, neuron parallelism concentrates on one layer at a time. No pipelining is involved, i.e. network layers are processed sequentially. Partitioning of layers for P processing units is illustrated in Fig. 4.


Fig. 4. In neuron parallelism neurons are organized into subsets and every processing unit (PU) processes its own subset of neurons. The weights of an individual neuron are stored in the corresponding processing unit while each processing unit has a local copy of the entire training set.
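A small C sketch of one possible partitioning of the H neurons of a layer over P processing units follows. The even block split is an illustrative assumption; the paper does not fix the exact mapping.

```c
/* One possible block partitioning of the H neurons of a layer over P
 * processing units: PU p stores neurons first .. first + count - 1
 * together with their weights, so H^p = count. The even split used here
 * is an assumption made for illustration. */
typedef struct {
    int first;   /* global index of the first locally stored neuron */
    int count;   /* H^p: number of neurons stored in this PU        */
} partition_t;

partition_t partition_neurons(int H, int P, int p)
{
    partition_t part;
    int base  = H / P;   /* every PU gets at least this many neurons */
    int extra = H % P;   /* the first 'extra' PUs get one more       */

    part.count = base + (p < extra ? 1 : 0);
    part.first = p * base + (p < extra ? p : extra);
    return part;
}
```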

3.3. Weight parallelism

Weight parallelism is achieved by parallelizing the computation of the weighted inputs to a particular




Fig. 5. In weight parallelism the computation of the weighted inputs to a particular neuron is done in parallel. Every processing unit performs the weighting multiplication needed and sends the product to the host computer via the communication network. This network sums up the downwards going data words and the host computer applies the activation function to the activation signal it receives.

neuron. Weight parallelism is illustrated in Fig. 5. Although elegant and certainly operative, weight parallelism was not chosen to be realized because of the suspected degradation in performance caused by the large amount of communication overhead between the host computer and the processing units.

4. Neuron-parallel algorithm

The back-propagation algorithm makes use of nonlinear activation functions, such as the sigmoidal nonlinearity. In order to avoid floating point computations nonlinear functions are often approximated by piecewise-linear functions. The most efficient method, however, is to create a lookup table containing precalculated activation function values. The table is accessed during algorithm execution whenever the activation function is to be applied.
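A minimal sketch of such a precalculated lookup table is given below. The table size and the clipped input range are assumptions for illustration, and the table is filled with double values for clarity; on the 16-bit fixed-point DSP the entries would be stored in a fixed-point format.

```c
#include <math.h>

#define LUT_SIZE  1024    /* number of precalculated entries (assumed)     */
#define LUT_RANGE 8.0     /* activation signals clipped to [-8.0, 8.0)     */

static double sigmoid_lut[LUT_SIZE];

/* Precalculate the activation function values once, before training. */
void init_sigmoid_lut(void)
{
    for (int k = 0; k < LUT_SIZE; k++) {
        double x = -LUT_RANGE + (2.0 * LUT_RANGE * k) / LUT_SIZE;
        sigmoid_lut[k] = 1.0 / (1.0 + exp(-x));
    }
}

/* During algorithm execution a table lookup replaces the exponential;
 * activation signals outside the table range saturate at the ends. */
double sigmoid_lookup(double net)
{
    int k = (int)((net + LUT_RANGE) * (LUT_SIZE / (2.0 * LUT_RANGE)));
    if (k < 0) k = 0;
    if (k >= LUT_SIZE) k = LUT_SIZE - 1;
    return sigmoid_lut[k];
}
```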

We present a powerful neuron-parallel back-propagation algorithm for our neurocomputer. The principle of the algorithm was introduced by Kotilainen et al. [23]. The forward pass of the parallel algorithm for a network with one hidden layer is shown in Fig. 6. The backward pass is shown in Fig. 7. The notations used in the algorithm are explained in Appendix B.

1. The host picks an input vector x and its target t from the training set.

2. In parallel, each processing unit p computes the outputs of its local hidden neurons: for $j = 1$ to $H^p$, $net^p_{1j} = \sum_{i=1}^{X} w^p_{1ji} x_i$ and $h^p_j = g(net^p_{1j})$.

3. The host reads the local hidden outputs from the processing units to form the total hidden layer output vector $[h^1_1, \ldots, h^1_{H^1}, \ldots, h^P_1, \ldots, h^P_{H^P}] = [h_1, \ldots, h_H]$ and broadcasts h to all processing units.

4. In parallel, each processing unit p computes the outputs of its local output neurons: for $j = 1$ to $Y^p$, $net^p_{2j} = \sum_{i=1}^{H} w^p_{2ji} h_i$ and $y^p_j = g(net^p_{2j})$.

5. In parallel, each processing unit p computes its partial sum of squared errors $E^p_s = \sum_{j=1}^{Y^p} (t^p_j - y^p_j)^2$; the communication network adds the partial sums and the host receives the total error for the training sample.

Fig. 6. The forward pass of the parallel back-propagation algorithm for a total of P processing units is shown. Dashed rectangles are used to indicate the steps in which the processing units execute in parallel. In the second step the outputs of the hidden neurons are computed. After that in the third step the host reads the hidden layer outputs from the processing units to form the total hidden layer output vector h. This vector must be broadcast to all the processing units since it is fed as input to the output layer in step four. In the fifth step the processing units compute in parallel the sums of the squared errors of the output neurons. The host reads these partial sums and receives the total sum for this training sample by commanding the communication network to sum up the words. The host determines if the algorithm is continued when the sum of squared errors of the entire training set has been obtained.
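The per-processing-unit view of this forward pass can be sketched in C as follows. The communication calls send_to_host() and receive_broadcast(), the MAX_X/MAX_H bounds and the use of the lookup table from Section 4 are placeholders and assumptions for illustration, not the TUTNC library API.

```c
#define MAX_X 64
#define MAX_H 128

double sigmoid_lookup(double net);              /* lookup table, Section 4 */
void   send_to_host(const double *buf, int n);  /* hypothetical placeholder */
void   receive_broadcast(double *buf, int n);   /* hypothetical placeholder */

/* Forward pass of the neuron-parallel algorithm as seen by one processing
 * unit (cf. Fig. 6). Hp and Yp are the numbers of hidden and output neurons
 * stored locally; X and H are the total input and hidden layer sizes. */
double forward_pass_pu(const double *x, const double *t,
                       double w1[][MAX_X], double w2[][MAX_H],
                       double *h_local, double *h, double *y_local,
                       int X, int H, int Hp, int Yp)
{
    /* Step 2: outputs of the local hidden neurons. */
    for (int j = 0; j < Hp; j++) {
        double net = 0.0;
        for (int i = 0; i < X; i++) net += w1[j][i] * x[i];
        h_local[j] = sigmoid_lookup(net);
    }

    /* Step 3: host gathers h_local from every PU into h and broadcasts h. */
    send_to_host(h_local, Hp);
    receive_broadcast(h, H);

    /* Step 4: outputs of the local output neurons. */
    double E_partial = 0.0;
    for (int j = 0; j < Yp; j++) {
        double net = 0.0;
        for (int i = 0; i < H; i++) net += w2[j][i] * h[i];
        y_local[j] = sigmoid_lookup(net);
        /* Step 5: partial sum of squared errors; the tree network adds the
         * partial sums on their way to the host. */
        E_partial += (t[j] - y_local[j]) * (t[j] - y_local[j]);
    }
    send_to_host(&E_partial, 1);
    return E_partial;
}
```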

5. Performance

The most serious bottleneck in a parallel computer architecture arises when the ratio of the nonparallelizable sequential overhead to the parallelizable part of an algorithm becomes too large. The situations which require sequential processing usually arise when the processing units need to intercommunicate with each other. Although a conventional global bus can provide efficient broadcast operations it cannot realize powerful network adders. In our neurocomputer architecture this problem is partially overcome by employing a tree shape communication network with built-in global adding functions. Several research groups working on neural network

hardware implementations have observed the need for a specialized communication network with a built-in global adder (one example is presented by Müller et al. [9]). If specialized hardware is not available a method suggested by Yoon et al. [24] has been used to program back-propagation on distributed memory computers. In this method excessive communication is avoided by storing redundant copies of weights in processing elements. The cost for this arrangement is that the modified back-propagation algorithm involves some additional computation and requires more memory. Next we consider the performance of the presented parallel back-propagation algorithm in terms of its time complexity. Also, some measured performance ratings are shown.



1. In parallel, each processing unit p computes the error terms of its local output neurons: for $j = 1$ to $Y^p$, $e^p_{2j} = (t^p_j - y^p_j) y^p_j (1 - y^p_j)$.

2. For the jth hidden neuron of the network, each processing unit p computes in parallel the partial sum $sum^p = \sum_{i=1}^{Y^p} e^p_{2i} w^p_{2ij}$.

3. The communication network forms $sum = \sum_{p=1}^{P} sum^p$ and the host writes it to the processing unit $PU^x$ in which the jth hidden neuron resides.

4. $PU^x$ obtains the error term of its kth local hidden neuron as $e^x_{1k} = sum \cdot h_j (1 - h_j)$. Steps 2-4 are repeated for $j = 1, \ldots, H$.

5. In parallel, each processing unit p updates its output layer weights: for $j = 1$ to $Y^p$ and $i = 1$ to $H$, $\Delta_p w^p_{2ji} = \alpha \Delta_p w^p_{2ji} + \eta e^p_{2j} h_i$ and $w^p_{2ji} = w^p_{2ji} + \Delta_p w^p_{2ji}$.

6. In parallel, each processing unit p updates its hidden layer weights: for $j = 1$ to $H^p$ and $i = 1$ to $X$, $\Delta_p w^p_{1ji} = \alpha \Delta_p w^p_{1ji} + \eta e^p_{1j} x_i$ and $w^p_{1ji} = w^p_{1ji} + \Delta_p w^p_{1ji}$.

Fig. 7. The backward pass of the parallel back-propagation algorithm is shown. In the first step the processing units compute in parallel the error terms for the neurons in the output layer. In the second step the implementation of the algorithm gets more complex, since every error term in the hidden layer depends on all the error terms in the output layer. Neuron parallelism now has to be abandoned because the output layer is partitioned between the processing units. Instead, a weight-parallel approach is exploited. Steps two, three, and four are iterated H times. During each iteration the partial sums shown in step two are first computed in parallel. After that the processing units send these partial sums to the host via the communication network. The communication network sums up the downwards going data words and the host computer receives the complete sum. The host writes the sum to the processing unit denoted by $PU^x$ in which the jth hidden neuron, in terms of the entire network, resides. Assuming that the jth hidden neuron is the kth local hidden neuron in $PU^x$, the error term can be obtained as shown in step four. It should be noted that the other processing units are idle while $PU^x$ performs the fourth step. After the weight-parallel section the processing units update in parallel the weights of the output and hidden neurons in steps five and six, respectively.
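The weight-parallel section of the backward pass (steps 2-4 of Fig. 7) can be sketched from the point of view of one processing unit as below. As before, the communication calls and the MAX_H bound are hypothetical placeholders, not the actual TUTNC software interface.

```c
#define MAX_H 128

void send_to_host(const double *buf, int n);     /* hypothetical placeholder */
void receive_from_host(double *buf, int n);      /* hypothetical placeholder */

/* Hidden layer error terms, weight-parallel section (steps 2-4 of Fig. 7).
 * For every hidden neuron j of the network the PUs compute partial sums of
 * e2*w2, the tree network adds them, and the host writes the complete sum
 * back to the PU that owns neuron j. my_first/my_count describe the block
 * of hidden neurons stored in this PU. */
void hidden_error_terms_pu(double w2[][MAX_H], const double *e2_local,
                           const double *h, double *e1_local,
                           int H, int Yp, int my_first, int my_count)
{
    for (int j = 0; j < H; j++) {
        /* Step 2: partial sum over the output neurons stored locally. */
        double partial = 0.0;
        for (int i = 0; i < Yp; i++) partial += e2_local[i] * w2[i][j];

        /* Step 3: the tree network adds the partial sums; the host writes
         * the complete sum to the owner of hidden neuron j. */
        send_to_host(&partial, 1);

        if (j >= my_first && j < my_first + my_count) {
            double sum;
            receive_from_host(&sum, 1);
            /* Step 4: error term of the kth local hidden neuron. */
            int k = j - my_first;
            e1_local[k] = sum * h[j] * (1.0 - h[j]);
        }
    }
}
```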

5.1. Time complexity

A stepwise performance analysis of the parallel back-propagation algorithm is shown in Table 1. In practice we usually have $Y = O(X)$, in which case the time usage of the algorithm is upper-bounded by $O(XH/P + H \log P)$.
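For clarity, the bound can be obtained by summing the per-step terms of Table 1; the grouping below is ours.

```latex
T = O\!\left(\tfrac{XH}{P}\right) + O(H\log P) + O\!\left(\tfrac{HY}{P}\right)
  + O\!\left(\tfrac{Y}{P}+\log P\right) + O\!\left(H\!\left(\tfrac{Y}{P}+\log P\right)\right)
  = O\!\left(\tfrac{XH}{P} + \tfrac{HY}{P} + H\log P\right)
```

With $Y = O(X)$ the middle term is absorbed into the first, giving $O(XH/P + H\log P)$.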

In order to find out how much we benefit from built-in broadcasting and global adding in the shown parallel

back-propagation algorithm we compare four different parallel architectures. Architecture A has neither broadcasting nor global adding capability. This could be a conventional multiprocessor based on a global bus with one-to-one communication facilities only. Architecture B has broadcasting but no global adding capability. The majority of existing parallel computers are like this. Architecture C has no broadcasting but does have global adding capability. A computer of this kind might be based on a tree shape network with communication units that do not enable broadcasting. Architecture D is comparable to our neurocomputer architecture since it is capable of both broadcasting and global adding. The results of this comparison are shown in Table 2 which shows the number of communication network transactions in a back-propagation training cycle for these four parallel architectures.

Table 2 implies that it is more advantageous to be able to add globally than to broadcast if the number of communication network transactions is to be cut down to a minimum in the parallel back-propagation algorithm. The difference, however, is likely to be rather insignificant. Consider, for instance, a parallel system with 16 processing units training a multilayer perceptron of 30 hidden neurons using the given parallel algorithm. If the system complies with Architecture A the number of communication transactions in a single training cycle is 1156. The corresponding figures for Architectures B, C, and D are 586, 571, and 121, respectively.

It is beyond dispute that we can effectively utilize both broadcasting and global adding in the presented parallel back-propagation algorithm. We find it particularly important that the number of communication transactions depends only on the hidden layer size. Thereby it is possible to add more processing units to the presented neurocomputer with no change in the number of communication transactions required. This property does not hold for any of the other discussed model architectures.
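The scalability property can be read directly from the closed forms of Table 2; the substitution below is added only for illustration.

```latex
T_D = 4H + 1 \;\Longrightarrow\; T_D\big|_{H=30} = 121 \quad \text{for every } P,
```

whereas $T_A = 2H(P+1)+P$, $T_B = H(P+3)+P$ and $T_C = H(P+3)+1$ all grow with the number of processing units.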

5.2. Performance measurements

The performance of the implemented parallel back-propagation algorithm was measured to obtain specific information about the competence of our neurocomputer system. A 1-hidden-layer multilayer perceptron was trained with two different problems. The first problem involves pattern classification on nonlinearly separable pattern classes. The second problem uses the letters of the alphabet as training data. The letters were represented as seven-by-nine matrices with binary-valued components.

In order to compare different neural network models and hardware implementations some standard measures are needed. Dedicated performance measures for neural networks have been established, two of which are used here [10]. It should be noted that the results obtained using



Table 1 Time complexity of the parallel back-propagation algorithm is shown. FW and BW refer to the forward and backward passes, respectively. X, Y and H refer to the number of network inputs, outputs and hidden neurons, respectively. P is the number of processing units

Step    Time complexity       Description

FW2     O(XH/P)               Computation of weighted inputs to hidden neurons takes O(XH/P) time. Application of an activation function to the weighted inputs is done in O(H/P) time, which gives the result.

FW3     O(H log P)            Host reads H hidden layer outputs and each read operation takes O(log P) time. Subsequently, host broadcasts the outputs in O(H log P) time.

FW4     O(HY/P)               Computation of weighted inputs to output neurons takes O(HY/P) time. Application of an activation function to the weighted inputs is done in O(Y/P) time, which gives the result.

FW5     O(Y/P + log P)        Processing units compute squared errors in O(Y/P) time. After that, the total error is obtained in O(log P) time.

BW1     O(Y/P)                Computation of Y error terms in P processors takes O(Y/P) time.

BW2-4   O(H(Y/P + log P))     Processing units compute the partial sums in O(Y/P) time. Secondly, host obtains the complete sum in O(log P) time and writes it, also in O(log P) time. Lastly, an error term is obtained in O(1) time. This computation is repeated H times.

BW5     O(HY/P)               There are HY output layer weights which are updated in P processors.

BW6     O(XH/P)               There are XH hidden layer weights which are updated in P processors.

these measures are not independent of network model and size, which should always be included as additional information.

The connection updates per second (CUPS) measure evaluates the performance of a training algorithm. It is obtained by training a network with arbitrary data and evaluating the average number of weight updates in a second. The connections per second (CPS) measure is used to evaluate performance in the forward pass. It can be obtained by measuring the number of weighting multiplications performed in a second.
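As a concrete check, both measures can be computed directly from the network dimensions and the measured per-sample times; the helper below is an illustrative sketch, not the measurement code used in the paper. Applied to the first row of Table 3, a 2-4-1 network has 2·4 + 4·1 = 12 connections, so 12 / 161 µs ≈ 75 000 CPS and 12 / 717 µs ≈ 17 000 CUPS.

```c
/* Connections per second (CPS) and connection updates per second (CUPS)
 * computed from the dimensions of a 1-hidden-layer perceptron and the
 * measured per-sample times. Illustrative sketch only. */
typedef struct { double cps; double cups; } perf_t;

perf_t mlp_performance(int X, int H, int Y,
                       double t_forward_s, double t_training_s)
{
    perf_t p;
    long connections = (long)X * H + (long)H * Y;  /* weights of the network */

    p.cps  = connections / t_forward_s;   /* weighting multiplications per second */
    p.cups = connections / t_training_s;  /* weight updates per second            */
    return p;
}

/* Example: mlp_performance(2, 4, 1, 161e-6, 717e-6) gives roughly
 * 75 000 CPS and 17 000 CUPS, in line with the first row of Table 3. */
```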

The CPS and CUPS values are distinctly dependent upon network size as demonstrated in Table 3. The best CPS and CUPS values are achieved by using large networks but in terms of execution time it is appropriate to use as small networks as possible. A comparison to other neurocomputers is shown in Table 4.

6. Conclusions

We presented a parallel back-propagation implementation that demonstrates the computing power of a parallel computer based on a tree shape communication network. It was pointed out that a built-in global adder can be efficiently utilized in conjunction with broadcasting capabilities to boost back-propagation training performance. It was also noticed that the global adding capability makes the system more scalable, preventing bottlenecks that often exist in parallel systems.

Our current work includes an enhanced parallel architecture which employs an additional global ring bus to connect the processing units to each other. The ring bus is used for broadcasting while the communication tree is devoted to adding and comparison purposes. We believe that distributed computing systems designed specifically

Table 2
Number of communication bus transactions in a single parallel back-propagation training cycle for four different model architectures are shown. H and P refer to the number of hidden neurons and processing units, respectively

Step        Architecture A     Architecture B     Architecture C     Architecture D

FW2         0                  0                  0                  0
FW3         H(P + 1)           2H                 H(P + 1)           2H
FW4         0                  0                  0                  0
FW5         P                  P                  1                  1
BW1         0                  0                  0                  0
BW2-4       H(P + 1)           H(P + 1)           2H                 2H
BW5         0                  0                  0                  0
BW6         0                  0                  0                  0

All steps   2H(P + 1) + P      H(P + 3) + P       H(P + 3) + 1       4H + 1



Table 3 Measured performance of the parallel back-propagation algorithm in terms of connections per second (CPS) and connection updates per second (CUPS)

Input nodes   Hidden neurons   Output neurons   CPS       Time/sample in forward pass   CUPS     Time/sample in training

2             4                1                75 000    161 µs                        17 000   717 µs
2             16               1                125 000   385 µs                        21 000   2280 µs
2             999              1                159 000   18.8 ms                       23 000   130 ms
63            8                3                564 000   937 µs                        83 000   6370 µs
63            80               3                645 000   8190 µs                       86 000   61.7 ms

Table 4
Comparison of the TUTNC neurocomputer to other neurocomputers. Multilayer perceptron performance is shown in terms of million connections per second (MCPS) and million connection updates per second (MCUPS)

Neurocomputer     Processing elements   MCPS   MCUPS

TUTNC             4                     0.65   0.086
TUTNC [14]        32 (estimated)        1.9    -
BSP400 [8]        400                   6.4    -
MUSIC [9]         63                    820    330
CNAPS [11]        512                   5700   1460
MM32k [25]        32 768                4.9    2.5
SYNAPSE-1 [26]    32                    800    -
CM-5 [27]         544                   76     -

for high-performance applications can achieve excellent performance in comparison with sequential systems.

Acknowledgements

This research work has been supported by the Academy of Finland and the Graduate School on Electronics, Telecommunication, and Automation.

Appendix A: Neurocomputer details

The functional structure of the communication unit is


Fig. 8. The communication unit enables numerous ways to transfer data between the processing units and the host computer. Transferring data from the host to a single processing unit is writing and transferring data from a single processing unit to the host is reading. Writing to all the processing units at the same time, i.e. broadcasting, is an efficient operation for a tree shape network. In addition, the communication unit is an active processing element and allows comparison and summation of the data being transferred towards the host.

shown in Fig. 8. The physical implementation of the communication network consists of the communication units, which are implemented using reconfigurable Xilinx XC4005 field programmable gate array chips, and three buses: the command bus, address bus, and data bus. The command bus is used to deliver commands from the host computer to the communication units and the processing units. The address bus addresses processing units


Fig. 9. The processing unit is equipped with a digital signal processor (DSP), a control unit (CRU) and 64 kilobytes of local random-access memory (RAM). The digital signal processor is Texas Instruments' TMS320C25 running at 40 MHz [28], a 16-bit fixed-point arithmetic signal processor that meets the prerequisites of low cost, good availability and adequate performance. The control unit is implemented in a single Xilinx XC4005 field programmable gate array chip and it is connected to the communication network, the digital signal processor and the random-access memory. It controls bus transactions between the processing unit and the communication network.



in non-broadcast operations. The data bus is a bidirectional 16-bit bus which transfers data between the processing units and the host. In addition, the communication network contains separate flag signals which are used for software synchronization purposes. These buses are also visible in Fig. 9, which shows the structure of the processing unit.

The interface unit connects the neurocomputer to the host computer. The range of computers which can host the neurocomputer is not restricted to any particular computer architecture. The currently implemented interface unit contains the interface protocol needed to connect the neurocomputer to an industry standard architecture compliant bus adapter. The interface unit provides software developers with a set of registers which are mapped into the I/O address space of the host.

In summary, our neurocomputer can be regarded as a master-slave architecture, the host acting as a master and the processing units as slaves. This fact is confirmed by the observation that there is no direct interconnection between the processing units. If a processing unit needs to communicate with another processing unit, it must do it by first sending all data items to the host. Also, all communication network transactions are initiated by the host.
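The global add performed by the tree of communication units on data moving towards the host (Fig. 8) can be modelled conceptually by a short recursion; this is a sketch of the behaviour, reducing P leaf values in log2(P) levels, not the FPGA implementation.

```c
/* Conceptual model of the global add in the tree shape communication
 * network: each communication unit sums the words arriving from its left
 * and right arms and passes the result towards the host, so the partial
 * results of the processing units (leaves) are reduced level by level. */
double tree_global_add(const double *pu_values, int first, int count)
{
    if (count == 1)
        return pu_values[first];                 /* a processing unit (leaf) */
    int half = count / 2;                        /* count is a power of two  */
    return tree_global_add(pu_values, first, half)                 /* left  */
         + tree_global_add(pu_values, first + half, count - half); /* right */
}
```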

Appendix B: Summary of notations

In the following are summarized the notations used in the sequential and parallel back-propagation algorithms which we present in this paper.

X         Number of input nodes.
Y         Number of output neurons.
H         Number of hidden neurons.
L         Number of network layers (hidden and output).
x         Network input vector consisting of X inputs, $x = [x_1, x_2, \ldots, x_X]^T$.
y         Network output vector consisting of Y outputs.
t         Target vector consisting of Y outputs.
h         Hidden layer output vector consisting of H outputs.
$w_{lji}$     Weight which connects the ith neuron or input node in layer l-1 to the jth neuron in layer l, with l = 0 referring to the input layer, l = 1 referring to the hidden layer, and l = 2 referring to the output layer.
$net_{lj}$    Net input (or activation signal) to the jth neuron in layer l.
$o_{lj}$      Output of the jth neuron in layer l. For notational convenience let $o_{0j}$ refer to $x_j$ and $o_{Lj}$ refer to $y_j$.
$N_l$         Number of neurons or input nodes in layer l.
$\Delta w$    Weight change.
$\Delta_p w$  Previous weight change.
g         Activation function.
$e_{lj}$      Error term of the jth neuron in layer l.
$E_s$         Sum of squared errors for the sth training sample.
$\eta$        Learning rate parameter.
$\alpha$      Momentum rate parameter.
P         Number of processing units.
$PU^p$        pth processing unit. Superscription is used in general to denote the processing unit in question; for instance, $H^p$ equals the number of hidden neurons stored in $PU^p$.
sum       Sum of products.

References

[1] M. Melton, T. Phan, D. Reeves and D. van den Bout, The TInMANN chip, IEEE Trans. Neur. Netw. 3 (May 1992) 375-384.

[2] J. Tomberg and K. Kaski, Some IC implementations of artificial neural networks using synchronous pulse-density modulation technique, Int. J. Neur. Syst. 2 (1991) 101-114.

[3] 80170NX Electrically Trainable Analog Neural Network, Data Sheets, Intel corp., CA, USA (1991).

[4] M.P. Perrone and L.N. Cooper, The Ni1000: high speed parallel VLSI for implementing multilayer perceptrons, in J.D. Cowan, G. Tesauro and J. Alspector (eds) Advances in Neural Information Processing Systems, Vol. 8, Morgan Kaufmann, San Mateo, CA, 1995.

[5] R. Hecht-Nielsen, Neurocomputing: picking the human brain, IEEE Spectr. 25 (1988) 36-41.

[6] Delta/Sigma/ANSim, Editorial, Neurocomputing 2 (1988).

[7] R. Hecht-Nielsen, Neurocomputing, Addison-Wesley, MA, USA, 1990.

[8] J. Heemskerk, J. Hoekstra, J. Murre, L. Kemna and P. Hudson, The BSP400: a modular neurocomputer, Microprocessors Microsyst. 18 (March 1994) 67-78.

[9] U. Müller, A. Gunzinger and W. Guggenbühl, Fast neural net simulation with a DSP processor array, IEEE Trans. Neur. Netw. 6 (January 1995) 203-213.

[10] T. Nordström, Doctoral Thesis, Division of Computer Science and Engineering, Luleå University of Technology, Sweden, 1995.

[11] D. Hammerstrom, A highly parallel digital architecture for neural network emulation, in J. Delgado-Frias and W. Moore (eds) VLSI for Artificial Intelligence and Neural Networks, Plenum, New York, 1990.

[12] U. Ramacher, W. Raab, J. Anlauf, J. Beichter, U. Hachmann, N. Bruls, M. Wesseling, E. Sicheneder, R. Männer, J. Gläss and A. Wurtz, Multiprocessor and memory architecture of the neurocomputer SYNAPSE-1, in Proc. 3rd Int. Conf. on Microelectronics for Neural Networks, MICRONEURO-93, Edinburgh, Scotland, April 1993, pp. 227-231.

[13] J. Makhoul, A. El-Jaroudi and R. Schwartz, Formation of disconnected decision regions with a single hidden layer, in Proc. Int. Joint Conf. on Neural Networks, Washington, DC, Vol. 1, 1989, pp. 455-460.

[14] T. Hämäläinen, J. Saarinen and K. Kaski, TUTNC: a general purpose parallel computer for neural network computations, Microprocessors Microsyst. 19 (1995) 447-465.

[15] MPI: A Message-Passing Interface Standard, Message Passing Interface Forum, University of Tennessee, USA, 1994.

[16] TMS320C2X/C5X Optimizing C Compiler User's Guide, Texas Instruments Inc., USA, 1991.

[17] Borland C++v4.02 Documentation, Borland International Inc., USA 1993.

[18] P.J. Werbos, PhD Thesis, Harvard University, MA, USA, 1974.



[19] D.E. Rumelhart and J.L. McClelland, (eds) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol 1, MIT Press, MA, USA, 1986.

[20] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing, NY, USA, 1994.

[21] J. Freeman and D. Skapura, Neural Networks: Algorithms, Applications, and Programming Techniques, Addison-Wesley, MA, USA, 1991.

[22] H. Klapuri, Software development for a general-purpose neurocomputer, Report 4-95, Electronics Laboratory, Tampere University of Technology, Finland, 1995.

[23] P. Kotilainen, J. Saarinen and K. Kaski, A multiprocessor architecture for general-purpose neurocomputing applications, in Proc. 3rd Int. Conf. on Microelectronics for Neural Networks, MICRONEURO-93, Edinburgh, UK, April 1993, pp. 35-45.

[24] H. Yoon, J. Nang and S. Maeng, Parallel simulation of multilayered neural networks on distributed-memory multiprocessors, Microprocessing Microprog. 29 (October 1990) 185-195.

[25] MM32k Preliminary Documentation, Current Technology Inc, USA, 1995.

[26] SYNAPSE-1 Model N110 Technical Description, Siemens Nixdorf Inc, Germany, 1995.

[27] X. Liu and W. Wilcox, Benchmarking of the CM-5 and the Cray machines with a very large back-propagation neural network, in Proc. IEEE Int. Conf. Neural Networks, Orlando, Florida, USA, 1994, pp. 22-27.

[28] TMS320C2X User's Guide, Texas Instruments Inc, USA, 1993.

Harri Klapuri was born in Finland on October 4, 1970. He studied software systems, computer engineering and mathematics in the Information Engineering Department at Tampere University of Technology, where he received an MSc degree in 1995. He is currently working in the Electronics Laboratory at TUT. His research interests include parallel processing, compilers and programming languages.

Timo Hämäläinen was born in Finland on June 13, 1968. He studied analogue and digital electronics, computer architecture and power electronics in the Electrical Engineering Department at Tampere University of Technology, where he received an MSc degree in 1993. He is currently working in the Electronics Laboratory at TUT. His PhD research concerns parallel processing of adaptive, intelligent algorithms in a special multiprocessor computer architecture.

Jukka Saarinen was born in Finland on July 11, 1961. He studied computer architecture, digital techniques, telecommunications and software engineering in the Electrical Engineering Department at Tampere University of Technology, where he received an MSc degree in 1986, a Licentiate of Technology degree in 1989, and a Doctor of Technology degree in 1991. Currently he is Professor of Computer Engineering at Tampere University of Technology. His research interests are parallel processing, neural networks, fuzzy logic and pattern recognition.

Kimmo Kaski was born in Finland on April 20, 1950. He received an MSc degree in 1973 and a Licentiate of Technology degree in 1977 from the Department of Electrical Engineering at Helsinki University of Technology, Finland. He finished his PhD degree in the Theoretical Physics Department at Oxford University in 1981. Currently he is Professor of Electronics at Tampere University of Technology, Tampere, Finland. He is acting as research leader in parallel processing hardware and software, neural network modelling and computational physics.