
Neural Processing Letters 7: 169–184, 1998. © 1998 Kluwer Academic Publishers. Printed in the Netherlands.


Parallel Coarse Grain Computing of Boltzmann Machines

JULIO ORTEGA, IGNACIO ROJAS, ANTONIO F. DIAZ and ALBERTO PRIETO
Departamento de Arquitectura Tecnología de Computadores, Universidad de Granada
E-mail: [email protected]

Abstract. The resolution of combinatorial optimization problems can greatly benefit from the parallel and distributed processing which is characteristic of neural network paradigms. Nevertheless, the fine grain parallelism of the usual neural models cannot be implemented in an entirely efficient way either in general-purpose multicomputers or in networks of computers, which are nowadays the most common parallel computer architectures. Therefore, we present a parallel implementation of a modified Boltzmann machine where the neurons are distributed among the processors of the multicomputer, which asynchronously compute the evolution of their subset of neurons using values for the other neurons that might not be updated, thus reducing the communication requirements. Several alternatives to allow the processors to work cooperatively are analyzed and their performance detailed. Among the proposed schemes, we have identified one that allows the corresponding Boltzmann machine to converge to high-quality solutions and which provides a high acceleration over the execution of the Boltzmann machine in uniprocessor computers.

Key words: Boltzmann machine, combinatorial optimization, multicomputers and networks of computers, NP-complete problems, parallel processing

1. Introduction

Multicomputers and Networks of Workstations (NOWs) have become the most widely extended parallel computer architectures because they represent a good choice in terms of cost/performance and scalability. In these architectures, each processor has its own local memory and contacts the other processors of the system through an interconnection network or a local area network, thus corresponding to a coarse grained architecture with the memory distributed among the processing nodes. As the cost of communication between the processing nodes is high, a great volume of computation must be performed between subsequent communications in order to achieve appropriate efficiency.

The problem addressed in this paper is that of providing an effective neural paradigm which would be well suited to take advantage of these distributed memory architectures for searching and optimization problems. Although neural networks process the information in a distributed and massively parallel way, the parallelism offered corresponds to a fine grain parallelism due to the great requirements of communication and synchronization they pose. Thus, it can be argued that neural paradigms are more suited for fine grain computer architectures, rather than the distributed memory architectures we are interested in. Nevertheless, here we show that efficient implementations of neural models in general-purpose multicomputers are possible.

The neural model considered in this paper is the Boltzmann machine (BM) [1], which has been widely studied and applied to combinatorial optimization problems. As its high processing cost might limit its applicability to real problems, the use of parallelism in an efficient way could be quite beneficial. As is well known, the BM is a stochastic feedback neural network consisting of binary neurons connected by symmetric weights. If the Boltzmann machine has N neurons, the weights associated with their interconnections define an N×N matrix, called the weight matrix, in which the element W(i,j) (i,j = 1,2,...,N) corresponds to the connection between the neurons i and j. The state of neuron i is indicated as S(i) (i = 1,...,N) and can take the values 0 or 1. The set of values for all the neurons in the machine, (S(1),S(2),...,S(N)), is called the configuration of the BM, S[m], where m = 2^0 S(1) + 2^1 S(2) + ... + 2^(N−1) S(N). With a BM it is possible to associate a consensus function, CS[m], in order to evaluate the overall desirability of the activated connections in a given configuration; it is defined as

CS[m] = ∑_{i=1..N} ∑_{j≥i} W(i,j) S(i) S(j)    (1)

Given the configuration S[m], the difference in the consensus, dC(i), when the state of neuron i changes while the states of the remaining neurons are unchanged is given by

dC(i) = (1 − 2S(i)) ( ∑_{j≠i} W(i,j) S(j) + W(i,i) )    (2)

and the probability of accepting this state transition in neuron i is

AT(dC(i)) = 1 / (1 + e^(−dC(i)/T))    (3)

where T denotes the value of a control parameter usually called temperature. As only one neuron is allowed to change its state at a given configuration, the BM implements a local optimization procedure in which the neighbourhood of a given configuration (solution) is the set of configurations (solutions) at a Hamming distance of one.
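As a concrete illustration, Equations (1)–(3) can be sketched in Python (our own notation, not the authors' code; `W` is the symmetric weight matrix as a list of lists and `S` the binary configuration):

```python
import math
import random

def consensus(W, S):
    """Consensus function of Eq. (1): sum over i and j >= i of W(i,j)S(i)S(j)."""
    N = len(S)
    return sum(W[i][j] * S[i] * S[j] for i in range(N) for j in range(i, N))

def delta_consensus(W, S, i):
    """Difference in consensus (Eq. (2)) when neuron i flips while the
    remaining neurons keep their states."""
    others = sum(W[i][j] * S[j] for j in range(len(S)) if j != i)
    return (1 - 2 * S[i]) * (others + W[i][i])

def heatbath_step(W, S, T):
    """One heatbath move (Eq. (3)): flip a random neuron with
    probability 1 / (1 + exp(-dC/T))."""
    i = random.randrange(len(S))
    if random.random() < 1.0 / (1.0 + math.exp(-delta_consensus(W, S, i) / T)):
        S[i] = 1 - S[i]
```

For a symmetric W, `delta_consensus(W, S, i)` equals `consensus` after the flip minus `consensus` before it, which is what makes the local rule consistent with the global objective.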

Equations (1) to (3) describe the 'heatbath' algorithm for a BM. It is also possible to use the Metropolis algorithm, in which a change in a neuron with dC(i) ≥ 0 is always accepted, whereas a change with dC(i) < 0 is accepted with probability AT(dC(i)) = exp(dC(i)/T). In a minimization problem, a change with dC(i) ≤ 0 is always accepted, whereas if dC(i) > 0, the change is accepted with probability AT(dC(i)) = exp(−dC(i)/T).
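The Metropolis test just described might be sketched as follows (illustrative Python, not from the paper; the `minimize` flag selects between the maximization and minimization variants):

```python
import math
import random

def accept_metropolis(dC, T, minimize=False):
    """Metropolis test: an improving change is always accepted; a
    worsening one is accepted with probability exp(-|dC|/T)."""
    if minimize:
        return dC <= 0 or random.random() < math.exp(-dC / T)
    return dC >= 0 or random.random() < math.exp(dC / T)
```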

The BM has been implemented in both general-purpose [2] and specific-purpose parallel architectures [3,4]. These are able to speed up the evolution of the BM with respect to the number of neurons but, due to the synchronization and communication requirements, the use of a high number of processors is not efficient when either the neurons are highly interconnected or it is not possible to find an adequate clustering of neurons that reduces the communications among the processors where the BM has been distributed. Thus, it would be very convenient to have a neural paradigm with low communication and synchronization requirements in order to take advantage of the availability of general-purpose architectures. This goal is similar to that of works like [8], which analyzes the effect of reducing communications in a parallel implementation of the simulated annealing algorithm, or [10], which considers the possibility of accelerating, by a factor greater than the number of processors used, the obtention of sufficiently good solutions for NP-complete problems.

In the next section, the model proposed for implementing the BM in parallel is described. The idea is to distribute the neurons among the processors. As each processor computes the changes in its subset of neurons while considering the rest of the neurons to be clamped, the set of processors searches in different and smaller zones of the solution space in order to find a local optimum. The solution found is communicated by each processor to the others, and any processor receiving this information uses it to guide its search within new subspaces where better solutions could be found. Thus, the procedure implemented by each processor can also be seen as a local search heuristic mixed with transitions to non-neighbour configurations. This method can be included in the class of large-step optimization methods [5,6], which are defined by a procedure to perform the local search, a procedure to perform the large-step transitions to non-local solutions, and an accept/reject test. In the present case, the use of several processors and the characteristics of the BM make it possible to exploit the work done by the remote processors in order to drive the large-step transitions of each processor and speed up the search. Different options to allow the processors to work cooperatively through interactions among them are presented in Section 2, and their performance is analyzed in Section 3. Finally, Section 4 gives some conclusions.

2. Parallel evolution of Boltzmann Machines

In this section we present the parallel BM implementation proposed to allow an efficient resolution of the corresponding optimization problem in a general-purpose multicomputer or in a NOW whilst maintaining the convergence towards sufficiently good solutions. The neurons of the BM are distributed among the P processors (k = 1,2,...,P) as shown in Figure 1a, using as an example a BM with eight


Table I. Update rules for Step 2 of the parallel evolution (Figure 2).

UR1 (direct assignment):
    For all neurons i of processor q do Sk(i) = Sq(i);

UR2 (compares the consensus of the local configuration, Sk[mk], with the consensus of the remote configuration, Sq[mq]):
    dCq,k = CSq[mq] − CSk[mk];
    if ((dCq,k < 0) or (random() < AT(dCq,k))) then
        for all neurons i of processor q do Sk(i) = Sq(i);

UR3 (evaluates the change in the local consensus of processor k due to a transition in neuron i, belonging to remote processor q; all the weights connecting the remote neuron i with other neurons are required):
    For j := 1 to Llocal do
    begin
        Random_Selection(neuron i of remote processor q);
        /* i is a neuron of q such that Sq(i) ≠ Sk(i) */
        dCk,q(i) = (1 − 2Sk(i)) ( ∑_{j≠i} W(j,i) Sk(j) + W(i,i) );
        if ((dCk,q(i) < 0) or (random() < AT(dCk,q(i)))) then Sk(i) = 1 − Sk(i);
    end

UR4 (estimates the change in the local consensus of processor k using only the weights connecting neurons of k to neuron i, belonging to remote processor q):
    For j := 1 to Llocal do
    begin
        Random_Selection(neuron i of remote processor q);
        /* i is a neuron of q such that Sq(i) ≠ Sk(i) */
        dCk,q(i) = (1 − 2Sk(i)) ∑_{j∈Pk} W(j,i) Sk(j);
        if ((dCk,q(i) < 0) or (random() < AT(dCk,q(i)))) then Sk(i) = 1 − Sk(i);
    end
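The two extreme rules of Table I, UR1 and UR4, might be sketched as follows (illustrative Python under our own data layout, not the authors' code: `S_local`/`S_remote` are full local configurations, `neurons_of_q`/`neurons_of_k` the index sets owned by each processor, and `AT` the acceptance probability of Eq. (3)):

```python
import math
import random

def AT(dC, T):
    """Acceptance probability of Eq. (3)."""
    return 1.0 / (1.0 + math.exp(-dC / T))

def ur1(S_local, S_remote, neurons_of_q):
    """UR1: direct assignment of processor q's own neurons."""
    for i in neurons_of_q:
        S_local[i] = S_remote[i]

def ur4(W, S_local, S_remote, neurons_of_q, neurons_of_k, L_local, T):
    """UR4: estimate dC with only the weights joining the neurons of k to
    the candidate remote neuron i, then apply the acceptance test."""
    disagreeing = [i for i in neurons_of_q if S_remote[i] != S_local[i]]
    for _ in range(min(L_local, len(disagreeing))):
        i = random.choice(disagreeing)
        dC = (1 - 2 * S_local[i]) * sum(W[j][i] * S_local[j] for j in neurons_of_k)
        if dC < 0 or random.random() < AT(dC, T):
            S_local[i] = 1 - S_local[i]
```

Note that `ur4` never touches the weights between two remote processors, which is what allows each node to store only a slice of the weight matrix.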


Figure 1. (a) Distribution of the neurons among the processors, and (b) functional blocks implemented by a given processor.

neurons and four processors. The neurons associated with processor k are called neurons of k. Nevertheless, to compute the changes in its neurons (using (2)), a processor also needs the values of all the neurons connected to them. Thus, each processor k stores in its local memory a local configuration denoted as Sk[mk] = (Sk(1),Sk(2),...,Sk(N)), where Sk(j) is the state of neuron j in the local configuration of processor k. In this way, the local configuration of processor k has two kinds of components: those corresponding to the neurons of k, and those corresponding to the neurons assigned to the remaining processors, called the remote neurons of k. When required, the components of a given local configuration are noted with superindices. Thus, S^q_k(i) (q = 1,2,...,P) refers to the state, in the local configuration of processor k, of neuron i (which is a neuron of processor q). If q = k, S^k_k(i) is the state of neuron i, which is one of the neurons of processor k; otherwise i is a remote neuron of k. As we will see, the values S^q_k(i) at a given instant do not necessarily coincide with S^q_q(i); thus they are not necessarily updated.

The P processors implementing the evolution of the parallel Boltzmann machine interact at some instants and alternate two processing phases or steps, as indicated in the algorithm shown in Figure 2, whose functional blocks are provided in Figure 1b. In the first phase (Step 1), each processor k evolves (as a sequential Boltzmann machine) by changing in the local configuration, Sk[mk], only its subset of neurons (the values S^k_k(i)), keeping its remote neurons (the values S^q_k(i), with q ≠ k) clamped and acting as parameters. In this way, in Step 1 each process searches within a reduced subspace defined by the clamped states of its remote neurons.

174 JULIO ORTEGA ET AL.

Figure 2. Evolution of the proposed Parallel Boltzmann Machine.

In the second phase (Step 2), the processor receives remote configurations from other processors and interacts with them through one of the Update Rules (UR1 to UR4) shown in Table I. The Update Rule defines the way in which the remote neurons of k (the values S^q_k(i), with q ≠ k, in Sk[mk]) are modified according to the states of the neurons of Sq[mq], which come from processor q. In this step, the local neurons in Sk[mk] are clamped while the remote neurons in Sk[mk] change according to their state in the remote configuration and the Update Rule used. Each processor thus takes advantage of the search carried out by the remote processors in their Step 1, and the evolution of the Boltzmann machine in each process alternates two clamping phases. In [7], the behaviour of a Boltzmann machine when some neurons are clamped to a fixed state is analyzed. As indicated there, the effect of clamping is only to change the probability of selecting the neuron that evaluates a change in its state. Once the neuron has been chosen, the probability AT(dCk(i)) determines the value of the acceptance probability as in the sequential Boltzmann machine, in which all N neurons evolve without clamping.
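One processor's alternation of the two phases can be sketched as follows (illustrative Python; message passing is abstracted into the `remote_configs` argument, and `update_rule` stands for one of UR1–UR4):

```python
import math
import random

def step1(W, S, my_neurons, trials, T):
    """Step 1: heatbath moves restricted to the processor's own neurons;
    the remote neurons stay clamped and act as parameters."""
    N = len(S)
    for _ in range(trials):
        i = random.choice(my_neurons)
        dC = (1 - 2 * S[i]) * (sum(W[i][j] * S[j] for j in range(N) if j != i)
                               + W[i][i])
        if random.random() < 1.0 / (1.0 + math.exp(-dC / T)):
            S[i] = 1 - S[i]

def step2(S, remote_configs, update_rule):
    """Step 2: the local neurons are clamped; the remote neurons of the
    local configuration are modified from the received configurations."""
    for q, S_q in remote_configs.items():
        update_rule(S, S_q, q)
```

Because Step 1 only ever selects indices in `my_neurons`, the clamped remote states are guaranteed to survive the phase unchanged, which is the property the clamping analysis of [7] relies on.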

The interaction among configurations, which takes place in a given processor according to the Update Rule used, should produce a diversification of the local configurations in order to drive each processor to a different solution subspace, which is explored afterwards in Step 1. The Update Rules of Table I correspond to some direct alternatives according to the usual evolution of a sequential Boltzmann machine (other URs can also be defined). Their performance is presented in Section 3. In UR1, the transition is always accepted and the states of the neurons S^q_k(i) in Sk[mk] are updated with the corresponding values in the remote configuration Sq[mq] coming from processor q. In UR2, it is supposed that each processor also computes the consensus of its respective local configuration. Thus, each processor sends to the remote processors the value of its consensus together with the corresponding local configuration, and the transition is accepted or not according to the increment in the consensus function between the local and the received configurations. In UR3 and UR4, the local configuration of processor k and the remote configuration coming from a remote processor, for example from processor q, are compared in processor k. This processor considers transitions comprising changes in only one of its remote neurons belonging to processor q. While UR3 uses the correct value of the increment in the consensus, UR4 estimates this increment by using only the weights connecting the local neurons of the processor to the rest of the neurons. Thus, the weights connecting the neurons of remote processor q to the neurons of processors other than processor k are considered as zero. This also means a high memory saving with respect to UR3, which needs to store the whole weight matrix in the local memory of each processor.

The effects of the Update Rules of Table I are illustrated in Figures 3a and 3b. These figures correspond to a BM with six neurons distributed between two processors, P1 and P2, and show the possible subspaces explored by processor P1 after receiving the configuration (101 100) from processor P2. The rules UR1 and UR2 produce transitions in P1 of the type represented in Figure 3a, while the transitions in P1 due to UR3 and UR4 are shown in Figure 3b. Further differences between UR1 and UR2, and between UR3 and UR4, correspond to the criterion applied to accept the transition (described in Table I).

In previous parallel BM implementations [2], the values of the remote neurons are simply updated in the local configuration of a given processor k whenever a remote configuration is received. This corresponds exactly to UR1 in Table I. In the next section it is shown that, supposing the same volume of communication, UR1 provides worse performance than the other URs. Nevertheless, it would be possible to improve the quality of the solution by reducing the time between communications. For the simulated annealing algorithm, there are studies which relate the number of trials without updated information (stream length) to the tolerable cost error to reach a sufficiently good solution [8]. The estimated value of the stream length for a given error level could be useful to determine the communication needs.

In the scheme proposed here, each processor performs an optimization process in the subspace defined by the states of its remote neurons, and the Update Rule decides the changes to be carried out according to the data coming from the remote processors. The convergence properties and performance of the parallel BM shown in Section 3 can be explained by considering that it implements a search with multiple interacting Markov chains (one per processor), thus reducing the probability of the solution being trapped in a local optimum, as is also shown in some asynchronous parallel implementations of simulated annealing [11].

Figure 3. Possible transitions in the local configuration of P1 due to interactions between P1 and P2 through different URs: (a) UR1/UR2; (b) UR3/UR4.

3. Experimental Results

Table II shows the experimental results obtained and compares the sequential implementation of a BM with the parallel implementations corresponding to the different URs of Table I. The results have been obtained by running a simulator which allows the selection of different numbers of processors and parameter values. In each example of BM considered, all the simulations have been executed with the same values for the parameters that control the annealing process for all URs. The number of interactions in Step 2 (Inter) has been fixed to be equal to P−1 (the number of remote processors).

The columns of Table II correspond to five Boltzmann machines, termed RND1, RND2, VC1, VC2, and VC3, all with 1024 neurons. The weights of RND1 and RND2 are real numbers which have been randomly selected, between −10.0


Table II. Experimental results for several Boltzmann machines (% Avg. Error / Niter). (Bold entries correspond to the smallest % Avg. Error and the smallest Niter in each BM.)

BM  Sch       RND1           RND2           VC1          VC2          VC3
SEQ  (1)      0.17/52        0.30/39        0.01/72      0.26/72      0.38/72
UR1  (8)      0.38/61        0.47/49        0.01/72      0.00/72      0.00/74
     (16)     6.15/71        0.77/63        0.01/72      0.00/72      0.00/72
     (32)     11.35/>10^4    7.49/72        0.01/72      0.00/72      0.00/73
     (64)     15.08/>10^4    17.26/>10^4    0.01/72      0.00/72      0.00/72
UR2  (8)      1.20/14        0.95/14        0.01/12      0.00/17      0.00/15
     (16)     2.49/34        1.18/24        0.01/27      0.00/25      0.00/27
     (32)     3.36/59        3.13/46        0.01/40      0.00/38      0.00/41
     (64)     7.24/91        6.59/80        0.01/64      0.00/62      0.00/61
UR3  (8)      0.76/51        0.65/38        0.01/68      0.00/69      0.00/70
     (16)     0.56/51        1.34/38        0.01/69      0.00/69      0.00/69
     (32)     0.44/51        1.04/39        0.01/69      0.00/69      0.00/69
     (64)     0.47/51        1.03/38        0.01/69      0.00/69      0.00/69
UR4  (8)      0.61/6         0.69/7         0.01/4       0.00/4       0.00/4
     (16)     0.78/7         0.54/6         0.01/3       0.00/4       0.00/4
     (32)     0.44/6         1.02/7         0.01/4       0.00/4       0.00/4
     (64)     0.52/6         0.59/7         0.01/4       0.00/4       0.00/4
BestSol       −39604.12      −111758.03     −149322.09   −300921.47   −513142.63

and 10.0 for RND1, and between −20.0 and 20.0 for RND2. These BMs are strongly connected because each neuron has non-zero weights in its connections to all the other neurons. The weights in VC1, VC2, and VC3 are defined, according to [9], to solve a minimum vertex cover problem for three different graphs whose associated BMs have different levels of connectivity. Thus, VC1 corresponds to a graph with a probability of 0.25 for a connection between two given graph nodes, VC2 to a probability of 0.50, and VC3 to a probability of 0.85. In the table, the row SEQ corresponds to the sequential (conventional) implementation of each BM, and rows UR1 to UR4 give the results for each UR described in Table I, for different numbers of processors. The best consensus value obtained for each BM is given in the row BestSol (Best Solution). In row SEQ, for each BM considered, the average % error with respect to the best solution found, and the average number of iterations needed to provide a solution within 10% error in its consensus with respect to BestSol (Niter), are provided. In each row URi, the same data as for row SEQ are given, but considering different numbers of processors (8, 16, 32, and 64).

Figure 4. Evolution to the final solution of different URs with 16 processors compared with SEQ (in RND1).

Figure 4 compares the consensus and the number of iterations required to reach the final solution for SEQ and for UR1 to UR4 (with 16 processors), in the case of the Boltzmann machine RND1. The curves for the different URs correspond to the evolution of one of the 16 processors, which has been randomly selected. The convergence curves for the other BMs present similar characteristics to those shown in this figure.

From Table II it is possible to draw some conclusions. First, it is clear that for both sequential and parallel schemes, the performance is worse for Boltzmann machines RND1 and RND2 than for Boltzmann machines VC1 to VC3. In RND1 and RND2 every neuron is connected with all the other neurons, and the weights can take any value in a continuous range. In any case, the errors are quite small in almost all the experiments. The convergence for VC1, VC2, and VC3 is very good in all the parallel schemes. As shown in Table II, the parallel schemes were even able to find better solutions for VC2 and VC3 than those obtained by the conventional algorithm (SEQ).

In scheme UR4, Niter is very small in comparison with that corresponding to the sequential execution of the BM and, what is more important, Niter does not change when the number of processors increases. Moreover, the solutions obtained by using UR4 are quite similar to, or even better than, those obtained by the sequential execution of the corresponding BMs (row SEQ).

The other interaction schemes for the parallel Boltzmann machine, i.e. UR1, UR2, and UR3, are worse than UR4. In general, the quality of the solutions obtained by these parallel schemes is slightly worse than that of the sequential execution, and Niter is similar to or even worse than that of the sequential execution. For example, Niter of UR1 for RND1 and RND2 grows with the number of processors and is higher than the number of iterations required by a sequential execution of RND1 and RND2. In UR1 for VC1, VC2, and VC3, and in UR3, the number of iterations does not change with the number of processors, but its value is similar to the number obtained by a sequential execution. Finally, the number of iterations for UR2 grows with the number of processors, and for more than 32 processors Niter is higher than that of the corresponding sequential execution.

The behaviour of Niter when the number of processors increases can be explained using Figures 3a and 3b. As the number of processors grows, the number of neurons assigned to a processor (i.e. N/P) decreases and the number of subspaces defined by the clamped neurons in Step 1 grows, because it is equal to 2^(N−N/P). In schemes UR1 and UR2, considering (as in the experiments) that each processor interacts with each of the remaining processors only once per iteration, and as each interaction produces only one transition towards a new subspace, the number of subspaces explored per iteration grows only as P, so the proportion of possible subspaces over explored subspaces also grows. Thus, in UR1 and UR2, as the number of processors is increased, more iterations are required to explore a similar portion of the solution space. However, in UR3 and UR4, Niter does not change with the number of processors. In these cases, as shown in Figure 3b, each processor might perform transitions to up to 2^(N−N/P) different subspaces. This number grows as fast as the number of possible subspaces, and the ratio of possible subspaces to explored subspaces remains approximately constant.
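The counting argument above can be made concrete with a back-of-the-envelope computation (illustrative, using the paper's N = 1024): the exponent N − N/P of the number of clamped-neuron subspaces grows with P, while UR1/UR2 explore only about P new subspaces per iteration.

```python
N = 1024  # neurons, as in the experiments of Section 3
for P in (8, 16, 32, 64):
    exponent = N - N // P  # the number of possible subspaces is 2**exponent
    # UR1/UR2: about P subspace transitions per iteration;
    # UR3/UR4: up to 2**exponent, tracking the growth of the possible subspaces
    print(f"P={P:2d}  possible subspaces = 2^{exponent}  UR1/UR2 explored/iter = {P}")
```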

Comparing UR3 and UR4, the best performance in speed corresponds to UR4. Moreover, as UR4 only uses the weights for connections between the local neurons of a given processor and the remote ones, each processor needs to store fewer weight values in its local memory. The differences between UR3 and UR4 can be justified from Figure 5, which compares, taking RND1 and 32 processors as an example, the number of processors with different values of consensus in each iteration (Figure 5a), the standard deviation of the consensus at the end of each iteration (Figure 5b), and the average of the cross correlation coefficients corresponding to the evolution of the consensus in each processor (Figure 5c). As can be seen, the standard deviation is lower for UR4 and decreases faster than the standard deviation of UR3. The number of processors with different consensus values is equal to the number of processors (32 in this case) while the consensus is far from the optimal value obtained. Furthermore, the average of the cross correlation coefficients among the consensus sequences obtained in each processor is close to one for UR4, whilst in the case of UR3 it slowly tends to one as the number of iterations grows. Thus, UR4 allows the processors to evolve in a more efficient way, exploring diverse areas of the solution space. Furthermore, the cooperation among processors, which is clear from the correlation between the consensus trajectories, keeps the solutions obtained by each processor close to one another, thus exploring solutions near to the best solution found at any moment.

In this way, UR4 provides good solutions with a reduced number of iterations, thus being able to reach a significant speedup with respect to the sequential execution. UR1 and UR2, although giving good solutions, do not provide an efficient use of a parallel machine because they are not able to speed up the obtention of the solution when the number of processors increases, and the quality of the solution is also degraded in some cases. The values of Niter provided in Table II show that the convergence speed in a processor does not change when the number of processors interacting through UR3 or UR4 grows.

The speedup (S) that may be attained with this scheme can be evaluated from the following expression:

S = T_1 / T_P = (N^1_iter · T^1_step1(N)) / (N^P_iter · [T^P_step1(N,P) + T^P_step2(N,P)])    (4)

where T^1_step1(N) is the time required by each iteration in the sequential execution (SEQ) of a Boltzmann machine with N neurons; T^P_step1(N,P) is the time per iteration of Step 1 of the parallel scheme when its N neurons are distributed among P processors; and T^P_step2(N,P) is the time required by Step 2. The values N^1_iter and N^P_iter correspond, respectively, to Niter for the sequential and the parallel schemes.

The value of T^1_step1(N) is proportional to the length of the Markov chain, L, used at each temperature. In our case, we have used L = N as in [1], so T^1_step1(N) = N · tcomp, with tcomp being the time required to compute a transition in a neuron. The value of T^P_step1(N,P) is equal to the product Llocal · tcomp and, as Llocal has been assumed to be proportional to N/P, Llocal = K_step1 · (N/P), thus T^P_step1 = K_step1 · (N/P) · tcomp. Finally, T^P_step2(N,P) depends on the time required for communication between the interacting processors and on the time required to compute the interaction, which is also considered proportional to N/P. Thus, T^P_step2(N,P) = Inter(P) · {(N/P) · tcomp + tcomm(P)}, where Inter(P) is the number of processors interacting in Step 2. In this way,

T^P_step2(N,P) = (N/P) · tcomp · K_step2(N,P)

where K_step2(N,P) = (1 + (P/N) · (tcomm(P)/tcomp)) · Inter(P), and the efficiency (E = S/P) can be expressed as:

E = S / P = T_1 / (P · T_P) = (N^1_iter / N^P_iter) · 1 / (K_step1(N,P) + K_step2(N,P))    (5)

Whenever the communication can be overlapped with the computation, the value of K_step2 can be considered approximately equal to Inter(P). Thus, for schemes such as UR4, where N^1_iter/N^P_iter does not change with the number of processors P, if the complexity of K_step1 and K_step2 is less than P, the speedup obtained increases with the number of processors. Moreover, whenever (K_step1 + K_step2) < (N^1_iter/N^P_iter) is verified, efficiencies higher than 1 are obtained (corresponding to superlinear speedups). If Inter(P) is set proportional to the number of processors, the speedup tends to N^1_iter/N^P_iter as P grows.

Figure 5. Comparison of UR3 and UR4 using 32 processors in the execution of RND1. (a) Number of processors with different consensus values vs. iterations. (b) Standard deviation, and (c) average of cross correlation coefficients among consensus sequences in different processors (vs. iterations).

Figure 6. Efficiencies obtained in the PARAMID multicomputer.

Figure 6 shows the average efficiencies experimentally obtained by running different Boltzmann machines (with UR4) that correspond to the vertex cover problem applied to randomly selected graphs of up to 256 nodes. For clarity, Figure 6 only provides the results for N = 64 and N = 256. The experiments have been carried out on a PARAMID multicomputer, with nodes based on the TTM200 board equipped with an Intel i860/XP processor, 16 Mbytes of memory, and a T805 Transputer with four bidirectional links and 4 Mbytes of memory. Up to eight processors were available for our experiments. As shown in Figure 6, it is possible to obtain efficiencies higher than one (superlinear speedups). The parameters of the above described speedup model have been determined by using expression (5) to approximate the experimental data. Thus, $N^{1}_{iter}/N^{P}_{iter}$ = 7.75, Kstep1 = 2.25, and Kstep2(N,P) = 0.65(P−1) have been obtained for N = 256; and $N^{1}_{iter}/N^{P}_{iter}$ = 5.0, Kstep1 = 2.60, and Kstep2(N,P) = 0.35(P−1) for N = 64. For N = 64 and P = 2, the experimental data present a high deviation with respect to expression (5). This can be explained by taking into account that, as the number of processors is low, the effect of cooperation among processors also decreases, with a corresponding reduction in $N^{1}_{iter}/N^{P}_{iter}$ for this small number of neurons.
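Plugging the fitted parameters quoted above into expression (5) reproduces the superlinear regime. A minimal check (our own sketch; only the numerical values come from the text):

```python
def efficiency(ratio, k_step1, k_step2):
    """Expression (5): E = (N1_iter / NP_iter) / (Kstep1 + Kstep2)."""
    return ratio / (k_step1 + k_step2)

# Fitted parameters from the text: N -> (N1_iter/NP_iter, Kstep1, slope c
# of Kstep2(N,P) = c*(P-1)).
fits = {256: (7.75, 2.25, 0.65), 64: (5.0, 2.60, 0.35)}
for n, (ratio, k1, c) in fits.items():
    for p in (2, 4, 8):
        e = efficiency(ratio, k1, c * (p - 1))
        print(f"N={n:3d}  P={p}  E={e:.2f}")
# For N = 256 the model stays superlinear (E > 1) up to P = 8;
# for N = 64 it falls just below E = 1 at P = 8.
```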


4. Conclusions

A new scheme for an efficient implementation of Boltzmann machines on coarse grain parallel computer architectures has been presented. Each processor alternates two computation phases (Step 1 and Step 2). In Step 1, the processor improves the consensus of the Boltzmann machine by changing only the states of the neurons assigned to it, while its remote neurons are considered to be clamped. These clamped neurons thus act as parameters or constraints, which are updated in Step 2 through interactions between the processors according to an Update Rule that guides the evolution by taking into account the work done by the remote processors in the corresponding Step 1. We have considered different Update Rules (UR1 to UR4), corresponding to different kinds of interactions between processors. The best performance in speed, for high-quality solutions, has been observed when using UR4, in which each processor considers only the weights connecting its local neurons with the rest in order to determine the transitions in Step 2.
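The alternation of the two phases can be sketched as follows. This is a minimal sequential simulation, entirely our own illustration: the block partition, the annealing schedule, and all helper names are assumptions, and Step 2 is given a UR4-like flavour in which each processor simply adopts the fresh local states computed by the owners of its remote (clamped) neurons:

```python
import math
import random

def consensus_delta(w, state, k):
    """Change in consensus if binary neuron k flips.

    w is symmetric; w[k][k] plays the role of the bias term.
    """
    s = sum(w[k][j] * state[j] for j in range(len(state)) if j != k)
    return (1 - 2 * state[k]) * (s + w[k][k])

def run_sketch(w, n, p, outer_iters, inner_iters, temp):
    # Each simulated "processor" keeps a private copy of the full state;
    # its local block is [lo, hi), the rest is clamped during Step 1.
    blocks = [(i * n // p, (i + 1) * n // p) for i in range(p)]
    states = [[random.randint(0, 1) for _ in range(n)] for _ in range(p)]
    for _ in range(outer_iters):
        # Step 1: each processor anneals only its own neurons.
        for proc, (lo, hi) in enumerate(blocks):
            st = states[proc]
            for _ in range(inner_iters):
                k = random.randrange(lo, hi)
                d = consensus_delta(w, st, k)
                # Accept consensus increases; accept decreases with
                # probability exp(d / temp) (simulated annealing rule).
                if d > 0 or random.random() < math.exp(d / temp):
                    st[k] = 1 - st[k]
        # Step 2 (UR4-like flavour): refresh each processor's clamped view
        # with the states just computed by the owners of those neurons.
        for proc in range(p):
            for other, (lo, hi) in enumerate(blocks):
                if other != proc:
                    states[proc][lo:hi] = states[other][lo:hi]
        temp *= 0.95  # simple cooling schedule (assumption)
    return states[0]
```

In a real multicomputer each block would run on its own node and Step 2 would be the only point requiring communication, which is what keeps the grain of the computation coarse.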

In this way, the present paper has set out the possibility of using models based on neural networks to obtain cooperative behaviour on present-day general-purpose parallel computers, despite their limitations due to communication and synchronization overheads. These procedures allow each processor to proceed with its search in a different subspace and to take advantage of the search performed by the other processors, in order to drive its own search towards a better solution and to obtain it in less time. Thus, it would be possible to speed up the resolution of some NP-complete problems by a factor greater than the number of processors used. Recently, this idea of overcoming the problems posed by the exponential complexity of a given optimization problem has also been considered, from a different point of view, in other papers, such as [10].

Acknowledgements

This paper has been partially supported by project TIC94-0506 (CYCIT, Spain). We would like to thank the Department of Computer Architecture of the University of Málaga (Spain) for allowing us to use their PARAMID multicomputer. We also want to thank the referees for their suggestions.

References

1. E.H.L. Aarts and J.H.M. Korst, Simulated Annealing and Boltzmann Machines, New York: Wiley, 1988.

2. D.H. Oh, J.H. Nang, H. Yoon and S.R. Maeng, "An efficient mapping of Boltzmann Machine computations onto distributed-memory multiprocessors", Microprocessing and Microprogramming, Vol. 33, pp. 223–236, 1991/92.

3. A. De Gloria, P. Faraboschi and S. Ridella, "A dedicated Massively Parallel Architecture for the Boltzmann Machine", Parallel Computing, Vol. 18, No. 1, pp. 57–75, 1993.

4. A. De Gloria and M. Olivieri, "An asynchronous distributed architecture model for the Boltzmann machine control mechanism", IEEE Trans. on Neural Networks, Vol. 7, No. 6, pp. 1538–1541, November, 1996.

5. O. Martin, S.W. Otto and E.W. Felten, "Large-step Markov chains for the TSP incorporating local search heuristics", Operations Research Letters, Vol. 11, pp. 219–224, May, 1992.

6. H.R. Lourenço, "Job-shop scheduling: Computational study of local search and large-step optimization methods", European J. of Operational Research, Vol. 83, pp. 347–364, 1995.

7. M. Livesey, "Clamping in Boltzmann machines", IEEE Trans. on Neural Networks, Vol. 2, No. 1, pp. 143–148, January, 1991.

8. C.-E. Hong and B.M. McMillin, "Relaxing Synchronization in Distributed Simulated Annealing", IEEE Trans. on Parallel and Distributed Systems, Vol. 6, No. 2, pp. 189–195, February, 1995.

9. V. Zissimopoulos, V.T. Paschos and F. Pekergin, "On the approximation of NP-complete problems by using the Boltzmann Machine method: The cases of some covering and packing problems", IEEE Trans. on Computers, Vol. 40, No. 12, pp. 1413–1418, December, 1991.

10. I. Pramanick and J.G. Kuhl, "An Inherently Parallel Method for Heuristic Problem-Solving: Part I – General Framework", IEEE Trans. on Parallel and Distributed Systems, Vol. 6, No. 10, pp. 1006–1015, October, 1995.

11. S.-Y. Lee and K.G. Lee, "Synchronous and asynchronous parallel simulated annealing with multiple Markov chains", IEEE Trans. on Parallel and Distributed Systems, Vol. 7, No. 10, pp. 993–1007, October, 1996.