
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

How to Find Different Neural Networks by Negative Correlation Learning

Yong Liu
School of Computer Science
The University of Aizu
Aizu-Wakamatsu, Fukushima 965-8580, Japan

E-mail: [email protected]

Abstract- Two penalty functions are introduced into negative correlation learning for finding different neural networks in an ensemble. One is based on the average output of the ensemble; the other is based on the classification. The idea of the penalty function based on the average output is to make each individual network produce an output value different from that of the ensemble on the same input. In comparison, the penalty function based on the classification leads each individual network to assign a class different from that of the ensemble on the same input. Experiments on a classification task show how negative correlation learning generates different neural networks with the two penalty functions.

I. INTRODUCTION

Neural network ensembles adopt the divide-and-conquer strategy. Instead of using a single network to solve a task, a neural network ensemble combines a set of neural networks which learn to subdivide the task and thereby solve it more efficiently and elegantly. A neural network ensemble offers several advantages over a monolithic neural network. The idea of designing an ensemble learning system consisting of many subsystems can be traced back to as early as 1958. Since the early 1990s, algorithms based on similar ideas have been developed in many different but related forms, such as neural network ensembles [1], [2], mixtures of experts [3], [4], [5], [6], various boosting and bagging methods [7], [8], [9], and many others. It is essential to find different neural networks in an ensemble because there is no improvement from combining identical neural networks. There are a number of methods for finding different neural networks, including independent training, sequential training, and simultaneous training.

A number of methods have been proposed to train a set of neural networks independently by varying the initial random weights, the architectures, the learning algorithm used, and the data [1], [10]. Experimental results have shown that networks obtained from a given network architecture with different initial random weights often correctly recognise different subsets of a given test set [1], [10]. As argued in [1], because each network makes generalisation errors on different subsets of the input space, the collective decision produced by the ensemble is less likely to be in error than the decision made by any of the individual networks.

Most independent training methods emphasised independence among the individual neural networks in an ensemble. One of the disadvantages of such methods is the loss of interaction among the individual networks during learning.

There is no consideration of whether what one individual learns has already been learned by other individuals. The errors of independently trained neural networks may still be positively correlated. It has been found that the combined results are weakened if the errors of the individual networks are positively correlated [11]. In order to decorrelate the individual neural networks, sequential training methods train a set of networks in a particular order [9], [12], [13]. Drucker et al. [9] suggested training the neural networks using the boosting algorithm. The boosting algorithm was originally proposed by Schapire [8]. Schapire proved that it is theoretically possible to convert a weak learning algorithm that performs only slightly better than random guessing into one that achieves arbitrary accuracy. The proof presented by Schapire [8] is constructive. The construction uses filtering to modify the distribution of examples in such a way as to force the weak learning algorithm to focus on the harder-to-learn parts of the distribution.

Most of the independent training methods and sequential training methods follow a two-stage design process: first generating the individual networks, and then combining them. The possible interactions among the individual networks cannot be exploited until the integration stage. There is no feedback from the integration stage to the individual network design stage. It is possible that some of the independently designed networks do not make much contribution to the integrated system. In order to use the feedback from the integration, simultaneous training methods train a set of networks together. Negative correlation learning [14], [15], [16] is an example of simultaneous training methods. The idea of negative correlation learning is to encourage different individual networks in the ensemble to learn different parts or aspects of the training data, so that the ensemble can better learn the entire training data. In negative correlation learning, the individual networks are trained simultaneously rather than independently or sequentially. This provides an opportunity for the individual networks to interact with each other and to specialise.

In this paper, two penalty functions are introduced into negative correlation learning for finding different neural networks in an ensemble. One is based on the average output of the ensemble; the other is based on the classification. The idea of the penalty function based on the average output is to make each individual network produce an output value different from that of the ensemble on the same input.


In comparison, the penalty function based on the classification leads each individual network to assign a class different from that of the ensemble on the same input. Experiments on a classification task show how negative correlation learning generates different neural networks with the two penalty functions.

The rest of this paper is organised as follows: Section II describes negative correlation learning with two different penalty functions; Section III discusses how negative correlation learning generates different neural networks on a pattern classification problem; and Section IV concludes with a summary of the paper and a few remarks.

Both theoretical and experimental results [11] have indicated that, when the individual networks in an ensemble are unbiased, averaging procedures are most effective in combining them when the errors of the individual networks are negatively correlated, and moderately effective when the errors are uncorrelated. There is little to be gained from averaging procedures when the errors are positively correlated. In order to create a population of neural networks that are as uncorrelated as possible, the mutual information between each individual neural network and the rest of the population should be minimised. Minimising the mutual information between each individual neural network and the rest of the population is equivalent to minimising the correlation coefficient between them.

II. NEGATIVE CORRELATION LEARNING

Given the training data set $D = \{(x(1), y(1)), \ldots, (x(N), y(N))\}$, we consider estimating $y$ by forming a neural network ensemble whose output is a simple average of the outputs $F_i$ of a set of neural networks. All the individual networks in the ensemble are trained on the same training data set $D$:

$$F(n) = \frac{1}{M}\sum_{i=1}^{M} F_i(n) \qquad (1)$$

where $F_i(n)$ is the output of individual network $i$ on the $n$th training pattern $x(n)$, $F(n)$ is the output of the neural network ensemble on the $n$th training pattern, and $M$ is the number of individual networks in the neural network ensemble.

The idea of negative correlation learning is to introduce a correlation penalty term into the error function of each individual network, so that the individual networks can be trained simultaneously and interactively. The error function $E_i$ for individual network $i$ on the training data set $D = \{(x(1), y(1)), \ldots, (x(N), y(N))\}$ in negative correlation learning is defined by

$$E_i = \frac{1}{N}\sum_{n=1}^{N} E_i(n) = \frac{1}{N}\sum_{n=1}^{N}\left[\frac{1}{2}\bigl(F_i(n) - y(n)\bigr)^2 + \lambda\, p_i(n)\right] \qquad (2)$$

where $N$ is the number of training patterns, $E_i(n)$ is the value of the error function of network $i$ at the presentation of the $n$th training pattern, and $y(n)$ is the desired output for the $n$th training pattern. The first term on the right-hand side of Eq. (2) is the mean-squared error of individual network $i$. The second term, $p_i$, is a correlation penalty function. The purpose of minimising $p_i$ is to negatively correlate each individual's error with the errors of the rest of the ensemble. The parameter $\lambda$ is used to adjust the strength of the penalty.

A. Penalty Based on the Average Output

The penalty function $p_i$ has the form

$$p_{ave,i}(n) = -\frac{1}{2}\bigl(F_i(n) - F(n)\bigr)^2 \qquad (3)$$

The partial derivative of $E_i$ with respect to the output of individual network $i$ on the $n$th training pattern is

$$\frac{\partial E_i(n)}{\partial F_i(n)} = F_i(n) - y(n) - \lambda\bigl(F_i(n) - F(n)\bigr) = (1-\lambda)\bigl(F_i(n) - y(n)\bigr) + \lambda\bigl(F(n) - y(n)\bigr) \qquad (4)$$

where we have made use of the assumption that the output of the ensemble, $F(n)$, has a constant value with respect to $F_i(n)$. The value of the parameter $\lambda$ lies in the range $0 \le \lambda \le 1$, so that both $(1-\lambda)$ and $\lambda$ are nonnegative. The BP algorithm [17] has been used for weight adjustment in the mode of pattern-by-pattern updating. That is, weight updating of all the individual networks is performed simultaneously using Eq. (4) after the presentation of each training pattern. One complete presentation of the entire training set during the learning process is called an epoch. Negative correlation learning based on Eq. (4) is a simple extension of the standard BP algorithm. In fact, the only modification needed is to calculate an extra term of the form $\lambda(F_i(n) - F(n))$ for the $i$th neural network.
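As a concrete illustration, the following Python fragment sketches the pattern-by-pattern update of Eq. (4) for an ensemble trained with the penalty $p_{ave}$. It is a minimal sketch, not the code used in the paper: the `SimpleMLP` class, the learning rate, and the toy data are assumptions introduced here for illustration only.

```python
import numpy as np

class SimpleMLP:
    """Hypothetical one-hidden-layer network with sigmoid units, standing in
    for the individual feedforward networks described in the paper."""
    def __init__(self, n_in, n_hidden, rng):
        self.W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.5, 0.5, n_hidden)
        self.b2 = 0.0

    def forward(self, x):
        self.x = x
        self.h = 1.0 / (1.0 + np.exp(-(x @ self.W1 + self.b1)))
        self.out = 1.0 / (1.0 + np.exp(-(self.h @ self.W2 + self.b2)))
        return self.out

    def backward(self, dE_dout, lr=0.1):
        # Backpropagate the externally supplied error signal dE/dF_i.
        d_out = dE_dout * self.out * (1.0 - self.out)
        d_h = d_out * self.W2 * self.h * (1.0 - self.h)
        self.W2 -= lr * d_out * self.h
        self.b2 -= lr * d_out
        self.W1 -= lr * np.outer(self.x, d_h)
        self.b1 -= lr * d_h

def ncl_epoch_ave(nets, X, y, lam=1.0):
    """One epoch of pattern-by-pattern negative correlation learning with
    the penalty p_ave; the error signal follows Eq. (4)."""
    for x_n, y_n in zip(X, y):
        F_i = np.array([net.forward(x_n) for net in nets])  # individual outputs
        F = F_i.mean()                                       # ensemble output, Eq. (1)
        for i, net in enumerate(nets):
            # dE_i/dF_i = (1 - lambda)(F_i - y) + lambda (F - y)
            net.backward((1.0 - lam) * (F_i[i] - y_n) + lam * (F - y_n))

# Toy usage: an ensemble of four networks on random 14-attribute patterns.
rng = np.random.default_rng(0)
X = rng.random((20, 14))
y = (X[:, 0] > 0.5).astype(float)
nets = [SimpleMLP(14, 10, rng) for _ in range(4)]
for _ in range(100):
    ncl_epoch_ave(nets, X, y, lam=1.0)
```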

From Eq. (4), we may make the following observations. During the training process, all the individual networks interact with each other through the penalty terms in their error functions. Each network $F_i$ minimises not only the difference between $F_i(n)$ and $y(n)$, but also the difference between $F(n)$ and $y(n)$. That is, negative correlation learning takes into account the errors of all the other neural networks while training each neural network.

For $\lambda = 1$, from Eq. (4) we get

$$\frac{\partial E_i(n)}{\partial F_i(n)} = F(n) - y(n) \qquad (5)$$

Note that the error of the ensemble for the $n$th training pattern is defined by

$$E_{ensemble} = \frac{1}{2}\left(\frac{1}{M}\sum_{i=1}^{M} F_i(n) - y(n)\right)^2 \qquad (6)$$

The partial derivative of $E_{ensemble}$ with respect to $F_i$ on the $n$th training pattern is

$$\frac{\partial E_{ensemble}}{\partial F_i(n)} = \frac{1}{M}\left(\frac{1}{M}\sum_{k=1}^{M} F_k(n) - y(n)\right) = \frac{1}{M}\bigl(F(n) - y(n)\bigr) \qquad (7)$$

In this case, we get

$$\frac{\partial E_i(n)}{\partial F_i(n)} \propto \frac{\partial E_{ensemble}}{\partial F_i(n)} \qquad (8)$$


The minimisation of the error function of the ensemble is achieved by minimising the error functions of the individual networks. From this point of view, negative correlation learning provides a novel way to decompose the learning task of the ensemble into a number of subtasks for different individual networks.
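For $\lambda = 1$, combining Eq. (5) with Eq. (7) makes the relation in Eq. (8) explicit: the two gradients differ only by the constant factor $M$,

$$\frac{\partial E_i(n)}{\partial F_i(n)} = F(n) - y(n) = M \cdot \frac{1}{M}\bigl(F(n) - y(n)\bigr) = M\,\frac{\partial E_{ensemble}}{\partial F_i(n)}.$$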

B. Penalty Based on the Classification

The penalty function $p_{ave,i}$ was first proposed for solving regression tasks. It has also been applied to classification tasks. Experimental studies on both regression and classification problems have shown that such a penalty function can successfully lead to different neural networks in an ensemble. However, for classification tasks, neural networks with different outputs on the same data sample do not necessarily assign different classifications to that sample. It would therefore be more desirable to generate different neural networks based on the classification itself. For a classification task with two classes, the penalty function can be defined as

$$p_{cla,i}(n) = \bigl(F_i(n) - 0.5\bigr)\bigl(F(n) - 0.5\bigr) \qquad (9)$$

The minimisation of $p_{cla,i}(n)$ would make $F_i(n)$ give a classification different from that of the ensemble on some data samples.

In order to simplify the calculation of the partial derivative of $E_i$, we can assume that the output of the ensemble, $F(n)$, has a constant value with respect to $F_i(n)$. The partial derivative of $E_i$ with respect to the output of individual network $i$ on the $n$th training pattern is

$$\frac{\partial E_i(n)}{\partial F_i(n)} = F_i(n) - y(n) + \lambda\bigl(F(n) - 0.5\bigr) \qquad (10)$$

For $\lambda = 1$, from Eq. (10) we get

$$\frac{\partial E_i(n)}{\partial F_i(n)} = \left(F_i(n) - \frac{y(n) + 0.5}{2}\right) + \left(F(n) - \frac{y(n) + 0.5}{2}\right)$$

In this case, the minimisation of $E_i(n)$ would make both $F_i$ and $F$ close to the value of $\frac{y(n)+0.5}{2}$, which is a changed target output.
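Switching to the classification-based penalty changes only the error signal that is backpropagated, from Eq. (4) to Eq. (10). A minimal sketch, reusing the hypothetical `SimpleMLP` networks from the earlier fragment:

```python
import numpy as np

def ncl_epoch_cla(nets, X, y, lam=1.0):
    """One epoch of pattern-by-pattern negative correlation learning with
    the classification-based penalty p_cla of Eq. (9); the error signal
    follows Eq. (10). 'nets' are the hypothetical SimpleMLP networks from
    the previous sketch."""
    for x_n, y_n in zip(X, y):
        F_i = np.array([net.forward(x_n) for net in nets])  # individual outputs
        F = F_i.mean()                                        # ensemble output, Eq. (1)
        for i, net in enumerate(nets):
            # dE_i/dF_i = (F_i - y) + lambda (F - 0.5)
            net.backward((F_i[i] - y_n) + lam * (F - 0.5))
```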


III. EXPERIMENTAL STUDIES

This section describes the application of negative correlation learning to the Australian credit card assessment problem. The problem is to assess applications for credit cards based on a number of attributes. There are 690 patterns in total. The output has two classes. The 14 attributes include 6 numeric values and 8 discrete ones, the latter having from 2 to 14 possible values. The Australian credit card assessment problem is a classification problem, which differs from regression-type tasks, such as the chlorophyll-a prediction problem, whose outputs are continuous. The data set was obtained from the UCI machine learning benchmark repository. It is available by anonymous ftp at ics.uci.edu (128.195.1.1) in the directory /pub/machine-learning-databases.

1) Experimental Setup: The data set was partitioned into two sets: a training set and a testing set. The first 518 examples were used for the training set, and the remaining 172 examples for the testing set. The input attributes were rescaled to between 0.0 and 1.0 by a linear function. The output attributes were encoded using a 1-of-m output representation for m classes. The output with the highest activation designated the class.
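A minimal preprocessing sketch of the setup described above; the function name is hypothetical, and the discrete attributes are assumed to have already been coded numerically.

```python
import numpy as np

def prepare_credit_data(X_raw, y_raw):
    """Hypothetical preprocessing mirroring the setup above: rescale each
    input attribute linearly to [0, 1], encode the output with a 1-of-m
    representation (m = 2 here), and split the 690 patterns into the first
    518 for training and the remaining 172 for testing."""
    X = np.asarray(X_raw, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    classes = np.unique(np.asarray(y_raw))
    Y = (np.asarray(y_raw)[:, None] == classes[None, :]).astype(float)
    return (X[:518], Y[:518]), (X[518:], Y[518:])
```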

The ensemble architecture used in the experiments has four networks. Each individual network is a feedforward network with one hidden layer. All the individual networks have ten hidden nodes. The number of training epochs was set to 1000.

2) Experimental Results: Table I shows the average results of negative correlation learning over 25 runs. Each run of negative correlation learning started from different initial weights. Simple averaging was first applied to decide the output of the ensemble system. For simple averaging, the results of negative correlation learning with $p_{ave}$ were slightly worse than those of negative correlation learning with $p_{cla}$.

In simple averaging, all the individual networks have the same combination weights and are treated equally. However, not all the networks are equally important. Because the different individual networks created by negative correlation learning were able to specialise to different parts of the testing set, only the outputs of these specialists should be considered when making the final decision of the ensemble for that part of the testing set. In this experiment, a winner-takes-all method was applied to select such networks. For each pattern of the testing set, the output of the ensemble was decided only by the network whose output had the highest activation. Table I also shows the average results of negative correlation learning over 25 runs using the winner-takes-all combination method. The winner-takes-all combination method improved negative correlation learning with the penalty $p_{ave}$, because there were good and poor networks for each pattern in the testing set and winner-takes-all selected the best one. However, it did not improve negative correlation learning with the penalty $p_{cla}$. It is likely that the changed target output in negative correlation learning based on the classification does not favour the winner-takes-all combination method.
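The two combination methods can be summarised in a short sketch. The function below is a hypothetical illustration of simple averaging versus winner-takes-all as described above, not code from the paper.

```python
import numpy as np

def ensemble_decide(outputs, method="average"):
    """Decide the ensemble class for one pattern from the individual
    1-of-m output vectors (shape: M networks x m classes).
    'average': simple averaging of the outputs, then take the argmax.
    'winner':  winner-takes-all; the network containing the single highest
               activation decides the class on its own."""
    outputs = np.asarray(outputs)
    if method == "average":
        return int(outputs.mean(axis=0).argmax())
    winner = np.unravel_index(outputs.argmax(), outputs.shape)[0]
    return int(outputs[winner].argmax())

# Example: four networks, two output classes.
outs = [[0.62, 0.38], [0.48, 0.52], [0.91, 0.09], [0.55, 0.45]]
print(ensemble_decide(outs, "average"), ensemble_decide(outs, "winner"))
```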

In order to see how different the neural networks generated by negative correlation learning are, we compared the outputs of the individual networks trained with the different penalty terms. Two notions were introduced to analyse negative correlation learning: the correct response sets of the individual networks and their intersections. The correct response set $S_i$ of individual network $i$ on the testing set consists of all the patterns in the testing set which are classified correctly by individual network $i$. Let $\Omega_i$ denote the size of set $S_i$, and $\Omega_{i_1 i_2 \cdots i_k}$ denote the size of set $S_{i_1} \cap S_{i_2} \cap \cdots \cap S_{i_k}$.
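The correct response sets and their intersection sizes can be computed directly from the per-network predictions. The following sketch, with hypothetical function names and toy data, illustrates how values such as the $\Omega$ entries reported in Table II are obtained.

```python
from itertools import combinations

def correct_response_sets(predictions, targets):
    """S_i: the indices of testing patterns classified correctly by
    network i. 'predictions' holds one list of predicted classes per
    network; 'targets' holds the true classes."""
    return [
        {n for n, (p, t) in enumerate(zip(pred, targets)) if p == t}
        for pred in predictions
    ]

def intersection_sizes(sets):
    """Omega_{i1...ik}: the sizes of all intersections of the correct
    response sets, as reported in Table II."""
    sizes = {}
    for k in range(1, len(sets) + 1):
        for idx in combinations(range(len(sets)), k):
            inter = set.intersection(*(sets[i] for i in idx))
            sizes["".join(str(i + 1) for i in idx)] = len(inter)
    return sizes

# Toy example with four networks and six testing patterns.
targets = [0, 1, 1, 0, 1, 0]
preds = [[0, 1, 1, 0, 0, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 1],
         [0, 1, 1, 1, 1, 0]]
print(intersection_sizes(correct_response_sets(preds, targets)))
```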

Table II shows the sizes of the correct response sets of the individual networks and their intersections on the testing set, where the individual networks were created by negative correlation learning with the penalty $p_{ave}$ and with the penalty $p_{cla}$, respectively. It is evident from Table II that the different individual networks created by negative correlation learning with the penalty $p_{ave}$ were able to specialise to different parts of the testing set.


For instance, in negative correlation learning with $p_{ave}$, the sizes of the correct response sets $S_1$ and $S_3$ in Table II were 147 and 138, respectively, but the size of their intersection $S_1 \cap S_3$ was 126. The size of $S_1 \cap S_2 \cap S_3 \cap S_4$ was only 116. In contrast, the individual networks in the ensemble created by negative correlation learning with $p_{cla}$ were less different.

TABLE I
Comparison of error rates of negative correlation learning with the penalty pave (λ = 1.0) and the penalty pcla (λ = 1.0 and 5.0) on the Australian credit card assessment problem. The results were averaged over 25 runs. "Simple Averaging" and "Winner-Takes-All" indicate the two combination methods used in negative correlation learning. Mean, SD, Min and Max indicate the mean value, standard deviation, minimum and maximum value, respectively.

                            Simple Averaging        Winner-Takes-All
Error Rate                  Training    Test        Training    Test
pave (λ = 1.0)    Mean      0.0679      0.1323      0.1220      0.1293
                  SD        0.0078      0.0072      0.0312      0.0099
                  Min       0.0463      0.1163      0.0946      0.1105
                  Max       0.0772      0.1454      0.1448      0.1512
pcla (λ = 1.0)    Mean      0.0589      0.1293      0.0598      0.1330
                  SD        0.0057      0.0083      0.0063      0.0121
                  Min       0.0463      0.1105      0.0502      0.1105
                  Max       0.0714      0.1512      0.0753      0.1686
pcla (λ = 5.0)    Mean      0.0750      0.1377      0.0758      0.1367
                  SD        0.0048      0.0080      0.0059      0.0099
                  Min       0.0676      0.1221      0.0656      0.1279
                  Max       0.0888      0.1570      0.0869      0.1628

TABLE II
The sizes of the correct response sets of the individual networks created by negative correlation learning with the penalty pave (λ = 1.0) and the penalty pcla (λ = 1.0 and 5.0) on the testing set, and the sizes of their intersections, for the Australian credit card assessment problem. The results were obtained from the first run among the 25 runs.

pave (λ = 1.0)    Ω1 = 147      Ω2 = 150      Ω3 = 138      Ω4 = 142      Ω12 = 142
                  Ω13 = 126     Ω14 = 136     Ω23 = 125     Ω24 = 136     Ω34 = 123
                  Ω123 = 121    Ω124 = 134    Ω134 = 118    Ω234 = 118    Ω1234 = 116
pcla (λ = 1.0)    Ω1 = 145      Ω2 = 147      Ω3 = 142      Ω4 = 144      Ω12 = 140
                  Ω13 = 137     Ω14 = 140     Ω23 = 136     Ω24 = 140     Ω34 = 135
                  Ω123 = 132    Ω124 = 136    Ω134 = 133    Ω234 = 131    Ω1234 = 129
pcla (λ = 5.0)    Ω1 = 150      Ω2 = 146      Ω3 = 139      Ω4 = 145      Ω12 = 139
                  Ω13 = 136     Ω14 = 142     Ω23 = 130     Ω24 = 139     Ω34 = 135
                  Ω123 = 128    Ω124 = 136    Ω134 = 133    Ω234 = 129    Ω1234 = 127

IV. CONCLUSIONS

This paper describes negative correlation learning for generating different neural networks in an ensemble. It can be regarded as one way of decomposing a large problem into smaller and specialised ones, so that each subproblem can be dealt with by an individual neural network relatively easily. Two penalty terms in the error function were proposed to encourage the formation of different neural networks.

The experimental results on a classification task show that the penalty function based on the average output of an ensemble tends to generate different neural networks. In contrast, the penalty function based on the classification did not work as well as had been expected at generating more different neural networks. More study of the two penalty functions is needed to shed light on how to design better penalty functions in negative correlation learning.

REFERENCES

[1] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(10):993-1001, 1990.
[2] A. J. C. Sharkey. On combining artificial neural nets. Connection Science, 8:299-313, 1996.
[3] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
[4] R. A. Jacobs and M. I. Jordan. A competitive modular connectionist architecture. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 767-773. Morgan Kaufmann, San Mateo, CA, 1991.
[5] R. A. Jacobs, M. I. Jordan, and A. G. Barto. Task decomposition through competition in a modular connectionist architecture: the what and where vision task. Cognitive Science, 15:219-250, 1991.
[6] R. A. Jacobs. Bias/variance analyses of mixture-of-experts architectures. Neural Computation, 9:369-383, 1997.
[7] H. Drucker, C. Cortes, L. D. Jackel, Y. LeCun, and V. Vapnik. Boosting and other ensemble methods. Neural Computation, 6:1289-1301, 1994.
[8] R. E. Schapire. The strength of weak learnability. Machine Learning, 5:197-227, 1990.
[9] H. Drucker, R. Schapire, and P. Simard. Improving performance in neural networks using a boosting algorithm. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 42-49. Morgan Kaufmann, San Mateo, CA, 1993.
[10] D. Sarkar. Randomness in generalization ability: a source to improve it. IEEE Trans. on Neural Networks, 7(3):676-685, 1996.
[11] R. T. Clemen and R. L. Winkler. Limits for the precision and value of information from dependent sources. Operations Research, 33:427-442, 1985.
[12] D. W. Opitz and J. W. Shavlik. Actively searching for an effective neural network ensemble. Connection Science, 8:337-353, 1996.
[13] B. E. Rosen. Ensemble learning using decorrelated neural networks. Connection Science, 8:373-383, 1996.
[14] Y. Liu and X. Yao. Negatively correlated neural networks can produce best ensembles. Australian Journal of Intelligent Information Processing Systems, 4:176-185, 1998.
[15] Y. Liu and X. Yao. A cooperative ensemble learning system. In Proc. of the 1998 IEEE International Joint Conference on Neural Networks (IJCNN'98), pages 2202-2207. IEEE Press, Piscataway, NJ, USA, 1998.
[16] Y. Liu and X. Yao. Simultaneous training of negatively correlated neural networks in an ensemble. IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(6):716-725, 1999.
[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1, pages 318-362. MIT Press, Cambridge, MA, 1986.


