
A survey of neural network ensembles

Ying Zhao, Jun Gao
Dept. of Computer and Information, Hefei University of Technology, Hefei 230009, China
Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China
E-mail: [email protected], [email protected]

Xuezhi Yang
Dept. of Computer and Information, Hefei University of Technology, Hefei 230009, China
E-mail: [email protected]

Abstract—A neural network ensemble combines a finite number of neural networks or other types of predictors that are trained for a common classification task. Compared with a single neural network, an ensemble can effectively improve the generalization ability of the classifier. The objective of this paper is to review existing research work on neural network ensembles, including an analysis of why ensembles are effective, the general steps for implementing an ensemble, and traditional techniques for training the component neural networks, together with a description of typical applications.

I. INTRODUCTION

Machine learning is the ability of a device to improve its performance based on past experience (Mitchell, 1997). However, poor generalization, which degrades measures of system performance such as the recognition rate, has become an increasing challenge in machine learning. Two main routes have been pursued to improve the generalization of a classification system: one is to combine the decisions of several classifiers rather than to use only the output of the best single classifier; the other is the Support Vector Machine. In this paper, the existing research work on neural network ensembles is reviewed in detail.

An ensemble is a collection of neural networks or other types of predictors that are trained for the same task (Sollich and Krogh, 1996). Figure 1 illustrates the basic framework of a neural network ensemble. Each network in the ensemble is first trained on the same training samples. The output of each network (o_i) is then combined to produce the output of the ensemble (o). Since each network makes generalization errors on different subsets of the input space, the collective decision produced by the ensemble is less likely to be in error than the decision made by any of the individual networks.

The neural network ensemble gained much attention as soon as it was put forward (Hansen and Salamon, 1990). At the workshop of the 1993 Neural Information Processing conference, an entire session named "put it together" was held especially for neural network ensembles. Recently it has become a very active topic in both the neural networks and


machine learning communities, and has already been successfully applied in diverse areas such as optical character recognition, face recognition, image analysis, and financial trend prediction.

Fig. 1. Basic framework of a neural network ensemble: each of the N component networks receives the same input, and their outputs o_1, ..., o_N are combined to produce the ensemble output o.

In the following section, we explain why ensembles can work well and introduce the design steps of a neural network ensemble. In the third section, traditional ensemble techniques are presented. Applications are described in the fourth section. Finally, conclusions are given in Section V.

II. ANALYSIS AND IMPLEMENTATION STEPS

A. Efficiency Analysis

Krogh and Vedelsby (1995) derived a formula for the ensemble error in the case of regression with a linearly weighted ensemble. Assume the task is to learn a function $f$, and that the training samples are drawn randomly from the distribution $p(x)$. Suppose that the ensemble consists of $N$ networks and that the output of network $\alpha$ is $V^{\alpha}(x)$; the final output of the ensemble is then defined as in Eq. (1):


$$V(x) = \sum_{\alpha} w_{\alpha} V^{\alpha}(x) \qquad (1)$$

The diversity (ambiguity) of an individual network on input $x$ is defined as $a^{\alpha}(x) = (V^{\alpha}(x) - V(x))^2$. The ensemble ambiguity on input $x$ is then:

$$\bar{a}(x) = \sum_{\alpha} w_{\alpha} a^{\alpha}(x) = \sum_{\alpha} w_{\alpha} (V^{\alpha}(x) - V(x))^2 \qquad (2)$$

The quadratic errors of network $\alpha$ and of the ensemble are, respectively, given in Eq. (3) and Eq. (4):

$$\epsilon^{\alpha}(x) = (f(x) - V^{\alpha}(x))^2 \qquad (3)$$

$$e(x) = (f(x) - V(x))^2 \qquad (4)$$

Expanding Eq. (2) and using Eqs. (3) and (4), another form of $e(x)$ is obtained:

$$e(x) = \sum_{\alpha} w_{\alpha} \epsilon^{\alpha}(x) - \bar{a}(x) \qquad (5)$$

We define $E^{\alpha}$, $A^{\alpha}$ and $E$ to be the averages, over the input distribution, of $\epsilon^{\alpha}(x)$, $a^{\alpha}(x)$ and $e(x)$ respectively, as shown in Eqs. (6) to (8):

$$E^{\alpha} = \int dx\, p(x)\, \epsilon^{\alpha}(x) \qquad (6)$$

$$A^{\alpha} = \int dx\, p(x)\, a^{\alpha}(x) \qquad (7)$$

$$E = \int dx\, p(x)\, e(x) \qquad (8)$$

Averaging Eq. (5) over the input distribution, the ensemble generalization error $E$ can be written as:

$$E = \bar{E} - \bar{A} \qquad (9)$$

where $\bar{E} = \sum_{\alpha} w_{\alpha} E^{\alpha}$ is the weighted average of the generalization errors of the individual networks, and $\bar{A} = \sum_{\alpha} w_{\alpha} A^{\alpha}$ is the weighted average of the ambiguities among these networks, which is a nonnegative value. Eq. (9) shows that an ideal ensemble consists of highly accurate networks that disagree as much as possible, and that the generalization error of the ensemble is never larger than the weighted average of the individual errors, that is $E \le \bar{E}$. In particular, for uniform weights:

$$E \le \frac{1}{N} \sum_{\alpha} E^{\alpha} \qquad (10)$$
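To make the decomposition concrete, the following sketch checks Eq. (9) numerically on synthetic data. The target function, noise level, and ensemble size are arbitrary choices for illustration, and the component "networks" are simply noisy approximations of f rather than trained models.

```python
# Numerical check of the ambiguity decomposition E = E_bar - A_bar (Eq. 9).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)        # samples drawn from p(x)
f = np.sin(np.pi * x)                        # target function f(x)

# Outputs V^alpha(x) of N imperfect "networks" (noisy approximations of f).
N = 5
V = np.stack([f + rng.normal(0.0, 0.3, size=x.shape) for _ in range(N)])

w = np.full(N, 1.0 / N)                      # uniform ensemble weights
V_ens = np.tensordot(w, V, axes=1)           # V(x) = sum_a w_a V^a(x)

E_a = ((f - V) ** 2).mean(axis=1)            # individual generalization errors E^a
A_a = ((V - V_ens) ** 2).mean(axis=1)        # individual ambiguities A^a
E_bar = np.dot(w, E_a)                       # weighted average error
A_bar = np.dot(w, A_a)                       # weighted average ambiguity
E = ((f - V_ens) ** 2).mean()                # ensemble generalization error

print(E, E_bar - A_bar)                      # the two values coincide: E = E_bar - A_bar
print(E <= E_bar)                            # True: ensemble no worse than the average member
```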

B. Design Steps

A neural network ensemble is constructed in two steps: first, a number of individual neural networks are designed; second, their predictions are combined according to certain rules.

1) Design the individual neural networks

Combining the outputs of several classifiers is useful only if they disagree on some inputs. Theoretical and empirical work has shown that an effective ensemble should consist of a set of networks that are not only highly accurate, but that also make their errors on different parts of the input space (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995). Generally speaking, the approaches for designing the individual networks fall into three groups. The first group obtains diverse individuals by varying the weight functions, architectures, network types, number of neurons in the hidden layer, learning algorithms, and initial states in weight space. The second group obtains diverse individuals by training them on different training sets, as in bagging, boosting, and cross validation (Breiman, 1996; Schapire, 1990; Krogh and Vedelsby, 1995). Both the first and the second group directly generate networks whose errors are uncorrelated. Partridge (1996) experimentally compared the capabilities of these methods and concluded that varying the network type and the training data are the two best ways to create ensembles of networks that make different errors.

The third group of approaches generates a large number of initial networks and then selects several uncorrelated ones from among them to form the ensemble. Opitz and Shavlik (1996) proposed an approach based on a genetic algorithm that searches for a highly diverse set of accurately trained networks; Lazarevic and Obradovic (2001) proposed a pruning algorithm to eliminate redundant classifiers; Zhou et al. (2001) described a selective approach for constructing ensembles; and a clustering-based selective neural network ensemble was proposed by Fu et al. (2005).
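As a small illustration of the first group of approaches, the sketch below builds a handful of members that differ in hidden-layer size and random initialization and then measures how often each pair of members disagrees. scikit-learn's MLPClassifier and a synthetic dataset are assumptions standing in for the component networks and the real task; all parameter values are arbitrary.

```python
# Diverse individuals via different architectures and initial states in weight space.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

configs = [(5,), (10,), (20,), (10, 10)]            # different hidden-layer sizes
members = [MLPClassifier(hidden_layer_sizes=h, max_iter=1000,
                         random_state=i).fit(X, y)  # different random initial weights
           for i, h in enumerate(configs)]

# Diversity check: fraction of inputs on which each pair of members disagrees.
preds = np.stack([m.predict(X) for m in members])
for i in range(len(members)):
    for j in range(i + 1, len(members)):
        print(i, j, (preds[i] != preds[j]).mean())
```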

2) Combining the outputs

When the ensemble is used for classification, voting is usually employed; the most powerful voting rules appear to be plurality and majority voting. It has been proved (Hansen and Salamon, 1990) that, for a neural network ensemble, if the average error rate for an example is less than 50% and the networks in the ensemble are independent in the production of their errors, then the expected error for that example can be reduced to zero as the number of networks combined goes to infinity. However, such assumptions rarely hold in practice, because the networks are not independent.

When the ensemble is used for regression, simple averaging and weighted averaging are commonly used. Opitz and Shavlik (1996) pointed out that simple averaging often performs better, since optimizing the combining weights can easily lead to over-fitting. Perrone and Cooper (1993) consider that weighted averaging performs better when each network avoids over-fitting by using a cross-validatory stopping rule. Sollich and Krogh (1996) found that in large ensembles one should use simple averaging; in this way, the globally optimal generalization error on the basis of all the available data can be reached by optimizing the training set sizes of the individual networks. For ensembles of more realistic size, optimizing the ensemble weights can still yield substantially better generalization performance than an optimally chosen single network trained on all the data with the same amount of training noise.

If the network outputs are interpreted as fuzzy membership values, belief values, or evidence values, then belief functions and Dempster-Shafer techniques are used for combination (Xu et al., 1992).
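The two simplest combination rules discussed above, plurality voting for classification and (weighted) averaging for regression, can be written in a few lines. The function names and toy inputs below are illustrative only.

```python
# Minimal combiners: plurality voting and simple / weighted averaging.
import numpy as np

def plurality_vote(class_preds):
    """class_preds: integer array of shape (N, n_samples), one row per network."""
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, class_preds)

def average_outputs(reg_preds, weights=None):
    """reg_preds: array of shape (N, n_samples); uniform weights give simple averaging."""
    if weights is None:
        weights = np.full(len(reg_preds), 1.0 / len(reg_preds))
    return np.tensordot(weights, reg_preds, axes=1)

# Example: three classifiers, two of which vote for class 1 on the first input.
print(plurality_vote(np.array([[1, 0], [1, 1], [0, 0]])))   # -> [1 0]
print(average_outputs(np.array([[0.9, 0.2], [1.1, 0.4]])))  # -> [1.0 0.3]
```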

III. TRADITIONAL TECHNIQUES FOR TRAINING INDIVIDUALS IN ENSEMBLES

Many approaches for designing the individuals in an ensemble have been developed. In this section, we focus on three representative methods.

A. Bagging (Bootstrap Aggregating)

Bagging (Breiman, 1996) consists of generating different datasets drawn at random with replacement from the original training set and then training the different networks of the ensemble on these different datasets. Some examples from the original training set may be repeated in a resulting training set while others may be left out. If perturbing the learning set can cause significant changes in the predictor constructed, as is the case for neural networks, decision trees, and linear regression, then bagging can improve accuracy.
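A minimal bagging sketch follows, assuming scikit-learn's MLPClassifier as a stand-in for the component networks and a synthetic dataset; the ensemble size and architecture are arbitrary. Each member sees a bootstrap sample drawn with replacement, so the members could also be trained in parallel, since the resampled sets do not depend on one another.

```python
# Bagging: each member is trained on a bootstrap sample of the original training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

N = 10
members = []
for _ in range(N):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
    members.append(MLPClassifier(hidden_layer_sizes=(10,),
                                 max_iter=1000).fit(X[idx], y[idx]))

# Combine the members by plurality vote.
preds = np.stack([m.predict(X) for m in members])
bagged = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
print("bagged training accuracy:", (bagged == y).mean())
```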

B. Boosting

Boosting is a general and provably effective method for improving the accuracy of any given learning algorithm. Working in Valiant's PAC learning model, Kearns and Valiant (1989) were the first to pose the question of whether a weak learning model that performs just slightly better than random guessing can be "boosted" into an arbitrarily accurate strong learning model. Schapire (1990) and Freund (1995) proved the equivalence between the strong and weak learning models, and gave boosting approaches that convert a weak learner directly into a strong one.

In a boosting ensemble, the distribution of a particular training set in the series over-represents the patterns that the earlier classifiers in the series recognize incorrectly. The individual classifiers are thus trained hierarchically to learn harder and harder parts of the classification problem.

Despite the potential benefits of boosting promised by the theoretical results, its true practical value can only be assessed by testing the method on "real" learning problems. The AdaBoost algorithm, introduced by Freund and Schapire (1995), solved many of the practical difficulties of the earlier boosting algorithms. Experiments (Freund and Schapire, 1996; Drucker and Cortes, 1996) have shown the effectiveness of AdaBoost. In this algorithm, successive networks are trained with a training set selected at random from the original training set, but the probability of selecting a pattern changes depending on whether the pattern was classified correctly and on the performance of the last trained network.
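The reweighting idea can be sketched as follows. This is an illustrative resampling variant in the spirit of AdaBoost for binary labels in {-1, +1}, not a faithful reproduction of the algorithm of Freund and Schapire; the base learner, data, and number of rounds are arbitrary stand-ins.

```python
# AdaBoost-style sketch (resampling variant, binary labels in {-1, +1}).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y01 = make_classification(n_samples=500, n_features=20, random_state=0)
y = 2 * y01 - 1                                   # relabel classes as -1 / +1
rng = np.random.default_rng(0)

T = 5
D = np.full(len(X), 1.0 / len(X))                 # pattern-selection distribution
members, alphas = [], []
for _ in range(T):
    idx = rng.choice(len(X), size=len(X), p=D)    # resample according to D
    net = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500).fit(X[idx], y[idx])
    pred = net.predict(X)
    err = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error of this member
    alpha = 0.5 * np.log((1 - err) / err)         # member weight in the final vote
    D *= np.exp(-alpha * y * pred)                # emphasize misclassified patterns
    D /= D.sum()
    members.append(net)
    alphas.append(alpha)

# Weighted-vote prediction of the boosted ensemble.
scores = sum(a * m.predict(X) for a, m in zip(alphas, members))
print("boosted training accuracy:", (np.sign(scores) == y).mean())
```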

Dietterich (2000) demonstrated that when the number of outliers is very large, the emphasis placed on the hard samples can become detrimental to the performance of AdaBoost. Friedman (2000) put forward a variant of AdaBoost, called "Gentle AdaBoost", that puts less emphasis on outliers. Freund (2001) suggested another algorithm, called "BrownBoost", which is an adaptive version of the boost-by-majority algorithm of Freund (1995) and demonstrates an intriguing connection between boosting and Brownian motion.

C. Cross-validation Ensemble

Cross-validation, a statistical technique for lowering correlations, was used by Krogh and Vedelsby (1995). They discussed a method to make even identical networks disagree: the individuals are trained on different training sets, with some examples held out for each individual during training. This has the added advantage that the held-out examples can be used for testing, so that good estimates of the generalization error can be obtained.
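A minimal sketch of this idea, assuming scikit-learn's KFold and MLPClassifier as stand-ins: each member is trained with one fold held out, and that fold supplies an estimate of the member's generalization error.

```python
# Cross-validation ensemble: one member per fold, held-out fold used for error estimation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

members, holdout_errors = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    net = MLPClassifier(hidden_layer_sizes=(10,),
                        max_iter=1000).fit(X[train_idx], y[train_idx])
    members.append(net)
    holdout_errors.append(1.0 - net.score(X[test_idx], y[test_idx]))  # generalization estimate

print("estimated generalization errors:", np.round(holdout_errors, 3))
```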

All of these methods use different training sets; they are effective and widely used in practice. Boosting encompasses a family of methods such as AdaBoost, Gentle AdaBoost, and BrownBoost. Like bagging, boosting relies on "resampling" techniques to obtain different training sets for each of the classifiers. Many researchers have compared bagging and boosting (Opitz and Maclin, 1997; Shen et al., 2000). In the bagging algorithm, each training set is generated by random drawing with replacement; there is no relationship between the different training sets, so the networks can be trained at the same time. In the boosting algorithm, each training set is chosen based on the performance of the earlier networks, so the networks must be trained consecutively. In conclusion, as a general technique, bagging is probably appropriate for most problems, but when properly applied, boosting may produce larger gains in accuracy.

IV. APPLICATIONS

Neural network ensembles were applied to handwritten digit recognition by Hansen and Salamon (1992). It was shown that the ensemble consensus outperforms the best individual of the ensemble by 20-25%. The first experiments with early boosting algorithms were carried out by Drucker et al. (1993) on an OCR task. Mao (1998) studied the effectiveness of three neural network ensembles in improving OCR performance: basic, bagging, and boosting. It was shown that the performance of the different methods depends on the reject rate of the recognizer.

Gutta (1996) described a novel approach to fully automated face recognition based on a hybrid architecture consisting of an ensemble of connectionist networks and inductive decision trees. Moreover, neural network ensembles have also been applied to view-invariant face recognition (Zhou et al., 2001); the ensemble can not only give the recognition result but also provide estimated view information.

A specific two-layer ensemble architecture, which

combines the advantages of adaptive resonance theory and field theory, was devised to identify lung cancer cells by Jiang et al. (2001).

Moreover, neural network ensembles have been applied to text classification (Schapire and Singer, 2000), seismic signal classification (Shimshoni and Intrator, 1998), financial decision making (West et al., 2005), etc.

V. CONCLUSIONS

A neural network ensemble is a very successful technique in which the outputs of a set of separately trained neural networks are combined to form one unified prediction. First, it can greatly improve the generalization performance of a classification system. Second, it can be viewed as an effective approach to neural computing because of its wide range of potential applications and demonstrated validity. Third, it can turn the presence of local minima to advantage: the individuals in a neural network ensemble are expected to converge to different local minima of the error surface, which increases the diversity of the ensemble. Although neural network ensembles have been widely used, the key problem for researchers remains how to effectively design individual networks that are not only highly accurate but also as different from one another as possible.

ACKNOWLEDGEMENT

This project is supported by the National Natural Science Foundation of China (No. 60175011, No. 60375011), the Natural Science Foundation of Anhui Province (No. 04042044), and the Program for New Century Excellent Talents in University (NCET).

REFERENCES

[1] A. Krogh, J. Vedelsby, "Neural network ensembles, cross validation, and active learning," Advances in Neural Information Processing Systems, vol. 8, pp. 231-238, 1995.

[2] A. Lazarevic, Z. Obradovic, "Effective pruning of neural network classifier ensembles," Proc. International Joint Conference on Neural Networks, 2001, vol. 2, pp. 796-801.

[3] D. Opitz, J. Shavlik, "Actively searching for an effective neural network ensemble," Connection Science, vol. 8(3-4), pp. 337-353, 1996.

[4] D. Partridge, "Network generalization differences quantified," Neural Networks, vol. 9, pp. 263-271, 1996.

[5] D.W. Opitz, R.F. Maclin, "An empirical evaluation of bagging and boosting for artificial neural networks," International Conference on Neural Networks, vol. 3, pp. 1401-1405, 1997.

[6] D. West, S. Dellana, J. Qian, "Neural network ensemble strategies for financial decision applications," Computers & Operations Research, vol. 32(10), pp. 2543-2559, 2005.

[7] H. Drucker, R. Schapire, P. Simard, "Boosting performance in neural networks," International Journal of Pattern Recognition and Artificial Intelligence, vol. 7(4), pp. 705-719, 1993.

[8] H. Drucker, C. Cortes, "Boosting decision trees," Advances in Neural Information Processing Systems, vol. 8, pp. 479-485, 1996.

[9] J. Friedman, T. Hastie, R. Tibshirani, "Additive logistic regression: A statistical view of boosting," The Annals of Statistics, vol. 38(2), pp. 337-374, April 2000.

[10] J. Mao, "A case study on bagging, boosting and basic ensembles of neural networks for OCR," Proc. IEEE International Joint Conference on Neural Networks, Anchorage, AK, 1998, vol. 3, pp. 1828-1833.

[11] K. Tumer, J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, special issue on combining artificial neural networks: ensemble approaches, vol. 8(3-4), pp. 385-404, December 1996.

[12] L. Breiman, "Bagging predictors," Machine Learning, vol. 24(2), pp. 123-140, 1996.

[13] L. Hansen, L. Liisberg, P. Salamon, "Ensemble methods for handwritten digit recognition," Proc. 1992 IEEE Workshop on Neural Networks for Signal Processing, Copenhagen, Denmark, 1992, pp. 333-342.

[14] L. Hansen, P. Salamon, "Neural network ensembles," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, pp. 993-1001, 1990.

[15] L. Xu, A. Krzyzak, C.Y. Suen, "Methods for combining multiple classifiers and their applications to handwriting recognition," IEEE Trans. on Systems, Man, and Cybernetics, vol. 22, pp. 418-435, 1992.

[16] M. J. Kearns, L. G. Valiant, "Cryptographic limitations on learning Boolean formulae and finite automata," Proceedings of the Twenty-first Annual ACM Symposium on Theory of Computing, New York, 1989, pp. 443-444.

[17] M. P. Perrone, L. N. Cooper, "When networks disagree: Ensemble methods for neural networks," Artificial Neural Networks for Speech and Vision, pp. 126-142, 1993.

[18] P. Sollich, A. Krogh, "Learning with ensembles: How over-fitting can be useful," Advances in Neural Information Processing Systems, vol. 8, Cambridge, pp. 190-196, 1996.

[19] Q. Fu, S.X. Hu, S.Y. Zhao, "Clustering-based selective neural network ensemble," Journal of Zhejiang University (Science), vol. 6(5), pp. 387-392, 2005.

[20] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5(2), pp. 197-227, 1990.

[21] R. E. Schapire, Y. Singer, "BoosTexter: A boosting-based system for text categorization," Machine Learning, vol. 39(2-3), pp. 135-168, 2000.

[22] S. Gutta, H. Huang, F. Iman, H. Wechsler, "Face and hand gesture recognition using hybrid classifiers," Second International Conference on Automatic Face and Gesture Recognition, Killington, VT, pp. 164-169, 1996.

[23] T. Mitchell, Machine Learning, McGraw-Hill, 1997.

[24] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization," Machine Learning, vol. 40(2), pp. 139-158, 2000.

[25] X.H. Shen, Z.H. Zhou, J.X. Wu, Z.Q. Chen, "Survey of boosting and bagging," Computer Engineering and Applications, vol. 36(12), pp. 31-32, 2000.

[26] Y. Freund, "An adaptive version of the boost by majority algorithm," Machine Learning, vol. 43(3), pp. 293-318, 2001.

[27] Y. Freund, "Boosting a weak learning algorithm by majority," Information and Computation, vol. 121(2), pp. 256-285, 1995.

[28] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55(1), pp. 119-139, 1997.

[29] Y. Freund, R. E. Schapire, "Experiments with a new boosting algorithm," Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148-156, 1996.

[30] Y. Jiang, Z.H. Zhou, Q. Xie, Z.Y. Chen, "Applications of neural network ensemble in lung cancer cell identification," Journal of Nanjing University (Science), vol. 37(5), pp. 529-534, 2001.

[31] Y. Shimshoni, N. Intrator, "Classification of seismic signals by integrating ensembles of neural networks," IEEE Trans. Signal Processing, vol. 46(5), pp. 1194-1201, 1998.

[32] Z. H. Zhou, F.J. Huang, H.J. Zhang, Z.H. Chen, "View-invariant face recognition based on neural network ensemble," Journal of Computer Research and Development, vol. 38(10), pp. 1204-1210, 2001.

[33] Z. H. Zhou, Y.J. Wu, S.F. Chen, "Genetic algorithm based selective neural network ensemble," Proc. 17th International Joint Conference on Artificial Intelligence, vol. 2, pp. 797-802, 2001.