6
Proceedings of Intemational Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 Effective neural network pruning using cross-validation Thuan Q. Huynh School of Computing National University of Singapore Singapore 117543 E-mail: [email protected] Abstract-This paper addresses the problem of finding neural networks with optimal topology such that their generalization capability is maximized. Our approach is to combine the use of a penalty function during network training and a subset of the training samples for cross-validation. The penalty is added to the error function so that the weights of network connections that are not useful have small magnitude. Such network connections can be pruned if the resulting accuracy of the network does not change beyond a preset level. Training samples in the cross- validation set are used to indicate when network pruning is terminated. Our results on 32 publicly available data sets show that the proposed method outperforms existing neural network and decision tree methods for classification. I. INTRODUCTION Multilayer feedforward neural networks have been shown to be very useful for solving many real-world problems both for classification as well as for regression. However, determining the appropriate network size is still one of the most difficult problems to address in neural network application. An over- sized network will learn from the training sample fast, but it tends to memorize the training samples without obtaining the capability for generalization and thus, it could perform very badly when presented with the data samples. On the other hand, an undersized network may not attain sufficiently small error on the training samples. There are two types of methods for building neural networks with optimal size. The first type starts with a small network and grows this network by adding hidden units or hidden layer as learning progresses. The second type starts with a large network with more than the necessary number of hidden units and then prunes off the weights or units that are irrelevant for the learning. The cascade correlation algorithm [1], the self-organizing neural network [2], the MPyramid-real and the MT1ling-real [3] are algorithms which dynamically construct the network starting from a minimal network. There is also much work in removing weights and units from oversized networks [4], [5], [6]. In this paper, we restrict our discussion to classification problems and neural networks with just one single hidden layer. Hence, the problem of finding the appropriate topology of the network can be simplified as the problem of determining the optimal number of hidden units and connections in the Rudy Setiono School of Computing National University of Singapore Singapore 117543 E-mail: [email protected] network. The proposed method has two phases. The first phase is similar to the algorithm N2P2F [7], the network is trained with the penalty term added to the error function. Networks connections that satisfy certain conditions for their removal are pruned off. The conditions for pruning depend on the magnitudes of the connection weights as well as accuracy rates on both the training set and the cross-validation set. When there are no more network connections that satisfy these conditions, the second phase of the method is started by searching for hidden units that could possibly be removed. The method sorts the hidden units based on their impact on overall classification rate when the hidden unit is removed from the network individually. Starting from the hidden unit that have the least impact, the network is retrained without this hidden unit. This process is continued as long as there are hidden units which can be pruned. The outline of this paper is as follows. Section 2 describes the details of proposed pruning algorithm. Experimental results are presented in Section 3. Summary and conclusion are presented in Section 4. II. THE ALGORITHM At the start of the algorithm, we assume that we have a fully connected one hidden layer backpropagation neural network as shown in Figure 1. vpm Outpu layer pmi Inputlat Fig. 1. A fully connected feedforward neural network. Once the network has been trained to meet a certain error condition, the pruning process can be started. Network training is achieved by minimizing the following cross-entropy error 0-7803-9048-2/05/$20.00 02005 IEEE 972

[IEEE International Joint Conference on Neural Networks 2005 - Montreal, Que., Canada (31 July-4 Aug. 2005)] Proceedings. 2005 IEEE International Joint Conference on Neural Networks,

  • Upload
    r

  • View
    212

  • Download
    0

Embed Size (px)

Citation preview

Page 1: [IEEE International Joint Conference on Neural Networks 2005 - Montreal, Que., Canada (31 July-4 Aug. 2005)] Proceedings. 2005 IEEE International Joint Conference on Neural Networks,

Proceedings of Intemational Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

Effective neural network pruning using

cross-validationThuan Q. Huynh

School of ComputingNational University of Singapore

Singapore 117543E-mail: [email protected]

Abstract-This paper addresses the problem of finding neuralnetworks with optimal topology such that their generalizationcapability is maximized. Our approach is to combine the use ofa penalty function during network training and a subset of thetraining samples for cross-validation. The penalty is added to theerror function so that the weights of network connections thatare not useful have small magnitude. Such network connectionscan be pruned if the resulting accuracy of the network does notchange beyond a preset level. Training samples in the cross-validation set are used to indicate when network pruning isterminated. Our results on 32 publicly available data sets showthat the proposed method outperforms existing neural networkand decision tree methods for classification.

I. INTRODUCTION

Multilayer feedforward neural networks have been shown tobe very useful for solving many real-world problems both forclassification as well as for regression. However, determiningthe appropriate network size is still one of the most difficultproblems to address in neural network application. An over-sized network will learn from the training sample fast, but ittends to memorize the training samples without obtaining thecapability for generalization and thus, it could perform verybadly when presented with the data samples. On the otherhand, an undersized network may not attain sufficiently smallerror on the training samples.

There are two types of methods for building neural networkswith optimal size. The first type starts with a small networkand grows this network by adding hidden units or hidden layeras learning progresses. The second type starts with a largenetwork with more than the necessary number of hidden unitsand then prunes off the weights or units that are irrelevant forthe learning.The cascade correlation algorithm [1], the self-organizing

neural network [2], the MPyramid-real and the MT1ling-real[3] are algorithms which dynamically construct the networkstarting from a minimal network. There is also much work inremoving weights and units from oversized networks [4], [5],[6].

In this paper, we restrict our discussion to classificationproblems and neural networks with just one single hiddenlayer. Hence, the problem of finding the appropriate topologyof the network can be simplified as the problem of determiningthe optimal number of hidden units and connections in the

Rudy SetionoSchool of Computing

National University of SingaporeSingapore 117543

E-mail: [email protected]

network. The proposed method has two phases. The first phaseis similar to the algorithm N2P2F [7], the network is trainedwith the penalty term added to the error function. Networksconnections that satisfy certain conditions for their removalare pruned off. The conditions for pruning depend on themagnitudes of the connection weights as well as accuracyrates on both the training set and the cross-validation set.When there are no more network connections that satisfythese conditions, the second phase of the method is startedby searching for hidden units that could possibly be removed.The method sorts the hidden units based on their impact onoverall classification rate when the hidden unit is removedfrom the network individually. Starting from the hidden unitthat have the least impact, the network is retrained withoutthis hidden unit. This process is continued as long as thereare hidden units which can be pruned.The outline of this paper is as follows. Section 2 describes

the details of proposed pruning algorithm. Experimental resultsare presented in Section 3. Summary and conclusion arepresented in Section 4.

II. THE ALGORITHM

At the start of the algorithm, we assume that we have a fullyconnected one hidden layer backpropagation neural network asshown in Figure 1.

vpm Outpu layer

pmi

Inputlat

Fig. 1. A fully connected feedforward neural network.

Once the network has been trained to meet a certain errorcondition, the pruning process can be started. Network trainingis achieved by minimizing the following cross-entropy error

0-7803-9048-2/05/$20.00 02005 IEEE 972

Page 2: [IEEE International Joint Conference on Neural Networks 2005 - Montreal, Que., Canada (31 July-4 Aug. 2005)] Proceedings. 2005 IEEE International Joint Conference on Neural Networks,

function [8] augmented with a penalty term:

E tS tpdlogpd+ ( -tpd)log(l -Opd) +O(w)dET pEoutputs

(1)where:

* T is the set of training samples.* tpd E {0, 1} is the target value for pattern Xd at output

unit p* opd is the output value for pattern Xd at output unit p* w is the vector of all weights in the network* 6(w) is the penalty term added

The activation of hidden unit m is calculated as the hyperbolictangent function 6(y) = (eY e-Y)/(eY+e-Y) of the weightedsum of inputs:

(N )Amd = (6 WmLXId)

The predicted value at output unit p is calculated as thesigmoid function o(y) = 1/(1 + e-Y) of the weighted sum ofhidden activation values:

H\

Opd = a 7pmAm + bp)

where ZvmI and iypm are the weights connecting input unit eto hidden unit mm and from hidden unit m to output unit prespectively; and bp is the bias value for output unit p.The penalty function 9(w) is defined as follows:

9(w)=Stl (Sl'j2) + (z w~) (2)

where w is the vector of all weights in the network, wi is thevector of network weights that are connected to hidden unit i.We remove any miv. that satisfies:

(3)maxjvpmwml < 471.p

and any Vpm that satisfies:

IVpmI < 471if no weight satisfies the above criteria then remove wmL withsmallest

max IVpmWml Ip (5)

In all experiments rj is set to 0.1.The outline of the algorithm is as follows:

Algorithm NN Pruning with Cross-validation (NNPCv)Input:T: A set of training samplesC: A set of cross-validation samples

Objective: A pruned feedforward neural network with goodgeneralization capability.

Step 1 Let Nl be a network with N input units, M outputunits, and H hidden units.

(4)

Step 2.1nitialize the connection weights ofN randomly andtrain this network. Let the accuracy rates of thetrained networks on the sets T and C be AT andAC, respectively.

Step 3.Remove the weights satisfying the pruning condi-tions (3), (4) or (5) above and retrain the network.Let the new network be Ni and accuracy rates ofthe trained networks on the sets T, C be ATr, AC,respectively.

Step 41f (ATl + Aci) > (AT + AC), then1) Set X :=AI1.2) Let At := AT1,AC := Aci3) Goto Step 3

else Goto step 5Step 5.Identify hidden unit for possible removal.

1) Let min be the hidden unit which network withthat unit removed has the highest accuracy rate:

mh = argmaxm=1,2,. H (Afm+ m)

where A+4m and A&m are the accuracy rates ofthe network with hidden unit m removed overtraining and validation set.

2) Adjust the bias weight to each output unit bp byadding an amount equal to average activation ofhidden unit m4i over all training examples:

bp := bp + vphAm (6)where Am is the average value of the activa-tions of all training samples at hidden unit m

3) Let NAf, be the network with all its weightsconnecting to hidden unit mr set to 0 and theremaining weights copied from N.

Step 6.Retrain the pruned network.1) Retrain the network NAm.2) Let ATT and bACm be the accuracy rates of

NmJV, on the sets T and C, respectively.3) If (ATm±+ACj) > (A- +±Ac), then

a) Set :=NJhand H := H-1.b) Set AT := ATm and Ac := Acm.c) Goto Step 5

Step 7.Output the network N/' as the final pruned network.

To reduce the likelihood of overfitting, we add the penalty9(w) to the error function. This penalty term penalizes weightswith very large magnitude and drives small weights to zero.The latter can be removed without affecting the accuracy ofthe network too adversely.

Phase 1 of the algorithm from step 1 to step 4 is similaras our earlier pruning method [7]. It has been shown thatremoving any network weights that satisfy conditions (3)or (4) does not affect network's accuracy on the trainingdata samples. However, in our original implementation of thepruning algorithm, we did not check how the accuracy on thecross-validation is affected. Instead, a pre-set threshold valuewas required and if the accuracy of the network drops below

973

Page 3: [IEEE International Joint Conference on Neural Networks 2005 - Montreal, Que., Canada (31 July-4 Aug. 2005)] Proceedings. 2005 IEEE International Joint Conference on Neural Networks,

TABLE I

THE DATA SETS USED IN THE EXPERIMENT

Data set T Size Missing Attributes Neural Network

IValues (%) Continuous Binary Nominal Inputs Outputs

annealaudiologyaustralianautosbalance-scalebreast-cancerbreast-wgermanglassglass (G2)heart-cheart-hheart-statloghepatitishorse-colichypothyroidionosphereiriskr-vs-kplaborlymphographypima-indiansprimary-tumorsegmentsicksonarsoybeanvehiclevotevowelwaveform-noisezoo

8982266902056252866991000214163302294270155368

3772351150

319657148768339

23103772208683846435990

5000101

0.02.00.61.10.00.30.30.00.00.00.2

20.40.05.6

23.85.50.00.00.03.90.00.03.90.05.50.09.80.05.60.00.00.0

6 14 180 61 86 4 515 4 64 0 00 3 69 0 06 3 49 0 09 0 06 3 46 3 413 0 06 13 07 2 137 20 2

33 1 04 0 00 34 28 3 50 9 68 0 00 14 319 0 07 20 2

60 0 00 16 1918 0 00 16 010 2 140 0 0

1 15 0

82954573S

5110621010232514306234355

4130399

372033611351933284117

S2426322262222224232242

217221942

1137

this threshold, pruning is stopped. In this algorithm, we alsocheck the accuracy of the network on both training and cross-validation in Step 4.

After the completion of this phase, we notice that thecontribution of the hidden units to the classification accuracyvaries widely. Phase 2 attempts to identify those hiddenunits that do not contribute significantly to classification andgeneralization. This is achieved by ranking the accuracy of thenetwork with one hidden unit removed. We alleviate the affectof setting the connections to and from a hidden to zero on theoverall network accuracy by modifying the output unit bias as

given by Eq. (6) That is, we add to the bias a value that is equalto the connection weight vpm multiplied by the average valueof the hidden unit activation values of all training samples (cf.Eq. 6).As the network needs to be retrained many times during the

pruning process, it is important that an efficient method is ap-plied to find a local minimum of the error function. Unlike theback-propagation method, the quasi-Newton method convergesto a local minimum of the error function with a super-linearconvergent rate. We have used a variant of the quasi-Newtonmethod, that is the BFGS method in our implementation [9].

974

Page 4: [IEEE International Joint Conference on Neural Networks 2005 - Montreal, Que., Canada (31 July-4 Aug. 2005)] Proceedings. 2005 IEEE International Joint Conference on Neural Networks,

TABLE II

THE NUMBER OF WEIGHTS AND THE ACCURACY RATES OF THE NEURAL NETWORKS OBTAINED FROM THE PRUNING METHOI) AFrER TWO PHASES.

FIGURES SHOWN ARE THE AVERAGES AND THE STANDARD DEVIATIONS FROM 10 TEN-FOLD CROSS-VALIDATION RUNS. A BLACK DOT DENOTES THAT

PHASE 2 HAS REDUCED MORE THAN 20% WEIGHTS FROM THE NETWORK IN IHASE I

Data set Fully connected Phase I [ Final no. of weights j Phase I test acc. Test acc.

339.6± 111.29901.1 ± 138.49136.6 ± 31.78

320.61 ± 61.8947 ± 1.73

117.2 ± 17.2735.3 ± 3.95

221.3 ± 60.89158.81 20.28

66.2 ± 9.391.8 ± 17.9298.5 ± 24.22110.8 ± 11.59113 ± 25.79

294.3 ± 98.54105.7 ± 26.46231.2 ± 75.21

34 ± 6.63173.6 ± 44.555.3 ± 15.68103 32.4179.2 ± 4.64

511.7 ± 69.96224.4 ± 36.02114.9 ± 22.8757.3 4.22

388.4 ± 20.92237.7 ± 30.8676.1 ± 36.59

319.7 ± 30.41182 ± 11.77

111.4 ± 29.13

186.5 ± 74.01 -469.2 ± 45.74 -

50 ± 10.07 i293.11 ± 58.22

44.3 ± 2.5115.7 ± 17.5535.3 ± 3.95169.3 ± 50.04 -98.7 ± 12.52 -42.1 ± 2.53 i

68.41 ± 8.0686.2 ± 21.2199.2 ± 10.56100.3 ± 24.02149.1 ± 58.78 i66.1 ± 6.9767.2 ± 6.59 i23 ± 1.49

92.43 ± 8.66 i30.1 ± 10.6655.8 ± 11.35 *

71.7 ± 4.55377.4 ± 61.54 -173.78 ± 16.78 -

69.7 ± 12.352.4 ± 4.44

388.4 ± 20.92214 ± 2530.5 ± 2.68

319.7 ± 30.41182 ± 11.7765.8 ± 5.43

97.79 ± 0.5071.08 ± 2.3581.45 ± 2.0346.50 ± 5.6693.96 ± 1.0170.64 ± 1.9596.87 ± 0.6472.70 ± 1.3064.50 ± 2.6282.70 ± 2.8180.19 ± 1.1778.79 ± 1.7678.89 ± 2.8379.17 ± 2.8780.91 ± 2.7598.49 ± 0.2890.03 ± 1.3097.34 ± 1.0499.25 ± 0.1687.50 ± 3.2676.82 ± 3.3975.48 ± 1.1543.66 ± 3.2396.54 ± 0.5397.62 ± 0.1379.15 ± 2.3893.27 ± 1.0374.84 ± 1.2296.08 ± 0.7586.07 ± 1.1783.20 ± 0.8592.10 ± 2.37

98.24 ± 0.6673.97 ± 3.2385.51 ± 1.4554.00 ± 4.1895.41 ± 0.9670.64 ± 1.9596.87 ± 0.6473.10 ± 1.2664.90 ± 2.3780.20 ± 2.2182.19 ± 1.9580.86 ± 1.3978.15 ± 2.8579.84 ± 3.2480.36 ± 2.8698.73 ± 0.2390.58 ± 1.4196.67 ± 1.0699.33 ± 0.1888.34 ± 4.1578.97 ± 2.0475.49 ± 1.1043.73 ± 3.4696.97 ± 0.3897.67 ± 0.1278.65 ± 2.3193.27 ± 1.0377.78 ± 1.2695.84 ± 0.9386.07 ± 1.1783.20 ± 0.8594.00 ± 2.53

III. EXPERIMENTAL RESULTS

The performance of many new network learning and con-

struction algorithms has been evaluated on only relatively fewproblems. This is despite some criticism about this practice[10]. In order to compare the effectiveness of our proposedmethod with existing algorithms, 32 classification problems as

listed in Table I were selected. These are real-world classifica-tion problems with mixed discrete and continuous attributes.The data sets for these problems are available from the website

of the Machine Learning Research Group at the Departmentof Computer Science, University of Waikato '. They are alsoavailable from the UCI Machine Learning Repository [11].

For each data set, the experimental setting was as follows:

1) Ten-fold cross-validation scheme: we split each data setrandomly into 10 subsets of equal size. Eight subsetswere used for training, one subset was used for cross-

I http://www.cs.waikato.ac.nz/bvd/weka/index.htmI

975

anneal

audiologyaustralianautosbalancebcbreast-wgerman

glassg2hchhheartstathepatitiscolic

hypoionosiris

krlaborlymph,pima

primary

segmentsicksonar

soybeanvehiclevotevowelwave

zoo

885134424248666

2725452222311921019815520078227761084

464134312110693455218450949436434611228307

Page 5: [IEEE International Joint Conference on Neural Networks 2005 - Montreal, Que., Canada (31 July-4 Aug. 2005)] Proceedings. 2005 IEEE International Joint Conference on Neural Networks,

TABLE III

COMPARISON OF THE ACCURACY RATES OF NEURAL NETWORKS, C5.0, M5S, AND N2C2S.

Data set NNPCv j C5.0 [ M5' N2C2S

98.2473.8385.5154.0095.4170.6496.87

± 0.60± 2.20± 1.40

4.10± 0.90± 1.90± 0.60

73.10 ± 1.2064.90 ± 2.3080.20 ± 2.2082.19 ± 1.9080.86 ± 1.3078.15 ± 2.8079.84 ± 3.2080.36 ± 2.8098.73 ± 0.2090.58 ± 1.4096.67 ± 1.0099.33 ± 0.1088.34 ± 4.1078.97 + 2.0075.49 ± 1.1043.73 + 3.4096.97 ± 0.3097.67 ± 0.1078.65 ± 2.3093.27 ± 1.0077.78 + 1.2095.84 ± 0.9086.07 ± 1.1083.20 ± 0.8094.00 ± 2.50

98.70 ± 0.30 o76.50 ± 1.40 o85.30 ± 0.5080.00 ± 2.50 o77.60 ± 1.00.73.30 ± 1.60 o

94.50 ± 0.30.71.20 ± 1.00.67.50 ± 2.60 o78.70 ± 2.1076.80 ± 1.40.79.80 ± 0.90.78.70 ± 1.4079.30 ± 1.2085.30 ± 0.60 o99.50 ± 0.00 o

88.90 ± 1.6094.50 ± 0.70.99.50 ± 0.10 o78.10 ± 4.80.75.40 ± 2.80.74.50 ± 1.2041.80 ± 1.3096.80 ± 0.2098.80 ± 0.1074.70 ± 2.8091.30 ± 0.5072.90 ± 1.2096.30 ± 0.6079.80 + 1.3075.40 ± 0.5091.80 ± 1.10

98.80 ± 0.20 o76.70 ± 1.00 o85.80 ± 0.9074.40 ± 1.90 o86.40 ± 0.70.69.60 ± 2.3095.30 ± 0.30.72.90 ± 0.7070.50 ± 2.80 o81.80 ± 2.2080.90 ± 1.4079.00 ± 0.8082.20 ± 1.00 o

81.90 ± 2.2084.60 ± 0.70 o

96.60 ± 0.10.89.70 ± 1.2094.70 ± 0.7099.40 ± 0.1079.70 ± 4.6079.80 ± 1.4076.20 ± 0.8045.10 ± 1.6097.00 ± 0.2098.30 ± 0.10 o78.50 + 3.4092.50 ± 0.5076.50 ± 1.3096.20 ± 0.3081.70 ± 1.1082.00 ± 0.2092.10 ± 1.30

99.40 ± 0.10 o

79.50 ± 1.60 o84.80 ± 0.7072.30 ± 1.80 o

92.30 ± 0.8067.80 ± 2.1096.50 ± 0.2070.10 ± 1.60.65.50 ± 3.0077.90 ± 2.5082.40 ± 1.9081.60 ± 1.6077.50 ± 1.0081.90 ± 3.3078.90 ± 1.2096.80 ± 0.10.89.50 ± 2.2096.60 ± 0.6099.20 ± 0.10.91.90 ± 4.1082.90 ± 1.90 o76.00 ± 0.8045.90 ± 1.3095.10 ± 0.4097.50 ± 0.1076.50 ± 1.5092.90 ± 0.7084.20 ± 0.70 o

96.00 ± 0.4088.80 ± 1.40 o85.60 ± 0.20 o94.30 ± 1.40

validation, and one subset for measuring the predictiveaccuracy of the final pruned network. This procedurewas performed 10 times so that each subset was used as

a test set and a cross-validation set once. The averagetest set accuracy of the ten networks is reported as thefinal network's accuracy.

2) The penalty parameter values were set as follows: 61 =0.1,E2 = 0.00001 and 3 = 10.0.

3) The number of output units corresponds to the numberof classes in the data. The winner-takes-all strategy forprediction is used.

4) The number of input units depends on the numberof attributes in the data and their characteristics. Onenetwork input unit was assigned to each continuous

attribute in the data set. Nominal attributes were binarycoded. A nominal attribute with D possible values was

assigned D network inputs. A binary attribute with no

missing value required one network input unit, whilea binary attribute with missing value required 2 inputunits.

5) Continuous attribute values were scaled to range in theinterval [0, 1].

6) A missing continuous attribute value was replaced by theaverage of the non-missing values. A missing discreteattribute value was assigned the value "unknown" andthe corresponding components of the input vector Xd

were set to the zero.

7) During network training, the BFGS optimization method

976

annealaudiologyaustralianautosbalancebcbreast-wgermanglassg2hchhheartstathepatitiscolichypoionosiriskrlaborlymphpimaprimarysegmentsicksonarsoybeanvehiclevotevowelwavezoo

Page 6: [IEEE International Joint Conference on Neural Networks 2005 - Montreal, Que., Canada (31 July-4 Aug. 2005)] Proceedings. 2005 IEEE International Joint Conference on Neural Networks,

TABLE IV

SUMMARY OF THE RESULTS FROM OUR METHOD COMPARED TO THOSE

FROM C5.0, M5 AND N2C2S.

was terminated if the relative decrease in the error func-tion value after two consecutive iterations was less than10-5 or the maximum number of function evaluationswas reached. This number was set to 500.

The experimental results are presented in Table II andTable III. In Table LI we show the average number of networkweights after phase I and the final average number of weightsand the average accuracy rates of the pruned networks. In 19data sets, phase 2 of the method removed at least 20% of theconnections that were still present in the network after pruningin phase 1. For most of the data sets, the accuracy rates were

also higher after more hidden units and connections had beenremoved.We compare our results with the accuracy rates obtained

by M5 algorithm[12], C5.0 algorithm[ 13] and N2C2S neuralnetwork construction algorithm[14]. Both M5 and C5.0 are

decision tree methods for classification, while N2C2S is a

neural network construction algorithm. We used the t statisticfor testing the null hypothesis that the means from twomethods are equal and then a two-tailed test was conducted.If the null hypothesis was rejected at the significance level ofa = 0.01, we then checked if our new method NNPCv had thehigher average accuracy rate. In Table III, a higher accuracy

obtained by NNPCv is denoted by a bullet (E), while a loss isdenoted by a diamond (o), a tie is left unmarked.

Finally, a summary of how our method performs is givenin Table IV for comparison purpose. The proposed neuralnetwork pruning algorithm achieves better prediction accuracyrates compared the two decision tree methods. Compared toC5.0, which is the most widely used decision tree method,for 15 of 32 data sets the pruned neural networks are more

accurate. The number of data sets where C5.0 is better isonly 9. We can conclude that the new method can achievesignificantly better predictive accuracy than the other methodson more problems.

IV. CONCLUSION AND DISCUSSION

We have presented an effective method for pruning feed-forward neural networks. The main difference between thismethod and other network pruning algorithms is the use ofcross-validation samples. The algorithm computes the accu-

racy of the network being pruned on the training samples as

well as on the cross-validation samples to guide and stopthe pruning process. Pruning is achieved by first identify-ing individual network weights with small magnitude. Theseweights can be removed without changing the network outputsvery much. When there are no more connections that can beremoved based on their magnitude, the algorithm attempts toremove hidden units which do not contribute significantly tothe network prediction. Such hidden units are identified bythe accuracy of the network with each hidden unit removedand we may be able to remove the hidden unit if we makean appropriate adjustment to the bias value at the output unitsand retrain the smaller network.We have extensively tested our proposed method and com-

pared the results on 32 publicly available data sets withmixed continuous and discrete attributes and many classes.Our results show that compared to the decision tree methodsand our-previous neural network construction algorithm, thenew algorithm can provide better prediction accuracy for manyof these data sets.

REFERENCES[1] S. Fahlman and C. Lebiere, "The cascade-correlation learning

architecture:' Advances in Neural Information Processing Sys-tems 2, pp. 524-532, 1990.

[2] M. Mezard and J. Nadal, "Learning in feedforward layerednetworks: The tiling algorithm," Journal of Physics A, vol. 12,no. 12, pp. 2191-2203, 1989.

[3] J. Y. R. Parekh and V. Honavar, "Self-organizing network for op-timum supervised learning," IEEE Trans. on Neural Networks,vol. 11, no. 2, pp. 436-451, 2000.

[4] Y. LeCun, S. S. J.S. Denker, and L. Jackel, "Optimal braindamage," Advances in Neural Infonnation Processing II, pp.598-605, 1990.

[5] A. Krogh and J. Hertz, "A simple weight decay can improvegeneralization." Advances in Neural Information ProcessingSystems 4, 1992.

[6] A. M. S. Yasui and J. Zurada, "Convergence suppression anddivergence facilitation: New approach to prune hidden layer andweights of feedforward neural networks," IEEE Intl. Symp. on

Circuits and Systems, 1995.[7] R. Setiono, "A penalty function approach for pruning feedfor-

ward neural networks:" Neural Computation, vol. 9, 1997.[8] K. Lang and M. Witbrock, "Learning to tell two spirals apart,"

Proceedings of the 1988 Connectionist summer school, 1988.[9] J. E. Dennis and R. B. Schnabel, "Numerical methods for

unconstrainted optimization and nonlinear equations," PrenticalHalls, Engelwood Cliffs, NJ, 1983.

[10] L. Prechelt, "A quantitative study of experimental evaluation ofneural network leraning algorithms," Neural Networks, vol. 9,pp. 457-462, 1996.

[11] E. K. C. Blake and C. Merz, "Uci repository of machinelearning databases:" Irvine, CA: Uni of California, Departmentof Information and Computer Science, 1998.

[12] S. 1. G. H. E. Frank, Y. Wang and I. Witten, "Using model treesfor classification," Machine Learning, vol. 32, no. 1, pp. 63-76,1997.

[13] R. Quinlan, "C4.5: Programs for machine learning." MorganKaufman, San Mateo, CA, 1993.

[141 R. Setiono, "Feedforward neural network construction usingcross-validation:' Neural Computation, vol. 13, no. 12, pp.2865-2877, 2001.

977

NNPCv versus Wins (-) Ties Losses(o)

C5.0 15 8 9

M5' 11 14 7

N2C2S 9 16 7