
Computers Ops Res. Vol. 20, No. 7, pp. 769-782, 1993    0305-0548/93 $6.00 + 0.00
Printed in Great Britain. All rights reserved    Copyright © 1993 Pergamon Press Ltd

AN EXPERIMENTAL EVALUATION OF NEURAL NETWORKS FOR CLASSIFICATION

VENKAT SUBRAMANIAN,† MING S. HUNG‡ and MICHAEL Y. HU§¶

School of Business, University of Wisconsin-Parkside, Kenosha, WI 53141 and College of Business Administration, Kent State University, Kent, OH 44242-0001, U.S.A.

(Received May 1992; in revised form September 1992)

Scope and Purpose-This paper includes the analysis of classical methods and neural networks for classifying multi-dimensional objects. Experiments were designed and run to uncover the difference in performance between the methods under various conditions. The classification problems investigated include objects of two or three variables (dimensions) from two or three groups.

The purpose of this paper is twofold: first, to introduce the application of neural networks to classification; and second, to compare the performance of neural network classifiers with that of the classical statistical classifiers under conditions ideal for the latter.

Abstract-Artificial neural networks are new methods for classification. In this paper, we describe how to build neural network models. These models are then compared with classical models such as linear discriminant analysis and quadratic discriminant analysis. While neural network models can solve some difficult classification problems where classical models cannot, the results show that even under the best conditions for the classical models, neural networks are quite competitive. Furthermore, neural networks are more robust in that they are less sensitive to changes in sample size, number of groups, number of variables, proportions of group memberships, and degrees of overlap among groups.

INTRODUCTION

'Classification' here refers to the process of assigning an object to an appropriate class, based on a set of attributes describing the object. Gordon [1] shows that classification has been used in fields as diverse as taxonomy, psychology, nosology, archaeology, pattern recognition, linguistics, and market research. Other applications include credit scoring [2], prediction of business failures [3], of tender offer outcomes [4] and of credit card usage [5].

Most of the classification procedures rely on the Bayesian decision rule [6], which minimizes the expected rate of misclassification. If the attributes are normally distributed, then the Bayesian classifier becomes the classical linear discriminant analysis (LDA) by Fisher [7] and quadratic discriminant analysis (QDA) by Smith [8].

In this paper we present another approach for classification based on artificial neural networks. Lippmann [9] shows that neural networks can produce not only continuous discriminant functions but also functions that define arbitrary, nonconvex regions. Neural networks have been used for such complex classification problems as fault detection [10], identification of two interleaving spirals [11], handwritten character recognition [12,13], fingerprint processing [14], real-time signal processing [15], speech recognition [16], detection of explosives in airline baggage [17], bond rating [18], and bank failure predictions [19]. Given that neural networks can solve such complex problems, an interesting question is whether they can perform as well when the conditions are appropriate for the classical methods. This study attempts to answer this question by conducting an extensive experiment involving two and three group classification problems of two and three variables.

†V. Subramanian received a Ph.D. in Business from Kent State University. His research interests include neural network theory and applications, object-oriented systems, database theory and applications of AI to business.

‡M. S. Hung is Professor of Operations Research in the Graduate School of Management at Kent State University. His primary research interests include network and other mathematical programming, neural networks and communication networks. His publications have appeared in Operations Research, Naval Research Logistics Quarterly, and others.

§M. Y. Hu earned his Ph.D. in Management Science from the University of Minnesota and is currently Associate Professor of Marketing at Kent State University. His research interests include applications of neural networks, applied statistics, marketing research and international business.

¶Author for correspondence.


TRADITIONAL CLASSIFIERS

Statistical methods

Multivariate discriminant analysis assumes that each object is associated with a vector x of attributes which are multivariate normal. (We shall use x to mean both the attributes of an object and the object itself.) For group j, let x̄_j be the vector of attribute means and S_j be the variance-covariance matrix among the attributes. The pooled covariance matrix (where all the objects are considered as one group) is denoted by S. If the covariance matrices are equal, i.e. S_j = S for every group j, then Fisher's LDA [7] maximizes the likelihood of correct classifications. The assignment of object x is based on the 'Wald-Anderson' statistic, which is defined for every pair of groups i and j and is denoted by W_ij [20]:

W_{ij} = x^t S^{-1}(\bar{x}_i - \bar{x}_j) - \tfrac{1}{2}(\bar{x}_i + \bar{x}_j)^t S^{-1}(\bar{x}_i - \bar{x}_j)    (1)

The superscript t denotes vector transpose. Object x is assigned to population i if W_ij > 0 for all j ≠ i. The 'separation' functions are the hyperplanes where W_ij = 0 for each pair i and j.
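As an illustration, the pairwise rule above can be coded directly. The following is a minimal sketch (not the authors' implementation), assuming NumPy; `means` is a list of group mean vectors and `S` the pooled covariance matrix, both illustrative names.

```python
# Minimal sketch of the pairwise Wald-Anderson rule of equation (1).
import numpy as np

def wald_anderson(x, xbar_i, xbar_j, S_inv):
    """W_ij of equation (1); positive values favour group i over group j."""
    d = xbar_i - xbar_j
    return x @ S_inv @ d - 0.5 * (xbar_i + xbar_j) @ S_inv @ d

def lda_assign(x, means, S):
    """Assign x to the group i for which W_ij > 0 against every other group j."""
    S_inv = np.linalg.inv(S)
    k = len(means)
    for i in range(k):
        if all(wald_anderson(x, means[i], means[j], S_inv) > 0
               for j in range(k) if j != i):
            return i
    return None  # x lies on a separating hyperplane (some W_ij = 0)
```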

When the covariance matrices are unequal, Smith [8] showed that the discriminant function that maximizes the likelihood of correct classification is quadratic. The discriminant score for object x in a two-group problem is [21]:

Q = x^t \Gamma x + \beta^t x - \alpha

where

\Gamma = S_2^{-1} - S_1^{-1},

\beta = 2(S_1^{-1}\bar{x}_1 - S_2^{-1}\bar{x}_2),

\alpha = \ln(|S_1|/|S_2|) + \bar{x}_1^t S_1^{-1}\bar{x}_1 - \bar{x}_2^t S_2^{-1}\bar{x}_2.

S_1 and S_2 are respectively the covariance matrices of populations 1 and 2, and |A| denotes the determinant of matrix A. The separation function is the set of points where Q = 0. As Rubin [21] indicated, if the covariance matrices are equal then the function Q becomes linear. For problems involving more than two groups, there is no statistic similar to the Wald-Anderson's. Procedure DISCRIM of SAS [22], the procedure used in this study for LDA and QDA, defines a 'generalized square distance' between an object and the mean of each group and assigns the object to the nearest group.
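For concreteness, a sketch of the quadratic score (again assuming NumPy; under this sign convention Q > 0 favours group 1, which is our reading of the formulas above rather than a statement from the paper):

```python
# Sketch of the two-group quadratic score Q, using the notation above.
import numpy as np

def qda_score(x, xbar1, xbar2, S1, S2):
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    Gamma = S2_inv - S1_inv
    beta = 2.0 * (S1_inv @ xbar1 - S2_inv @ xbar2)
    alpha = (np.log(np.linalg.det(S1) / np.linalg.det(S2))
             + xbar1 @ S1_inv @ xbar1 - xbar2 @ S2_inv @ xbar2)
    return x @ Gamma @ x + beta @ x - alpha  # Q > 0 assigns x to group 1
```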

Nonparametric models

When the distributions of the attributes cannot be specified or do not fit known functions, nonparametric methods are used for classification. These include linear programming models by Freed and Glover [23], Bajgier and Hill [24], and Markowski and Markowski [25]; mixed integer programs by Bajgier and Hill [24] and Gehrlein [26]; and nonlinear programs by Stam and Joachimsthaler [27]. In addition, models such as k-nearest-neighbor and the Parzen window [6] are also popular.

Bajgier and Hill [24] compared Freed and Glover's linear programming models and a mixed integer programming (MIP) model of their own to Fisher's linear discriminant analysis. The experimental subjects are two-group problems of normally distributed variables with varying degrees of overlap. They reported that Freed and Glover's MSD (minimizing the sum of deviations) model and MIP did better than LDA when the group overlap is high and worse when overlap is very low. Freed and Glover's MMD model (minimizing the maximum deviation) was dominated by MSD in all cases.

Joachimsthaler and Stam [28] evaluated LDA, QDA and MSD on data from a variety of normal and non-normal distributions. They reported that MSD is increasingly better than LDA and QDA with increasing violation of the normality assumption (as measured by increasing kurtosis). As expected, LDA was the best when both the normality and equal variance assumptions hold. QDA was best when the normality assumption holds and the variances are unequal.


The above discussions indicate a weakness among the traditional models in that they work best only when conditions are exactly right. A modeler needs a good understanding of both problem properties and model capabilities. Every time the problem properties change, model selection must be repeated. The following section will show how neural networks can be used to overcome these problems.

NEURAL NETWORKS

The idea of neural computing can be traced back over 50 years to McCulloch and Pitts [29], who developed simple networks, called perceptrons, for pattern recognition. Today, artificial neural networks are abstract models. For more details on neural networks and their applications, see [30-32].

Let G = (N, A) denote a neural network where N is the node set and A is the arc set. Each arc in A is directed. We assume that G is acyclic in that it contains no directed circuit. The node set N is partitioned into three subsets: N_I, N_O, and N_H. N_I is the set of input nodes, N_O is that of output nodes, and N_H that of hidden nodes. In a popular form called the multi-layer perceptron, all the input nodes are in one layer, all the output nodes are in another layer, and the hidden nodes are distributed into several layers in between. The knowledge learned by a network is stored in the arcs and the nodes, in the form of arc weights and node values called biases. We will use the term k-layered network to mean a layered network with k - 2 hidden layers (some authors do not count the input nodes as a layer).

When a pattern is presented to the network, the attributes of the pattern 'activate' some of the neurons. Let a_i^p represent the activation value at node i corresponding to input pattern p. If i is an input node then a_i^p is the value of the ith variable in the input pattern. If i is not an input node, then a_i^p is a function of the total input to the node, NET_i^p:

a_i^p = F(NET_i^p)

where F is called the activation function and NET_i^p is defined as:

NET_i^p = \sum_k w_{ki} a_k^p + \theta_i

where w_{ki} is the weight of arc (k, i) and \theta_i is the bias of node i. The activation function is typically a 'squashing' function that normalizes the input signals so that the activation value is between 0 and 1. The most popular choice for F is the logistic function, with which the activation value, or output, at node j is:

a_j^p = 1 / (1 + e^{-NET_j^p})

Since the network G is acyclic, even when it is not layered, computation can proceed from the input nodes, one layer at a time, toward the output nodes until the activation values of all the output nodes are determined.
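A minimal sketch of this layer-by-layer computation with logistic activations (assuming NumPy; the per-layer weight matrices and bias vectors are illustrative stand-ins for the arc weights w_ki and the biases θ_i):

```python
import numpy as np

def logistic(net):
    """The logistic 'squashing' function F."""
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, weights, biases):
    """Propagate an input pattern through a layered network; returns the output activations."""
    a = np.asarray(x, dtype=float)
    for W, theta in zip(weights, biases):
        net = W @ a + theta      # NET_i = sum_k w_ki * a_k + theta_i
        a = logistic(net)        # a_i = F(NET_i)
    return a
```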

To simplify notation, we shall define for each node j an extra arc (0, j) such that w_{0j} = \theta_j. Let w denote the vector of all weights. Vector w is determined by 'training' the network with a set of training patterns. For pattern p, let E^p represent some measure of the errors:

E^p = \sum_{i \in N_O} |a_i^p - t_i^p|^\ell

where t_i^p is the target value for output node i and \ell is a non-negative real number. The objective is to minimize \sum_p E^p. A popular choice is the least squares problem, where \ell = 2.

When there are two groups, we use only one output node with target value = 0 for group 1 and target value = 1 for group 2. The target values are arbitrary, but our choices seem to be typical. When there are k groups, we use \lceil \log_2 k \rceil output nodes, where \lceil a \rceil is the least integer ≥ a. The most popular algorithm for training is called back-propagation [33]. Recently, we showed that training can be vastly improved with known nonlinear optimization algorithms [34,35]. Since network training, as shown above, is a nonlinear (unconstrained) minimization problem, well known algorithms such as conjugate gradient or quasi-Newton can be more efficient and reliable


than back-propagation. See Fletcher [36], for example, for an introduction to nonlinear optimization. For training the networks presented in this paper we used our own GRG2-based system. GRG2 is a widely distributed nonlinear programming software [37]. For details of our system, see Subramanian and Hung [35].
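We do not reproduce the GRG2-based code here; as a hedged sketch, the same least-squares training problem for the three-layer networks of this study can be handed to an off-the-shelf quasi-Newton routine. SciPy's BFGS stands in for GRG2 below, and all function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(w, n_in, n_hid, n_out):
    """Split a flat weight vector into layer matrices and bias vectors."""
    i = 0
    W1 = w[i:i + n_hid * n_in].reshape(n_hid, n_in); i += n_hid * n_in
    b1 = w[i:i + n_hid]; i += n_hid
    W2 = w[i:i + n_out * n_hid].reshape(n_out, n_hid); i += n_out * n_hid
    b2 = w[i:i + n_out]
    return W1, b1, W2, b2

def sse(w, X, T, n_in, n_hid, n_out):
    """Sum of squared errors over all training patterns (the case l = 2)."""
    W1, b1, W2, b2 = unpack(w, n_in, n_hid, n_out)
    H = 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))   # hidden-layer activations
    A = 1.0 / (1.0 + np.exp(-(H @ W2.T + b2)))   # output activations
    return np.sum((A - T) ** 2)

def train(X, T, n_hid, seed=0):
    n_in, n_out = X.shape[1], T.shape[1]
    n_w = n_hid * (n_in + 1) + n_out * (n_hid + 1)
    w0 = np.random.default_rng(seed).uniform(-2.5, 2.5, n_w)  # initial weights as in this study
    res = minimize(sse, w0, args=(X, T, n_in, n_hid, n_out), method='BFGS')
    return unpack(res.x, n_in, n_hid, n_out)
```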

Denton et al. [38] used a three-layer perceptron for classifying two-variable objects into two groups. Each network has two input nodes, five hidden nodes and one output node. Back-propagation was used to train the networks. The variables are normally distributed. Sets of problems of varying degrees of overlap were generated and in one set an outlier was added. The results showed that the neural network classifier performed better, in terms of correct classifications, than LDA, QDA and MSD. Moreover, the neural network model was not sensitive to the outlier whereas the other models were very sensitive. For the well-known Fisher’s Iris classification problem, the neural network model was able to correctly classify 149 out of 150 observations, equalling the best performance reported so far using mixed integer programs [26].

Huang and Lippmann [39] reported on an extensive comparative study of neural networks and traditional classifiers. There are three key findings. First, neural networks can perform as well as Bayesian classifiers. Second, neural networks are sometimes even better than classical classifiers in the presence of outliers and when statistical assumptions are violated. Third, the performance of neural networks is best when the network structure matches that of the problem.

Very recently, Tam and Kiang [19] compared the ability of neural networks with that of both parametric and nonparametric classifiers (discriminant analysis models, k-nearest neighbors, and several others) to predict bank failures. The data were real and most of them did not fit normal distributions even after data transformations. The results indicate that neural networks are more accurate than all the other methods.

It is fair to say that on real data neural networks tend to perform better in classification because they are more general than the other methods. Our study takes off from the first finding of Huang and Lippmann [39] and examines more closely how competitive neural networks are under the best conditions for classical models. It constitutes a well-planned experimental design incorporating five factors that may have an impact on the classification rate.

DESIGN OF COMPARATIVE STUDY

To evaluate the aptness of neural networks for classification under optimal conditions for the classical LDA model, we consider five factors: number of groups, number of variables, variance values, proportion of observations from each group, and sample size. In the following, we will explain each factor in some detail. There are two reasons behind this design. One is obviously for comparing the performance of the various models. In addition to LDA, QDA was included to provide an alternative for the statistical approach. The other reason is to measure the ‘robustness’ of each method. For this, the sensitivity of each method with respect to problem characteristics will be modeled using analysis of variance. While variance and proportion are typical parameters in classification problems, we feel that the number of variables and the number of groups contribute to the complexity of a problem.

From now on we shall use 'case' to represent a problem type. For example, we will use case 1 to represent the two-group two-variable problems, case 2 to represent the two-group three-variable problems, and so forth. For each case, a 3 x 3 x 3 factorial experiment with variance, proportion, and sample size is used. (See Box et al. [40] for an introduction to experimental design.) The observations within each of the 27 factor combinations constitute training samples as specified by these treatments. NN, LDA and QDA will be applied to each of these samples and the three techniques will be compared in terms of correct classification rate.

Figure 1 shows that the means of the three-group problems in two variables are respectively (10,13.66), (5,5) and (15,5). The locations are arbitrary and the only reason for this design is to keep the Euclidean distance between centers at 10. For two-group problems in two variables, the centers are (5,5) and (15,5); in other words, group 1 is deleted. For both two-group and three-group problems in three variables, the third dimension for the centers is fixed at 5.

Each variable is assumed normally distributed and all the variables have equal variance. Since the means are fixed, as variance increases there is increasing overlap among the observations from different groups.


Fig. 1. Three-group bivariate population.

Fig. 2. Degrees of overlap between groups (low, medium and high).

There are three levels of variance for each variable: 2.79, 9.00 and 25.00. These values are selected so that the amount of overlap is, respectively, low (points from each group in a two-group problem have about a 0.001 probability of falling in the overlap area), medium (the probability is about 0.05), and high (0.16). Figure 2 shows the amount of overlap for the three-group two-variable case, at ±3σ limits for the three variance values.
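These overlap levels can be checked numerically. A sketch, assuming 'overlap' means the probability that a point from one group falls beyond the midpoint between two centres 10 units apart (our reading, not necessarily the authors' exact definition), and using SciPy for the normal c.d.f.:

```python
from scipy.stats import norm

for var in (2.79, 9.00, 25.00):
    sigma = var ** 0.5
    p = norm.cdf(-5.0 / sigma)   # half the centre distance (5) over the common sigma
    print(f"variance {var:5.2f}: overlap probability ~ {p:.3f}")
# roughly 0.001, 0.048 and 0.159, matching the low, medium and high levels
```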


Table 1. Sample and test set labels

Proportion of group 1†              Degree of overlap (variance)
Two-group    Three-group            Low (2.79)    Medium (9.00)    High (25.00)
0.1          0.1                    Set 1         Set 4            Set 7
0.3          0.2                    Set 2         Set 5            Set 8
0.5          0.3                    Set 3         Set 6            Set 9

†The proportion of group 2 is held at 0.3 for the three-group cases.

Since the variables satisfy all the assumptions for linear discriminant analysis, the hyperplanes where the Wald-Anderson statistics W_ij = 0 [see equation (1)] would be the 'optimal' separation functions in the sense that they provide the best separation for the observations in the populations.

Another factor considered is the proportion of sample objects from each group. Since each observation is weighted equally, different proportional representation in the sample may yield different separation functions. For the two-group cases, the proportion of group 1 has three levels: 0.1, 0.3 and 0.5. For the three-group cases, the proportion from group 2 is held fixed at 0.3 and that from group 1 has three levels: 0.1, 0.2 and 0.3.

The nine factorial combinations of the factors variance and proportion are nested in each of the four cases. Table 1 shows how the combinations are labeled. In each combination, five samples of a given size are randomly drawn and used to build an LDA, a QDA, and a neural network model. Each model is then evaluated on an independent test (hold-out) set of 300 objects drawn from the same combination. From now on, combination and set are synonymous. The test sets are surrogates for populations. This evaluation is equivalent to situations where populations have finite size and samples are purely random. The inclusion of QDA is justified by the fact that sample data may not have equal variance among different groups, even though the population data do.

The sample sizes used are 10, 30 and 90. The choice was based on a preliminary study which showed that samples smaller than 10 yielded unreliable (large variance) results for LDA and QDA. In fact, LDA and QDA require at least two objects from each group; hence with sample size 10 and proportion 0.1 they will not be able to produce the discriminant function. The preliminary study also showed that samples over 90 did not yield results better than samples of 90. The size of 30 was chosen because it seemed to be near the boundary between ‘small’ and ‘large’ samples.
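To make the design concrete, the following sketch draws one training sample for the three-group two-variable case under a given variance, group-1 proportion and sample size (the centres follow Fig. 1; the function and variable names are illustrative, not the authors' code):

```python
import numpy as np

CENTERS_3G_2V = np.array([[10.0, 13.66], [5.0, 5.0], [15.0, 5.0]])  # Fig. 1

def draw_sample(variance, prop_group1, n, prop_group2=0.3, seed=0):
    """Return (X, y): group 1 gets prop_group1 of the n objects,
    group 2 a fixed 0.3, and group 3 the remainder, as in Table 1."""
    rng = np.random.default_rng(seed)
    counts = [int(round(prop_group1 * n)), int(round(prop_group2 * n))]
    counts.append(n - sum(counts))
    X, y = [], []
    for g, c in enumerate(counts):
        X.append(rng.normal(CENTERS_3G_2V[g], np.sqrt(variance), size=(c, 2)))
        y += [g] * c
    return np.vstack(X), np.array(y)

# e.g. set 5 with sample size 30: medium overlap, group-1 proportion 0.2
X, y = draw_sample(variance=9.00, prop_group1=0.2, n=30)
```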

The neural network models are all fully-connected networks of three layers. Let n be the number of variables and k be the number of groups. Each network has n input nodes, k hidden nodes and \lceil \log_2 k \rceil output nodes. As Huang and Lippmann [39] showed, the performance of neural networks is best when their structure fits that of the problem. The more significant aspects of the structure are the number of hidden layers and the number of hidden nodes. The current literature indicates that for almost all applications one hidden layer is sufficient and that the number of hidden nodes should be around 2n + 1 or less. We experimented with several choices of hidden nodes from k to 2n + 1. We found that as the number of hidden nodes increases, the classification rate increases in the training samples but decreases in the test sets. Since our main measure of performance is the classification rate on the test set, we settled on k hidden nodes. For the output nodes some people prefer k nodes, one for each group. We also experimented on this and found little difference in performance. The objective function is the sum of squared errors and the activation function at each node is the logistic function. The training algorithm used is our GRG2-based system [35]. The initial weights for getting the training algorithm started are drawn randomly from the uniform distribution (-2.5, 2.5).

The measure of performance for each classifier is the percentage of correctly classified observations in the test sets. For the neural network, an object is correctly classified if the activation value of every output node is within 0.5 of the target value.
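A sketch of this correctness criterion, assuming the network outputs and the binary target codes are collected in arrays of shape (objects, output nodes):

```python
import numpy as np

def correctly_classified(outputs, targets, tol=0.5):
    """True where every output activation is within tol of its target value."""
    return np.all(np.abs(outputs - targets) < tol, axis=1)

def classification_rate(outputs, targets):
    """Percentage of correctly classified objects in a test set."""
    return 100.0 * correctly_classified(outputs, targets).mean()
```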

SAS procedure DISCRIM [22] was used as the LDA classifier and the same procedure with the option to not pool the variance-covariance matrices was used as the QDA classifier. As mentioned before, PROC DISCRIM computes a measure called ‘generalized square distance’ for the assignment of an object. For two-group problems, this measure leads to the same assignment as the Wald-Anderson statistic.


RESULTS

The outcomes are summarized in Tables 2-5. Each number is the mean percent correct classifications based on the five observations of each combination. For runs with sample size 10, LDA and QDA did not produce any results for sets 1,4 and 7 since one group has only one object in the sample. We will use ‘observation’ to mean the set of the percentages of correct classification in each replication. In other words, there are five observations of each sample size in each of the nine sets in each case, except for those sets where LDA and QDA had no results.

Comparisons

A number of interesting patterns can be discerned from the tables. First of all, one can see differences between the performance of LDA and QDA. For models built from small samples (size 10), LDA is better in all cases and all sets where both models produced results. An explanation is that by not pooling the variance-covariance matrices, QDA is more sensitive to variations in them, and in small samples one would expect greater variation in those matrices. For larger samples, the variation is smaller and the two methods are practically equal in performance. This is to be expected since, as Rubin [21] indicated, QDA includes LDA as a special case.

Secondly, for large samples (size 90), where LDA and QDA should be nearly optimal since the assumptions of normality and equal variances should be almost exactly satisfied, the neural network models are not far behind.

Table 2. Percentage of correctly classifications (two-group two-variable)

Test set LDA

Sample size = 10 Sample size = 30 Sample size = 90

QDA NN LDA QDA NN LDA QDA NN

1 2 100.00 3 100.00 4 5 9it.67 6 97.61 7 - 8 91.00 9 86.00

100.00 100.00

95.33 93.67

85.33 14.33

100.00 100.00 100.00 100.00 100.00 1CWO 100.00 lOO.cm 100.00 1cKWO ltxl.00 100.00 100.00 loO.CNJ 100.00 100.00 100.00 100.00 100.00 100.00 1CCMO 95.33 99.00 99.00 98.33 99.00 99.00 96.33 98.00 98.67 98.67 91.61 99.00 99.33 95.61 97.61 98.00 97.61 96.33 98.00 98.00 97.33 93.33 95.61 95.61 93.00 97.00 96.61 94.61 90.33 91.00 88.61 81.61 91.61 91.61 90.61 90.33 89.00 86.00 85.33 89.61 88.33 84.33

Table 3. Percentage of correct classifications (two-group three-variable)

Test set    Sample size = 10            Sample size = 30            Sample size = 90
            LDA    QDA    NN            LDA    QDA    NN            LDA    QDA    NN

100.00 100.00 95.61 100.00 100.00 100.00 100.00 100.00 81.61 lOO.CMl 100.00 1CQ.00 100.00 100.00 ICKMO loo.00 100.00 99.61 100.00 100.00 100.00 100.00 l@XOO 100.00 100.00

84.00 99.00 99.33 94.33 93.61 98.67 99.00 94.61

9l.cm 91.33 99.00 99.00 91.33 99.00 98.61 94.33 96.00 94.33 98.33 98.33 98.33 95.67 98.33 98.33 95.61

94.00 91.00 92.61 94.33 96.61 97.00 96.33 88.00 19.33 90.33 90.00 90.33 87.33 91.33 91.61 92.00 82.00 80.61 85.33 84.61 85.33 82.33 91.33 88.33 89.00

Table 4. Percentage of correct classifications (three-group two-variable)

Test set

1 2 3 4 5 6 7 8 9

LDA

IOOMJ 100.00

92.33 93.61

67.33 14.33

Sample size = 10 Sample size = 30 Sample size = 90

QDA NN LDA QDA NN LDA QDA NN

94.00 100.00 1OmO 100.00 100.00 100.00 100.00 53.33 IcO.00 100.00 lOO.CKl 100.00 100.00 1cKl.00 100.00 82.33 100.00 100.00 100.00 100.00 100.00 100.00 100.00

w.oO 81.33 92.61 94.33 94.00 94.00 95.00 95.61 93.33 94.00 93.33 94.33 95.33 95.00 92.00

15.61 94.00 94.61 94.33 96.00 95.33 94.61 94.33

26.33 11.61 74.33 81.61 76.33 76.33 81.33 75.00 81.61 80.33 19.67 16.61 79.67 78.33 17.61

64.00 78.33 78.00 77.61 11.33 81.00 18.00 76.33


Table 5. Percentage of correct classifications (three-group three-variable)

Test set    Sample size = 10            Sample size = 30            Sample size = 90
            LDA    QDA    NN            LDA    QDA    NN            LDA    QDA    NN

90.33 86.00 98.00 91.33

71.67 69.33 87.00 79.00

61.67 56.00 68.33 65.67

95.00 1cmo 95.61 100.00 100.00 99.67 100.00 100.00 100.00 100.00 100.00 100.00 IOOSQ 100.00 100.00 100.00 100.00 10mO 100.00 100.00 loO.cQ 87.67 92.67 87.67 91.67 97.67 94.00 93.00 92.33 92.67 93.61 93.33 95.67 94.67 93.67 93.67 93.00 92.00 95.33 96.00 94.67 94.67 77.33 78.67 72.67 76.67 80.67 81.33 77.33 83.33 74.33 74.00 76.67 78.00 19.33 78.00 75.67 72.67 73.00 75.33 78.co 77.33 77.00

On the other hand, when samples are small, neural network models are clearly better.

For test sets 1-3 where there is little overlap among groups, all three methods yielded nearly perfect classifications in all four cases. LDA was perfect in all but sets 2 and 3 of the three-group-three-variable case and neural networks were perfect except in set 1 (where LDA produced no results) of the three-group cases. But the high percentages indicate that both LDA and neural networks are very adept at ‘generalizing’ from sample data. For sets 4-6, LDA is better than QDA, albeit only slightly, in all sets and cases except in set 5 and sample size 90 in Table 2, set 4 sample size 90 in Table 3, set 4 sample size 30 in Table 4, and set 5 sample size 30 in Table 5. For sets 7-9, LDA again is better than QDA in most cases and sets.

Gilbert [41] and Marks and Dunn [42] indicate that at least four factors have an impact on the relative performance of LDA and QDA: the differences in dispersion, the number of variables, the separation among the groups, and sample size. With small sample sizes, a small number of variables, and broadly similar dispersion, the quadratic form would not perform as well as the linear rule. The reasons are as follows. When the dispersion (i.e. covariance) matrices are similar, little is lost by averaging over individual groups. For small sample sizes, the degrees of freedom for the individual group dispersion matrices used in the quadratic rule are fewer than those associated with the pooled within-group dispersion matrix used in LDA. Therefore, because nearly twice the number of parameters need estimation in the quadratic form, sample size considerations can dominate.
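A back-of-the-envelope count makes the parameter argument explicit: both rules estimate k mean vectors of length p, but LDA estimates one pooled covariance matrix while QDA estimates k of them. A minimal sketch (our illustration, not from the paper):

```python
def n_params(p, k, pooled):
    """Number of estimated parameters: k mean vectors plus covariance terms."""
    means = k * p
    covariances = (1 if pooled else k) * p * (p + 1) // 2
    return means + covariances

for p, k in [(2, 2), (3, 2), (2, 3), (3, 3)]:
    print(f"p={p}, k={k}: LDA {n_params(p, k, True)}, QDA {n_params(p, k, False)}")
# e.g. for p = 3, k = 3: LDA estimates 15 parameters, QDA estimates 27
```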

It is clear that as sample size increases, the performance of LDA and QDA improves. This is true in most of the sets and cases, even when we discard sample size 10 from the comparison. On the other hand, the effect of sample size on the performance of neural networks is not significant, as the percentages stay quite close together. The direction of the effect is not clear either. In some situations (e.g. set 9 in Tables 2 and 3), increasing sample size actually leads to degraded performance.

Our idea that the number of variables and the number of groups contribute to the complexity of a problem is supported by comparing the performance of all three methods across cases. For example, from Table 2 to 3 where the number of variables increases by one, the performance is lower. Similarly, from Table 2 to 4 or from Table 3 to 5 where the number of groups increases by one, we see the same pattern.

To get a clearer comparison between neural network and discriminant analysis models, we computed the differences (in percentage of correct classifications) between the better of LDA and QDA (which will be called DA for discriminant analysis) and neural networks (which will be shortened to NN). The summarized results are shown in Table 6. Each number in a column labeled 'Mean' is the average difference of DA minus NN models, averaged over all the other factors. For example, the value of 0.333 in column 1 and case 1 means that DA is 0.333% higher than neural networks in correctly classifying objects in all two-group two-variable problems with low overlap. The t-value is the 'studentized' difference, equal to the mean difference divided by the sample standard deviation. The P-value is the probability that we can find in the t-distribution an observation as extreme as the computed value. In other words, it is the probability of committing the 'Type I Error' of concluding nonzero differences in the population means when the difference is actually zero.
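A sketch of how one such cell can be computed, assuming `diffs` holds the per-observation DA minus NN differences for that cell and that the t statistic is the usual one-sample test of a zero mean difference (SciPy shown; the paper's own computation may differ in detail):

```python
import numpy as np
from scipy import stats

def summarize(diffs):
    """Mean DA-minus-NN difference with its one-sample t value and P value."""
    diffs = np.asarray(diffs, dtype=float)
    t, p = stats.ttest_1samp(diffs, popmean=0.0)
    return diffs.mean(), t, p
```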


Table 6. Difference in classification performance between the better of LDA and QDA (DA) and the neural network classifier (NN) for different cases

Each cell gives the mean DA minus NN difference, its t value, and the P value (in parentheses), under the low, medium and high degrees of overlap and overall (all overlaps). The rows of figures below correspond, in order, to the two-group two-variable, two-group three-variable, three-group two-variable and three-group three-variable cases, and to the overall (all cases) row.

0.3330 3.21 1 XXI88 5.64 2.0743 4.31 1.3377 6.16 (0.0023) (0.0001) (O.O@x) (0.0001)

0.2805 2.68 2.5648 5.34 0.8495 2.19 I .2316 5.41 (0.0108) (O.Otxl1) (0.0342) (0.0001)

0.2805 2.79 0.2065 0.64 0.6728 0.65 0.3866 1 .Ol (0.~82) (0.3270) f0.5201) (0.2862)

-1.1123 -2.28 -2.5910 -2.10 - 3.0520 -2.44 -2.2518 -3.71 (0.0280! (0.04241 (0.0192) (O.OcO3)

-0.0553 -0.40 0.4423 1.18 0.1361 0.30 0.1760 0.87 (0.6874) (0.2384 t (0.7656) (0.3849)

First, let us look at the overall differences. The grand mean of 0.176 indicates that over all the observations in our data set, DA is slightly better, but the difference of less than 0.2% shows that the two approaches are practically identical in performance. This interpretation is supported by the large P-value. The overall means across degrees of overlap also indicate no significant differences between the two approaches. However, the overall means across the cases show a different picture. For the two-group cases, DA is significantly better although the difference is still very slight, around 1.3%. On the other hand, for the three-group three-variable case, NN is significantly better and with a higher difference. This implies that for simple classification problems, the DA approach is an excellent choice. But as the problem becomes more complicated, NN is a better choice.

This interpretation about the complexity of the problem and the choice of approach is borne out in the analysis of differences inside the table. For two-group two-variable problems, the DA approach is significantly better and, interestingly, the mean difference increases with increasing overlap. When the number of variables increases to three, the differences become small in all degrees of overlap, although they are still statistically significant. When the number of groups changes to three, the DA approach is still better when there is low overlap and the difference becomes insignificant with increased overlap. In the three-group three-variable case, NN is significantly better and the differences are the largest in each degree of overlap.

Analysis of factor effects

For a better understanding of where the differences occur, we performed an analysis of variance, using PROC ANOVA of SAS, with proportion and sample size as factors. The results are shown in Table 7. The F values are ratios of sums of squared differences among observations. The P-value (parenthesized) is the probability of committing a Type I error by concluding that the differences are not zero for all levels of a factor when the differences are actually zero in all levels. Since proportion and sample size have three levels, there are nine levels for the interaction. When the P-value is small, we can say that the factor has an effect on the dependent variable which, in this application, is the difference in percentage of correct classifications between the discriminant approach and neural networks.

The first impression one gets after analyzing Table 7 is the prevalent amount of significant effects. Except for problems involving low overlap, proportion, sample size and their interaction all have effects on the difference in performance. For example, looking at the overall statistics over all degrees of overlap, proportion has an effect in three out of four cases, meaning that the differences in performance are not uniform across levels of proportion. Sample size is significant in all four cases and, judging from the larger F ratios, we can conclude that the non-uniformity across sample sizes is greater. This confirms our impression from reading Table 6, where we saw that the DA approach is more sensitive to sample size. The interaction factor having an effect poses some difficulty in explanation. What the results indicate is that the differences are not uniform across the combinations of proportion and sample size. However, the F ratios are smaller in magnitude than those of the individual factors. This would mean that the interaction effects are not as important as the main effects.
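A hedged sketch of an equivalent two-factor ANOVA outside SAS, using statsmodels in place of PROC ANOVA; `df` is a hypothetical data frame with one row per observation and columns `diff`, `proportion` and `sample_size`:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def anova_table(df: pd.DataFrame) -> pd.DataFrame:
    """F and P values for proportion, sample size and their interaction."""
    model = ols('diff ~ C(proportion) * C(sample_size)', data=df).fit()
    return sm.stats.anova_lm(model, typ=2)
```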


Table 7. Effect of sample size, proportion and their interaction on classification performance: F and P values by degree of overlap (low, medium, high and overall) for each case (two-group two-variable, two-group three-variable, three-group two-variable, three-group three-variable and overall).


Looking at the factor effects across degrees of overlap, the significance of both factors and their interaction is even more striking. One can see that the F ratios of both factors increase with increasing overlap. One implication is that as the classification problem becomes more complex (in terms of overlap), controlling proportion and sample size becomes more critical. This impression is also supported by the large F ratios in the three-group three-variable case.

To explain the patterns of factor effects in Table 7, we carried out ANOVA studies on the percentages of correct classification by DA and NN models separately. The results are shown in Tables 8 and 9. For Table 8, the percentages used are the higher between LDA and QDA. The factors are the same: proportions of group memberships and degrees of overlap. The P-value is the probability of committing Type I error by concluding that the percentages are unequal for all levels of the factors when they are actually equal.

The results in Tables 8 and 9 can be used for verifying the patterns observed in Table 7 and also for measuring the robustness of both approaches. Looking at where the factor effects are significant (say, at the 5% level), we see that the factors and their interaction have a similar pattern of impact on the performance of both DA and NN models. Namely, the effects of both factors increase in significance as overlap increases and as the number of variables and the number of groups increase. But most of the F ratios in Table 9 are much smaller than the corresponding F ratios in Table 8. Two points can be made from these results. First, the NN models are less sensitive to the five factors. In other words, the performance of NN models is more robust, compared to DA models, across the problem characteristics (i.e. factors) studied in this experiment. This point is reinforced by the insignificant F ratios in the overall columns and the overall row in Table 9. Second, the pattern observed in Table 7, where the differences in performance between the two approaches increase with increasing problem complexity, is caused, to a small extent, by the difference in the response patterns and, to a much greater extent, by the difference in the sensitivity of the approaches to the problem characteristics. The response pattern and the sensitivity of the DA models confirm the previous findings [21,24,27]. The consistently high level of performance and the relative insensitivity of the NN models make them attractive candidates for classification.

CONCLUSIONS

We presented a study comparing the performance of discriminant analysis (LDA and QDA) and neural network models on classification problems designed for the discriminant analysis approach. The problems include four cases; combinations of two and three groups, and two and three variables. The variables are normally distributed with equal variances across groups. Three factors considered in the experiment are sample size, proportion of group memberships, and degrees of overlap. The results of the experiment can be summarized in the following:

(1) All three methods-LDA, QDA and neural networks-are fairly equal in their ability to classify objects under the conditions most favorable to LDA, except when sample size is small, in which case neural networks are much better and QDA is much worse.

(2) LDA and QDA improve their performance with increasing sample size. The effect of sample size on neural networks is not as significant.

(3) The discriminant analysis (DA) approach is better when the problem is simple, with small numbers of variables and groups. When the problem is more complex, neural networks (NN) are better.

(4) The difference in performance between the DA and the NN models is not uniform across proportions of group memberships, degrees of overlap, and sample sizes, and their interactions.

(5) Both DA and NN models have similar response patterns to the five design factors. However, NN models are less sensitive (more robust) to all the factors.

In addition, based on previous studies [9,19,39] and our own experience, some general guidelines can be stated for building neural network models:

(6) One hidden layer seems sufficient for most classification tasks.


(7) The number of hidden nodes depends on the objective of the study. For a higher classification rate on a given set of data, more hidden nodes lead to better performance. For greater generalizability, however, more hidden nodes may lead to worse performance. Our preference is to start small and increase the number of hidden nodes one at a time. The choices made in this study are the best for generalization for this data set.

(8) The number of output nodes is not critical in this study.

REFERENCES

1. A. D. Gordon, Classification. Chapman & Hall, London (1981).
2. N. Capon, Credit scoring systems: a critical analysis. J. Marketing 46, 82-91 (1982).
3. S. Sharma and V. Mahajan, Early warning indicators of business failures. J. Marketing 44, 80-89 (1980).
4. R. A. Walkling, Predicting tender offer success: a logistic analysis. J. Financial Quant. Analysis 20, 461-478 (1985).
5. R. Y. Awh and D. Waters, A discriminant analysis of economic, demographic, and attitudinal characteristics of bank charge-card holders: a case study. J. Finance 973-980 (1983).
6. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, New York (1973).
7. R. A. Fisher, The statistical utilization of multiple measurements. Annals Eugen. 8, 376-386 (1938).
8. C. A. B. Smith, Some examples of discrimination. Annals Eugen. 13, 272-282 (1946).
9. R. Lippmann, An introduction to computing with neural nets. IEEE ASSP Mag. 4, 2-22 (1987).

10. J. C. Hoskins, K. M. Kaliyur and D. M. Himmelblau, Incipient fault detection and diagnosis using artificial neural networks. Proc. Int. Joint Conf. Neural Networks I, 81-86 (1990).
11. K. J. Lang and M. J. Witbrock, Learning to tell two spirals apart. Proc. Connectionist Models Summer School, 52-59 (1988).
12. Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Handwritten digit recognition with a back-propagation network. Adv. Neural Inform. Process. Syst. 2, 396-404 (1990).
13. G. L. Martin and J. A. Pittman, Recognizing hand-printed letters and digits. Adv. Neural Inform. Process. Syst. 2, 405-414 (1990).
14. M. T. Leung, W. E. Engeler and P. Frank, Fingerprint processing using backpropagation neural networks. Proc. Int. Joint Conf. Neural Netw. I, 15-20 (1990).
15. D. B. Malkoff, A neural network for real-time signal processing. Adv. Neural Inform. Process. Syst. 2, 248-257 (1990).
16. T. J. Sejnowski, B. P. Yuhas, M. H. Goldstein Jr and R. E. Jenkins, Combining visual and acoustic speech signals with a neural network improves intelligibility. Adv. Neural Inform. Process. Syst. 2, 232-239 (1990).
17. P. M. Shea and F. Liu, Operational experience with a neural network in the detection of explosives in checked airline baggage. Proc. Int. Joint Conf. Neural Netw. II, 175-178 (1990).
18. J. C. Singleton, Neural nets for bond rating improved by multiple hidden layers. Int. Joint Conf. Neural Netw. II, 151-162 (1990).
19. K. Y. Tam and M. Y. Kiang, Managerial applications of neural networks: the case of bank failure predictions. Mgmt Sci. 38, 926-947 (1992).
20. T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 2nd Edn. Wiley, New York (1984).
21. A. P. Rubin, A comparison of linear programming and parametric approaches to the two-group discriminant problem. Decis. Sci. 21, 373-386 (1990).
22. SAS User's Guide: Statistics, Version 5. SAS Institute, N.C.
23. N. Freed and F. Glover, Evaluating alternative linear programming formulations for the two-group discriminant problem. Decis. Sci. 17, 151-162 (1986).
24. S. M. Bajgier and A. V. Hill, An experimental comparison of statistical and linear programming approaches to the discriminant problem. Decis. Sci. 13, 604-618 (1982).
25. E. P. Markowski and C. A. Markowski, Some difficulties and improvements in applying linear programming formulations to the discriminant problem. Decis. Sci. 16, 237-247 (1985).
26. W. V. Gehrlein, General mathematical programming formulations for the statistical classification problem. Ops Res. Lett. 5, 299-304 (1986).
27. A. Stam and E. A. Joachimsthaler, Solving the classification problem in discriminant analysis via linear and nonlinear programming methods. Decis. Sci. 20, 285-293 (1989).
28. E. A. Joachimsthaler and A. Stam, Four approaches to the classification problem in discriminant analysis: an experimental study. Decis. Sci. 19, 322-333 (1988).
29. W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115-133 (1943).
30. DARPA Neural Network Study. Lincoln Laboratory, MIT Press, Cambridge, Mass. (1988).
31. Y.-H. Pao, Adaptive Pattern Recognition and Neural Net Implementation. Addison-Wesley, Reading, Mass. (1989).
32. P. D. Wasserman, Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York (1989).
33. D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Edited by D. E. Rumelhart and J. L. McClelland). MIT Press, Cambridge, Mass. (1986).
34. M. S. Hung and J. W. Denton, Training neural networks using a nonlinear programming approach to back propagation. Europ. J. Opl Res. (1993).
35. V. Subramanian and M. S. Hung, A GRG2-based system for training neural networks: design and computational experience. ORSA J. Comput. (1993).
36. R. Fletcher, Practical Methods of Optimization. Wiley, New York (1987).
37. L. S. Lasdon and A. D. Waren, GRG2 User's Guide. School of Business Admin., University of Texas at Austin (1986).
38. J. W. Denton, M. S. Hung and B. A. Osyk, A neural network approach to the classification problem. Expert Syst. Applic. 1, 417-424 (1990).

39. W. Y. Huang and R. P. Lippmann, Comparisons between neural net and conventional classifiers. IEEE 1st Int. Conf. Neural Networks IV, 485-493 (1987).
40. G. E. P. Box, W. G. Hunter and J. S. Hunter, Statistics for Experimenters. Wiley, New York (1978).
41. E. S. Gilbert, The effect of unequal variance-covariance matrices on Fisher's LDF. Biometrics 25, 505-516 (1969).
42. S. Marks and O. J. Dunn, Discriminant functions when covariance matrices are unequal. J. Am. Statist. Assoc. 69, 555-559 (1974).
43. A. Wald, On a statistical problem arising in the classification of an individual into one of two groups. Ann. Math. Statist. 15, 145-162 (1944).