Robust linear discriminant analysis for chemical pattern recognition

ROBUST LINEAR DISCRIMINANT ANALYSIS FOR CHEMICAL PATTERN RECOGNITION

YANG LI, JIAN-HUI JIANG, ZENG-PING CHEN, CHENG-JIAN XU AND RU-QIN YU*
College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, People’s Republic of China

SUMMARY

Linear discriminant analysis (LDA) is an effective tool in multivariate multigroup data analysis. A standard technique for LDA is to project the data from a high-dimensional space onto a perceivable subspace such that the data can be separated by visual inspection. The criterion of LDA, unfortunately, is extremely susceptible to outliers, which commonly occur because of instrument drift and gross errors. This paper proposes a robust discriminant criterion and, based on that criterion, develops a high-breakdown method for LDA. To avoid trapping in local optima, a real genetic algorithm (RGA) was used to optimize the criterion. The RGA is capable of locating the global optimal solution with high probability and acceptable computational burden. Classification of one simulated data set and two real chemical ones shows that the developed robust LDA (RLDA) method performs much better than the standard method on outlier-contaminated data and comparably to the standard one on data without outliers. Copyright 1999 John Wiley & Sons, Ltd.

KEY WORDS: outliers; linear discriminant analysis (LDA); robust linear discriminant analysis (RLDA); real genetic algorithm (RGA)

1. INTRODUCTION

Linear discriminant analysis (LDA) is one of the most useful pattern recognition techniques.1,2 A standard method for LDA is to seek one discriminant vector along which the original high-dimensional data are projected while the discriminant information is optimally retained. This method makes possible the classification by visual inspection. It is known that the method is Bayes-optimal in the case where the two groups under study follow a normal distribution and have the same covariance structure. Unfortunately, chemical data structures in practice generally deviate from the ideal situation, and research on new LDA methodologies is still of considerable significance in chemometrics. It has been shown that in the case where two groups under study do not have similar normal structures, improved classification ability can be frequently achieved by the introduction of one more discriminant vector.3–7 For two-group problems, Sammon proposed two orthonormal discriminant vectors which form the optimal discriminant plane.3 Foley and Sammon later presented an optimal discriminant vector set which provided a powerful tool for feature extraction.4

These standard LDA methods, however, have a serious shortcoming in that they are not robust to outliers. It is known that the discriminant vector of LDA is defined by the true mean vectors and common covariance matrix from which the data come. These parameters are generally unknown and are estimated by the sample mean and sample covariance with a training data set collected for each group. The core of the problem lies in the fact that these sample statistics are non-robust. Actually, the projection vector might be dramatically shifted by the presence of even a single outlier, resulting in an increased misclassification rate. Some measures have to be taken to mitigate the effect of outliers, which has led to the suggestion of using more robust estimates of sample mean and covariance. An early robust estimation proposal was M-estimation. More details about it can be found in References 8–10. The weakness of M-estimation is its low breakdown point of 1/m, where m is the dimension of the data. The breakdown point is the smallest percentage of contaminated data that can cause the estimator to take on arbitrarily large aberrant values.11 When m = 20, for example, M-estimation performs unsatisfactorily for a data set with more than 5% outliers.

JOURNAL OF CHEMOMETRICS
J. Chemometrics 13, 3–13 (1999)

* Correspondence to: R.-Q. Yu, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, People’s Republic of China. E-mail: [email protected]
Grant sponsor: National Natural Science Foundation of China.

CCC 0886–9383/99/010003–11 $17.50
Copyright 1999 John Wiley & Sons, Ltd.

Received 26 January 1998
Accepted 21 September 1998

Instead of M-estimation, there are such high-breakdown estimates as the minimum volume ellipsoid (MVE) and minimum covariance determinant (MCD) estimates of sample mean and covariance.12,13 These provide the assurance of protection against a much higher proportion of outliers. The price paid for this protection is some loss of statistical efficiency. To reach the contamination limit of 50%, the MVE and MCD methods require the data set to follow a normal distribution. A drawback with both methods is that they are unable to bypass the ‘curse of dimensionality’. This implies that the variances of these estimates are large in the case of a small sample set.

This paper reports a newly developed robust LDA (RLDA) method with some improved properties. The method provides good separation between different groups even when the group structures under treatment depart substantially from the assumed normal models. It might be guessed that the RLDA method would possess a breakdown point as high as the MVE and MCD methods under the same assumption of the data set with a normal distribution. By projecting the data set from a higher-dimensional space onto a lower one, this method realizes the reduction of dimension which satisfactorily avoids the ‘curse of dimensionality’.

In an effort to circumvent local optima trapping, a real genetic algorithm (RGA) was used for the optimization of the criterion. The RGA is capable of locating the global optimum with high probability and acceptable computational burden. To avoid difficulty in selection of genetic operators, the optimization problem with constraints was converted to a constraint-free one, which made the proposed method computationally easier.

Following a quick review of standard LDA, Section 2 describes the proposed RLDA method. In Section 3 the RLDA method is evaluated by treating one simulated and two real chemical data sets from different sources and comparing the results with those of the standard LDA method. Finally some results and discussion are given in Section 4.

2. THEORY

A standard technique for LDA is to seek a discriminant vector along which the projections of different groups are separated as much as possible. It can be considered as a special case of projection pursuit (PP) which is concerned with the projection of a high-dimensional point cloud by numerically maximizing a certain projection index. For simplicity and without loss of generality, only two-group problems will be considered.

The standard LDA method searches for the projection vector $\mathbf{a}$ by doing PP with the projection index being a t-statistic which has the form

$$t(\mathbf{a}) = \frac{(\mathbf{a}^{\mathrm{T}}\mathbf{D})^{2}}{\mathbf{a}^{\mathrm{T}}\mathbf{W}\mathbf{a}} \tag{1}$$

where $\mathbf{a}$ is the m-dimensional projection vector and the superscript T denotes the transpose of a matrix. The sample group mean displacement $\mathbf{D}$ is computed as

$$\mathbf{D} = \mathbf{m}_1 - \mathbf{m}_2 \tag{2}$$

where

$$\mathbf{m}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}\mathbf{x}_{ij}$$

is the sample mean vector for class i. The within-class covariance matrix $\mathbf{W}$ is defined as

$$\mathbf{W} = \mathbf{S}_1 + \mathbf{S}_2 \tag{3}$$

where

$$\mathbf{S}_i = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(\mathbf{x}_{ij} - \mathbf{m}_i)(\mathbf{x}_{ij} - \mathbf{m}_i)^{\mathrm{T}}$$
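As an illustration of equations (1)-(3), the following minimal sketch evaluates the standard LDA projection index for a given direction. It uses only the Python standard library; the function names and the toy data are assumptions made for this example and do not come from the original work (which used Matlab).

```python
def mean_vector(rows):
    """Column-wise sample mean m_i of a list of equal-length samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix S_i with the 1/(n_i - 1) normalisation."""
    n, m = len(rows), len(rows[0])
    mu = mean_vector(rows)
    centred = [[x - c for x, c in zip(row, mu)] for row in rows]
    return [[sum(r[p] * r[q] for r in centred) / (n - 1) for q in range(m)]
            for p in range(m)]


def lda_criterion(a, class1, class2):
    """Equation (1): t(a) = (a^T D)^2 / (a^T W a), D = m1 - m2, W = S1 + S2."""
    m1, m2 = mean_vector(class1), mean_vector(class2)
    s1, s2 = covariance(class1), covariance(class2)
    m = len(a)
    d = [u - v for u, v in zip(m1, m2)]
    w = [[s1[p][q] + s2[p][q] for q in range(m)] for p in range(m)]
    numerator = sum(ai * di for ai, di in zip(a, d)) ** 2
    denominator = sum(a[p] * w[p][q] * a[q] for p in range(m) for q in range(m))
    return numerator / denominator


# Toy two-dimensional data (made-up numbers, not from the paper):
# the classes differ only along the first coordinate.
class1 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
class2 = [[4.0, 0.0], [5.0, 0.0], [4.0, 1.0], [5.0, 1.0]]
t_good = lda_criterion([1.0, 0.0], class1, class2)  # separating direction
t_poor = lda_criterion([0.0, 1.0], class1, class2)  # uninformative direction
```

For this toy data the index is large along the first axis and zero along the second, which is the behaviour the criterion is designed to reward. Because every quantity is built from sample means and covariances, a single extreme sample can change the value arbitrarily.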

It is well known that the projection index of LDA was originally defined by the true mean vectors $\mathbf{m}_i$ and the covariance matrices $\mathbf{S}_i$. These parameters provide a means for a measure of separability and to classify an unknown which is believed to come from the same source population. The true parameters, nonetheless, are commonly unknown and it is the function of the training data to help provide estimates for them. An example is the projection index as mentioned above. Because of the sample statistics’ sensitivity to outliers, the standard LDA method performs poorly in classifying samples in the presence of outliers. For this reason a robust linear discriminant criterion is defined as

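$$t(\mathbf{a}_i) = \frac{\left|\mathrm{med}(\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_1) - \mathrm{med}(\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_2)\right|}{\mathrm{MAD}(\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_1) + \mathrm{MAD}(\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_2)}, \qquad i = 1, 2 \tag{4}$$

where $\mathbf{X}_i$ is the sample matrix of the ith class and med refers to the sample median. MAD is the median absolute deviation12 given by

$$\mathrm{MAD}(\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_j) = \mathrm{med}\left(\left|\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_j - \mathrm{med}(\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_j)\right|\right), \qquad j = 1, 2 \tag{5}$$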

The motivation of the criterion is easy to understand from a geometric point of view. As shown in Figure 1 the numerator is a robust measure of separability between two one-dimensional normal populations, while the denominator is the sum of two robustified scatter measures of normal populations.
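The robust criterion of equations (4) and (5) can be sketched in a few lines. The code below is an illustrative assumption rather than the authors' implementation: the helper names are invented here and the data are made-up numbers chosen so that the effect of one gross outlier is easy to follow by hand.

```python
from statistics import median


def project(a, rows):
    """Project every sample (row) onto the direction a."""
    return [sum(ai * xi for ai, xi in zip(a, row)) for row in rows]


def mad(values):
    """Median absolute deviation about the median, equation (5)."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def robust_criterion(a, class1, class2):
    """Equation (4): |med1 - med2| / (MAD1 + MAD2) along direction a."""
    p1, p2 = project(a, class1), project(a, class2)
    return abs(median(p1) - median(p2)) / (mad(p1) + mad(p2))


# Made-up data: two well separated groups along the first coordinate.
class1 = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
class2 = [[10.0, 0.0], [11.0, 0.0], [12.0, 0.0], [13.0, 0.0], [14.0, 0.0]]
direction = [1.0, 0.0]
t_clean = robust_criterion(direction, class1, class2)

# Replace one sample of class 1 by a gross error.
contaminated = class1[:4] + [[1000.0, 0.0]]
t_outlier = robust_criterion(direction, contaminated, class2)
```

In this example the gross error leaves both the median and the MAD of class 1 unchanged, so the criterion value is the same with and without the outlier, whereas the sample mean of class 1 would move from 2 to about 201.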

It is important to note that the assumption that the data set follows a normal distribution in one dimension can be guaranteed by the central limit theorem with a sufficiently large number of data.12 Unfortunately, this is not the case in practice, since the sample number is limited in experiments. Therefore efforts were made to make the proposed criterion (equation (4)) robust both to the deviation of the data distribution from the normal one and to outliers. Since it considers the separate estimation of the mean vector and covariance of each group, the projection index can be applied directly to heteroscedastic cases.

Figure 1. Graphical representation of criterion of proposed robust linear discriminant analysis. The �s and *s represent classes 1 and 2 respectively. When $|\mathrm{med}(\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_1) - \mathrm{med}(\mathbf{a}_i^{\mathrm{T}}\mathbf{X}_2)| / (3\sigma_1 + 3\sigma_2) > 1$ the two groups can be separated well ($\sigma = \mathrm{MAD}/0.6745$).

Based on the robust discriminant criterion, a robust linear discriminant analysis (RLDA) method with a high breakdown point can be developed. In fact, by maximizing the criterion $t(\mathbf{a})$ subject to

$$\mathbf{a}_i^{\mathrm{T}}\mathbf{a}_j = \delta_{ij}, \qquad \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \tag{6}$$

one can obtain an optimal discriminant plane. Then the dimensionality reduction is realized and the discriminant analysis can conveniently be studied in two-dimensional space. The discriminant power can be further improved by using a non-linearly separating boundary in the space of reduced dimension. With such a choice of these two vectors the proposed method has two merits: one is that more outliers on two projection vectors may be discovered; the other is that the performance of this method is satisfactory even when there is a departure from the assumed distributional model.

In standard LDA the discriminant vectors are calculated by maximizing the criterion (equation (1)) subject to $\mathbf{a}_i^{\mathrm{T}}\mathbf{W}\mathbf{a}_j = 1$ ($i = j$) and $\mathbf{a}_i^{\mathrm{T}}\mathbf{W}\mathbf{a}_j = 0$ ($i \neq j$). Under such a constraint the standard LDA method does not provide an orthonormal co-ordinate system for discrimination in most cases. In robust LDA, however, the discriminant vectors are extracted by maximizing equation (4) subject to $\mathbf{a}_i^{\mathrm{T}}\mathbf{a}_j = 1$ ($i = j$) and $\mathbf{a}_i^{\mathrm{T}}\mathbf{a}_j = 0$ ($i \neq j$), which results in orthonormal projection vectors. It has been suggested that the orthonormal discriminant vector method is more powerful than ordinary discriminant analysis.14 In this paper the orthonormal discriminant vector approach is adapted to separate groups. An alternative approach is to estimate the covariance matrix by robust methods, in which case the orthonormality constraint is not retained.13

Genetic algorithms have the ability to optimize the placement of a separating surface in a data space.15 Since there exists no analytical solution to the above non-convex maximization problem, a real genetic algorithm (RGA) was used to solve it. The detailed principle and steps of the genetic algorithm16–18 for maximizing the criterion of equation (4) are as follows.

1. Randomly initialize a population of $N_p$ candidate solutions coded as vectors such that each gene of an individual takes a value from −1 to 1 with equal probability. Each individual is a vector of m components. $N_p$ is a user-defined parameter and here is set to 20.

2. Calculate the values of the criterion, equation (4), of each individual in the population.

3. If the number of generations calculated reaches 800 or the optimal performance has not been improved for 100 generations, the algorithm terminates; otherwise it continues.

4. Evaluate the fitness of individuals in the current population using the equation

   $$\mathrm{fitness}(i) = \frac{1}{1 + \exp[-(a\,t(\mathbf{a}) + b)]} \tag{7}$$

   where a and b are two constants and are determined in a manner such that the following set of equations holds:

   $$a\,t(\mathbf{a})_{[N_p/4]} + b = 3, \qquad a\,t(\mathbf{a})_{[N_p - N_p/4]} + b = -3 \tag{8}$$

5. Select a parent population of 20 individuals from the current population.

6. Pair the (2j − 1)th parent with the (2j)th parent (j = 1, …, $N_p$/2) in the parent population.

7. Cross over the $N_p$/2 parent pairs to generate $N_p$ offspring (offspring population 1).

8. Mutate each gene in offspring population 1 with a probability $p_m$ so as to produce another offspring population (offspring population 2).

9. Calculate the criterion values of individuals in each offspring population and evaluate their fitness using equation (7).

10. All individuals in the current population and these two offspring populations are exposed to competition. The individual with the largest fitness score is allowed to survive so as to generate a surviving population.

11. Let each individual in the surviving population self-reproduce with a probability $p_r$ and subsequently update the individuals by their own currently best offspring generated by self-reproduction.

12. The updated surviving population is subjected to the diversification operator so as to generate a population of size $N_p$ for the next generation.

13. Return to step 3.
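The fitness scaling of step 4 can be sketched as follows. This is a hedged illustration only: it assumes that the bracketed indices in equation (8) refer to the criterion values ranked in descending order, so that the individual of rank $N_p/4$ is mapped to 3 and the individual of rank $N_p - N_p/4$ to −3 before the logistic transform of equation (7). The function name `sigmoid_fitness` and the example values are hypothetical.

```python
import math


def sigmoid_fitness(criterion_values):
    """Scale criterion values into (0, 1) following equations (7) and (8).

    Assumption: t_[k] denotes the k-th largest criterion value in the
    population. The two reference values must differ, otherwise the
    linear system for a and b has no solution.
    """
    n_p = len(criterion_values)
    ranked = sorted(criterion_values, reverse=True)
    t_upper = ranked[n_p // 4 - 1]        # individual of rank [Np/4]
    t_lower = ranked[n_p - n_p // 4 - 1]  # individual of rank [Np - Np/4]
    # Solve a * t_upper + b = 3 and a * t_lower + b = -3 for a and b.
    a = 6.0 / (t_upper - t_lower)
    b = 3.0 - a * t_upper
    return [1.0 / (1.0 + math.exp(-(a * t + b))) for t in criterion_values]


# Hypothetical criterion values for a population of Np = 20 individuals.
values = [float(k) for k in range(1, 21)]
fitness = sigmoid_fitness(values)
```

Under this reading the best quarter of the population receives fitness values above roughly 0.95 and the worst quarter values below roughly 0.05, so selection pressure does not depend on the absolute scale of the criterion.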

It is known that when a genetic algorithm is applied to an optimization problem with constraints, considerable difficulty will be encountered in the design of genetic operators such as crossover and mutation. In this paper, to avoid the difficulty, the optimization problem of discriminant vectors subject to the constraint of orthonormality is converted to a constraint-free one by using the projection matrices $\mathbf{P}_i$ of the feasible space of $\mathbf{a}_i$ (i = 1, 2) as defined by

$$\mathbf{P}_1 = \mathbf{I}$$

$$\mathbf{P}_2 = (\mathbf{I} - \mathbf{a}_1\mathbf{a}_1^{\mathrm{T}})$$

where $\mathbf{I}$ is the m × m identity matrix. Let

$$\mathbf{a}_i = \mathbf{P}_i\mathbf{b}_i, \qquad i = 1, 2 \tag{9}$$

where $\mathbf{b}_i$ is the coefficient vector of $\mathbf{a}_i$. As there are no constraints on $\mathbf{b}_i$, optimization of $\mathbf{b}_i$ using the RGA can be implemented straightforwardly and the discriminant vectors $\mathbf{a}_i$ can be calculated according to equation (9).
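The reparameterization of equation (9) can be checked numerically. The sketch below is an assumption-laden illustration: the rescaling to unit length is added here (the text does not spell it out) and is harmless for the criterion of equation (4), since medians and MADs scale by the same factor. The function names and coefficient vectors are hypothetical.

```python
import math


def normalise(v):
    """Rescale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def first_vector(b1):
    """a1 = P1 b1 with P1 = I, rescaled to unit length (assumed step)."""
    return normalise(b1)


def second_vector(a1, b2):
    """a2 = (I - a1 a1^T) b2 for a unit vector a1, rescaled to unit length.

    The product is evaluated as b2 - a1 (a1^T b2), which avoids forming
    the m x m matrix. b2 must not be parallel to a1.
    """
    dot = sum(p * q for p, q in zip(a1, b2))
    return normalise([q - dot * p for p, q in zip(a1, b2)])


# Hypothetical unconstrained coefficient vectors, as an optimizer might propose.
a1 = first_vector([3.0, 4.0, 0.0])
a2 = second_vector(a1, [1.0, 1.0, 1.0])
inner = sum(p * q for p, q in zip(a1, a2))
```

Whatever unconstrained $\mathbf{b}_2$ is proposed, the resulting $\mathbf{a}_2$ is orthogonal to $\mathbf{a}_1$, which is why crossover and mutation need no special handling of the constraint.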

3. EXAMPLES

One simulated data set and two data sets reported in References 19 and 20 were used for evaluating the proposed robust LDA method. To validate the performance of RLDA in the case of the presence of outliers, the data sets have been processed both by RLDA and by standard LDA and the results are compared.

3.1. Simulated data

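The simulated data set is composed of two classes, each consisting of 50 four-dimensional samples. Classes 1 and 2 follow normal distributions with different mean vectors $\mathbf{m}$ and covariance matrices $\mathbf{S}$:

$$\text{class 1:}\quad \mathbf{m}_1 = E(\mathbf{x}) = (2, 3, 3, 2)^{\mathrm{T}}, \qquad \mathbf{S}_1 = E[(\mathbf{x} - \mathbf{m}_1)(\mathbf{x} - \mathbf{m}_1)^{\mathrm{T}}] = \mathrm{diag}(2^2, 1^2, 1^2, 1^2)^{\mathrm{T}}$$

$$\text{class 2:}\quad \mathbf{m}_2 = E(\mathbf{x}) = (8, 6, 7, 8)^{\mathrm{T}}, \qquad \mathbf{S}_2 = E[(\mathbf{x} - \mathbf{m}_2)(\mathbf{x} - \mathbf{m}_2)^{\mathrm{T}}] = \mathrm{diag}(1^2, 1^2, 3^2, 1^2)^{\mathrm{T}}$$

where diag($x_1$, $x_2$, $x_3$, $x_4$) is the diagonal matrix with diagonal elements $x_1$, $x_2$, $x_3$ and $x_4$.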
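A data set of this kind could be generated as sketched below. The sketch assumes that the diagonal entries of the covariance matrices are variances written as squares, so that the component standard deviations are (2, 1, 1, 1) and (1, 1, 3, 1). The seed and function name are arbitrary choices for this example, so the numbers drawn will not reproduce the authors' data.

```python
import random


def simulate_class(mean, std, n, rng):
    """Draw n samples with independent normal components (diagonal covariance)."""
    return [[rng.gauss(mu, sd) for mu, sd in zip(mean, std)] for _ in range(n)]


rng = random.Random(0)  # fixed seed chosen arbitrarily for repeatability

# Assumed standard deviations: square roots of the diagonal covariance entries.
class1 = simulate_class([2.0, 3.0, 3.0, 2.0], [2.0, 1.0, 1.0, 1.0], 50, rng)
class2 = simulate_class([8.0, 6.0, 7.0, 8.0], [1.0, 1.0, 3.0, 1.0], 50, rng)
```

Because the covariance matrices are diagonal, each coordinate can be drawn independently, and the two classes have different scatter, which makes the example heteroscedastic.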

3.2. Blood data

The blood data19 consist of 26 blood samples belonging to two categories: healthy persons and patients who have coronary heart disease. Each sample is represented by a set of measurements on four trace elements in the blood: Sr, Cu, Mg and Zn. There is the same number of samples in each group.



3.3. QSAR data

The quantum chemical descriptor data20 are made up of 188 samples with eight descriptors. These samples fall into two categories according to their toxicity. The prediction set is composed of ten samples from class 1 and 80 samples from class 2 selected randomly. The remaining samples, i.e. 40 from class 1 and 58 from class 2, comprise the learning set, in which about 25% of the samples are randomly mislabelled to act as outliers.

Figure 2. Projection of original simulated data onto first two discriminant vectors: (a) discriminant vectors obtained using standard LDA method; (b) discriminant vectors obtained using robust LDA method. Projection of simulated data with 49% outlier contamination onto first two discriminant vectors: (c) discriminant vectors obtained using standard LDA method; (d) discriminant vectors obtained using robust LDA method. Category 1, *; category 2, �.



All calculations were carried out on an IBM-PC Pentium computer and all computer programs were written in Matlab (Matlab for Windows, Version 4.0).

4. RESULTS AND DISCUSSION

To start the RGA, an appropriate setting must be selected. A consistent parameter setting was used in this work for both data sets studied, where $N_p$ was set to 20, $p_m$ was set to 0.3 and $p_r$ was set to 0.1.

In order to get a clear comparison, in the standard LDA method the second projection vector $\mathbf{a}_2$ was calculated by doing PP with projection index equation (1) subject to $\mathbf{a}_1^{\mathrm{T}}\mathbf{W}\mathbf{a}_2 = 0$. Consequently, the two-dimensional displays of both data sets for both approaches (LDA and RLDA) were obtained.

As presented in Figures 2(a) and 2(b), the two classes of the simulated data can be discriminated clearly by both the standard LDA and proposed RLDA methods. To illustrate the high breakdown point of the RLDA method, 30%, 40%, 48% and 49% outliers were added into the simulated data by mislabelling the samples and both methods were carried out for them. In each case the standard LDA method performed poorly, while the RLDA method separated the two classes of samples satisfactorily. Figures 2(c) and 2(d) show the results for the 49% outlier-contaminated data set obtained by the standard LDA and RLDA methods respectively. Samples belonging to the different classes are overlapped by the standard LDA method, while the RLDA method shows a much better performance. These results suggest that the RLDA method possesses a fairly high breakdown point. It was also found that the MVE method performed as well as the RLDA method in this case.
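The effect of mislabelling on location estimates can be illustrated in one dimension. The sketch below is not the authors' procedure: the exact mislabelling scheme is not described in the text, so swapping an equal number of samples between the two classes is an assumption, as are the function name, the seed and the made-up data. With 50 samples per class a nominal fraction of 49% rounds down to 24 swapped samples.

```python
import random
from statistics import mean, median


def mislabel(class1, class2, fraction, rng):
    """Swap a given fraction of samples between two equally sized classes.

    Assumed contamination scheme: the swapped samples keep their values but
    carry the wrong class label afterwards.
    """
    k = int(fraction * len(class1))
    idx1 = rng.sample(range(len(class1)), k)
    idx2 = rng.sample(range(len(class2)), k)
    new1, new2 = list(class1), list(class2)
    for i, j in zip(idx1, idx2):
        new1[i], new2[j] = class2[j], class1[i]
    return new1, new2


# Made-up one-dimensional projections: class 1 near 0, class 2 near 10.
clean1 = [0.01 * i for i in range(50)]
clean2 = [10.0 + 0.01 * i for i in range(50)]
noisy1, noisy2 = mislabel(clean1, clean2, 0.49, random.Random(1))

mean_gap = abs(mean(noisy1) - mean(noisy2))        # separation seen by sample means
median_gap = abs(median(noisy1) - median(noisy2))  # separation seen by medians
```

In this toy setting the gap between the sample means almost vanishes, while the gap between the medians stays close to the true separation of about 10, because the correctly labelled samples still form the majority of each class.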

The blood data set was analysed by the standard LDA and proposed robust linear discriminant methods. The results are presented in Figures 3 and 4. Figure 3 shows that both methods work well for the data set without outliers. Two points are misclassified for both methods. When 23% artificial outliers produced by mislabelling the samples (samples 5, 8, 11, 20, 22 and 25) are added into the data set, the situation changes dramatically as shown in Figure 4. It is apparent that the results of the standard LDA have deteriorated (Figure 4(a)). The points of the different categories are overlapped, which makes the classification impossible. The RLDA, however, performs as well as in the case where no outliers exist (Figure 4(b)). There are only two points misclassified.

Figure 3. Projection of original blood data onto first two discriminant vectors: (a) discriminant vectors obtained using standard LDA method; (b) discriminant vectors obtained using robust LDA method. Category 1, *; category 2, �.

Figure 5 shows the results of the standard LDA and proposed RLDA methods for the QSAR data set. An inspection of the original data table20 reveals that the 180th sample is an obvious outlier. When about 14% artificial outliers are introduced, they severely influence the performance of the standard LDA (Figure 6(a)). As presented in the amplified Figure 6(a'), about 17 points are not classified correctly and the two groups are separated to some extent. The RLDA method, however, provides a different picture. It not only separates the samples into two groups well but also reveals the outlier existing in the original data set. Note that the outlier is on the second projection vector. Again the results of Figure 6(b) show the RLDA method’s resistance to outliers.

Figure 4. Projection of blood data with artificial outliers onto first two discriminant vectors: (a) discriminant vectors obtained using standard LDA method; (b) discriminant vectors obtained using robust LDA method. Category 1, *; category 2, �.

Figure 5. Projection of original QSAR data onto first two discriminant vectors: (a) discriminant vectors obtained using standard LDA method; (b) discriminant vectors obtained using robust LDA method. Category 1, *; category 2, �.

Figure 6. Projection of QSAR data with artificial outliers onto first two discriminant vectors: (a) discriminant vectors obtained using standard LDA method; (a') amplification of (a); (b) discriminant vectors obtained using robust LDA method. Category 1, *; category 2, �.



The predictive performance of RLDA compared with that of standard LDA for the QSAR data is illustrated in Figures 7 and 8. When about 25% of the samples in the training set are mislabelled, the standard LDA method (Figure 7) performs poorly for both training and prediction. As Figure 8 shows, the proposed technique yields satisfactory results. Discriminant vectors are obtained by RLDA even from the outlier-contaminated training set (Figure 8(a)). Consequently, by projection onto the discriminant vectors obtained, the prediction samples can be separated correctly (Figure 8(b)).

Figure 7. (a) Projection of training samples from QSAR data onto first two discriminant vectors obtained using standard LDA method. (b) Projection of prediction samples onto discriminant vectors obtained from standard LDA learning with training set. Category 1, *; category 2, �.

Figure 8. (a) Projection of training samples from QSAR data onto first two discriminant vectors obtained using RLDA method. (b) Projection of prediction samples onto discriminant vectors obtained from RLDA learning with training set. Category 1, *; category 2, �.

To investigate the reproducibility of the optimization step, the RGA was repeated ten times. The results obtained were very similar. In fact, such reproducible solutions are guaranteed by the local search in the optimization procedure.

From the above, the proposed RLDA method shows high resistance to the effect of contamination in the data set. The RLDA possesses a breakdown point as high as the MVE and MCD methods under the same assumption that the data set follows a normal distribution. Compared with these latter methods, the RLDA method avoids the ‘curse of dimensionality’ by projecting the data set from a higher-dimensional space onto a lower one.

5. CONCLUSIONS

Preliminary applications show that the proposed robust LDA method is a promising tool in chemical pattern recognition. The RLDA method has the advantage over standard LDA that it performs well in all cases even in the presence of a large number of outliers. In the case where the data are not contaminated by outliers, the robust LDA method provides comparable performance with the standard LDA method.

ACKNOWLEDGEMENT

The authors are grateful to the National Natural Science Foundation of China for financial support.

REFERENCES

1. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New York (1973).
2. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York (1972).
3. J. W. Sammon, IEEE Trans. Comput. C-19, 826 (1970).
4. D. H. Foley and J. W. Sammon, IEEE Trans. Comput. C-24, 281 (1975).
5. R.-Q. Yu, Introduction to Chemometrics, Hunan Education Press, Changsha (1991).
6. J. Kittler and P. C. Young, Pattern Recognit. 5, 335 (1973).
7. J. Kittler, IEEE Trans. Comput. C-26, 604 (1977).
8. P. J. Huber, Ann. Math. Statist. 35, 73 (1964).
9. P. J. Huber, Robust Statistics, Wiley, New York (1981).
10. W. J. J. Rey, Introduction to Robust and Quasi-Robust Statistics Methods, Springer, New York (1983).
11. F. R. Hampel, Ann. Math. Statist. 42, 1887 (1971).
12. F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions, Wiley, New York (1986).
13. P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, Wiley, New York (1986).
14. W. J. Krzanowski, J. Chemometrics, 9, 509 (1995).
15. R. Shaffer and G. W. Small, Chemometrics Intell. Lab. Syst. 35, 87 (1996).
16. C. B. Lucasius, A. D. Dane and G. Kateman, Anal. Chim. Acta, 282, 647 (1993).
17. J. H. Jiang, J. H. Wang, X. Chu and R. Q. Yu, Anal. Chim. Acta, 354, 263 (1997).
18. J. H. Jiang, J. H. Wang, X. H. Song and R. Q. Yu, J. Chemometrics, 10, 253 (1996).
19. L. Xu, Chemometrics Methods, p. 203, Science Press, Beijing (1995).
20. H. J. M. Verhaar, E. U. Ramos and J. L. M. Hermens, J. Chemometrics, 10, 149 (1996).


