11
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 3, MAY 1980 [91 C. Hilditch, "Linear skeletons form square cupboards," Machine Intell., vol. 4, pp. 403-420, 1969. [101 T. Pavlidis, "Algorithms for shape analysis of contours and wave- forms," in Proc. 4th Int. J. Conf on Pattern Recognition, 1978, pp. 70-85. [111 C. Arcelli and S. Levialdi, "Picture processing and overlapping blobs," IEEE Trans. Comput., vol. C-20, pp. 111 1-1 115, 1971. [121 H. Freeman, "Computer processing of line-drawing images," Comput. Surveys, vol. 6, pp. 57-97, 1974. [131 G. Gallus and P. W. Neurath, "Improved computer chromosome analysis incorporating preprocessing and boundary analysis," Phys. Med. Biol., vol. 15, pp. 435-445, 1970. [141 H. Freeman, "Shape description via the use of critical points," Pattern Recognition, vol. 10, pp. 158-166, 1978. [151 H. Blum and R. N. Nagel, "Shape description using weighted symmetric axis features," Pattern Recognition, vol. 10, pp. 167- 180, 1978. [16] L. G. Shapiro and R. M. Haralick, "Decomposition of two- dimensional shapes by graph-theoretic clustering," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-1, pp. 10-20, 1979. [17] F. J. Rohlf, "Adaptive hierarchical clustering schemes," Syst. Zool., vol. 19, pp. 58-82, 1970. [18] R. Dubes and A. K. Jain, "Models and methods in cluster valid- ity," in Proc. IEEE Conf. on Pattern Recognition and Image Processing, 1978, pp. 148-155. [191 R. 0. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973. Anil K. Jain (S'70-M'72) was born in Basti, - India, on August 5, 1948. He received the B. Tech. degree with distinction from the Indian Institute of Technology, Kanpur, India, in 1969, and the M.S. and Ph.D. degrees in electrical engi- neering from Ohio State University, Columbus, in 1970 and 1973, respectively. From 1971 to 1972 he was a Research Associate in the Communications and Control 0 - ! I 00 0Systems Laboratory, Ohio State University. Then, from 1972 to 1974, he was an Assistant Professor in the Department of Computer Science, Wayne State Univer- sity, Detroit, MI. In 1974, he joined the Department of Computer Science, Michigan State University, where he is currently an Associate Professor. His research interests are in the areas of pattern recognition and image processing. He was recipient of the National Merit Scholar- ship in India. Dr. Jain is a member of the Association for Computing Machinery, the Pattern Recognition Society, and Sigma Xi. Stephen P. Smith was born in Cincinnati, OH, on September 4, 1955. He received the B.S. and the M.S. degrees in computer science from Michigan State University, East Lansing, in 1977 and 1979, respectively. Since 1977, he has been a Graduate Research Assistant in the Pattern Recognition and Image | !Processing Laboratory in the Department of D *':Pr Computer Science at M.S.U. During the sum- mer of 1979, he was a consultant at Babcock and Wilcox's Lynchburg Research Center working on problems of feature selection and image processing in a nondestructive examination environment. Currently, he is working on a Ph.D. degree at M.S.U. His current research interests include shape matching and cluster validity. Mr. Smith is a member of the Association for Computing Machinery and Phi Kappa Phi. -Eric Backer was born in Soestdijk, The Nether- lands, on April 22, 1940. He received the M.S. and Ph.D. degrees in electrical engineering from Delft University of Technology, Delft, The Netherlands, in 1969 and 1978, respectively. 4 Since 1967 he has been at the Delft University of Technology, where he is currently a Senior Staff member of the Information Theory Group. He is engaged in research and teaching in the areas of pattern recognition and image processing. On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithm in Pattern Recognition SARUNAS RAUDYS AND VITALIJUS PIKELIS Abstract-This paper compares four classification algorithms-discrim- inant functions when classifying individuals into two multivariate popu- lations. The discriminant functions (DF's) compared are derived according to the Bayes rule for normal populations and differ in assump- tions on the covariance matrices' structure. Analytical formulas for the expected probability of misclassification EPN are derived and show that the classification error EPN depends on the structure of a classification algorithm, asymptotic probability of misclassification P.., and the ratio of learning sample size N to dimensionality p :N/p for all linear DF 's discussed and N2/p for quadratic DF's. The tables for learning quan- Manuscript received February 27, 1978; revised June 28, 1979. The authors are with Lietuvos RSR Moksly, Adademiha, Lenino, U.S.S.R. tity H = EPN/P., depending on parameters P.,, N, and p for four classifilcation algorithms analyzed are presented and may be used for estimating the necessary learning sample size, detennining the optimal number of features, and choosing the type of the classification algo- rithm in the case of a limited learning sample size. Index Terms-Classification error, dimensionality, discriminant func- tions, pattern recognition, sample size. I. INTRODUCTION S IGNIFICANT research efforts were made in the area of statistical pattern recognition in the case when the learning sample size is limited. Finiteness of the sample size causes 0162-8828/80/0500-0243$00.75 C 1980 IEEE 242

On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithm in Pattern Recognition

Embed Size (px)

Citation preview

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 3, MAY 1980

[91 C. Hilditch, "Linear skeletons form square cupboards," MachineIntell., vol. 4, pp. 403-420, 1969.

[101 T. Pavlidis, "Algorithms for shape analysis of contours and wave-forms," in Proc. 4th Int. J. Conf on Pattern Recognition, 1978,pp. 70-85.

[111 C. Arcelli and S. Levialdi, "Picture processing and overlappingblobs," IEEE Trans. Comput., vol. C-20, pp. 111 1-1 115, 1971.

[121 H. Freeman, "Computer processing of line-drawing images,"Comput. Surveys, vol. 6, pp. 57-97, 1974.

[131 G. Gallus and P. W. Neurath, "Improved computer chromosomeanalysis incorporating preprocessing and boundary analysis,"Phys. Med. Biol., vol. 15, pp. 435-445, 1970.

[141 H. Freeman, "Shape description via the use of critical points,"Pattern Recognition, vol. 10, pp. 158-166, 1978.

[151 H. Blum and R. N. Nagel, "Shape description using weightedsymmetric axis features," Pattern Recognition, vol. 10, pp. 167-180, 1978.

[16] L. G. Shapiro and R. M. Haralick, "Decomposition of two-dimensional shapes by graph-theoretic clustering," IEEE Trans.Pattern Anal. Machine Intell., vol. PAMI-1, pp. 10-20, 1979.

[17] F. J. Rohlf, "Adaptive hierarchical clustering schemes," Syst.Zool., vol. 19, pp. 58-82, 1970.

[18] R. Dubes and A. K. Jain, "Models and methods in cluster valid-ity," in Proc. IEEE Conf. on Pattern Recognition and ImageProcessing, 1978, pp. 148-155.

[191 R. 0. Duda and P. E. Hart, Pattern Classification and SceneAnalysis. New York: Wiley, 1973.

Anil K. Jain (S'70-M'72) was born in Basti,- India, on August 5, 1948. He received the B.

Tech. degree with distinction from the IndianInstitute of Technology, Kanpur, India, in 1969,and the M.S. and Ph.D. degrees in electrical engi-neering from Ohio State University, Columbus,in 1970 and 1973, respectively.From 1971 to 1972 he was a Research

Associate in the Communications and Control0 - !I000Systems Laboratory, Ohio State University.

Then, from 1972 to 1974, he was an Assistant

Professor in the Department of Computer Science, Wayne State Univer-sity, Detroit, MI. In 1974, he joined the Department of ComputerScience, Michigan State University, where he is currently an AssociateProfessor. His research interests are in the areas of pattern recognitionand image processing. He was recipient of the National Merit Scholar-ship in India.Dr. Jain is a member of the Association for Computing Machinery,

the Pattern Recognition Society, and Sigma Xi.

Stephen P. Smith was born in Cincinnati, OH,on September 4, 1955. He received the B.S.and the M.S. degrees in computer science fromMichigan State University, East Lansing, in1977 and 1979, respectively.Since 1977, he has been a Graduate Research

Assistant in the Pattern Recognition and Image| !Processing Laboratory in the Department of

D*':Pr Computer Science at M.S.U. During the sum-mer of 1979, he was a consultant at Babcockand Wilcox's Lynchburg Research Center

working on problems of feature selection and image processing in anondestructive examination environment. Currently, he is working ona Ph.D. degree at M.S.U. His current research interests include shapematching and cluster validity.Mr. Smith is a member of the Association for Computing Machinery

and Phi Kappa Phi.

-EricBacker was born in Soestdijk, The Nether-lands, on April 22, 1940. He received the M.S.and Ph.D. degrees in electrical engineeringfrom Delft University of Technology, Delft,The Netherlands, in 1969 and 1978, respectively.

4 Since 1967 he has been at the Delft Universityof Technology, where he is currently a SeniorStaff member of the Information TheoryGroup. He is engaged in research and teachingin the areas of pattern recognition and imageprocessing.

On Dimensionality, Sample Size, Classification Error,and Complexity of Classification Algorithm in

Pattern RecognitionSARUNAS RAUDYS AND VITALIJUS PIKELIS

Abstract-This paper compares four classification algorithms-discrim-inant functions when classifying individuals into two multivariate popu-lations. The discriminant functions (DF's) compared are derivedaccording to the Bayes rule for normal populations and differ in assump-tions on the covariance matrices' structure. Analytical formulas for theexpected probability of misclassification EPN are derived and show thatthe classification error EPN depends on the structure of a classificationalgorithm, asymptotic probability of misclassification P.., and the ratioof learning sample size N to dimensionality p :N/p for all linear DF 'sdiscussed and N2/p for quadratic DF's. The tables for learning quan-

Manuscript received February 27, 1978; revised June 28, 1979.The authors are with Lietuvos RSR Moksly, Adademiha, Lenino,

U.S.S.R.

tity H = EPN/P., depending on parameters P.,, N, and p for fourclassifilcation algorithms analyzed are presented and may be used forestimating the necessary learning sample size, detennining the optimalnumber of features, and choosing the type of the classification algo-rithm in the case of a limited learning sample size.

Index Terms-Classification error, dimensionality, discriminant func-tions, pattern recognition, sample size.

I. INTRODUCTIONS IGNIFICANT research efforts were made in the area of

statistical pattern recognition in the case when the learningsample size is limited. Finiteness of the sample size causes

0162-8828/80/0500-0243$00.75 C 1980 IEEE

242

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 3, MAY 1980

some effects. Due to this reason parameters of a classificationrule are determined inaccurately; therefore, the classificationerror increases (see, e.g., [1] -[3]). Finiteness of the learningsample causes the peak effect, that is, why the problem ofdetermination of the optimal number of features arises (see,e.g., [41, [5] ). In a case of a finite sample size a classificationerror becomes biased; therefore, special methods are con-structed in order to obtain unbiased estimates [6], [7]. Inorder to use the statistical classification rules correctly-tochoose the proper type of a classification rule, determine theoptimal number of features, and estimate the sufficientlearning sample size-one must know the quantitative depen-dence of the classification error on the learning sample size,the number of features, the type of the classification algo-rithm, etc.This article compares four classification algorithms-discrim-

inant functions when classifying individuals into two multi-variate populations. The discriminant functions (DF's) com-pared are derived according to the Bayes rule for normalpopulations and differ in assumptions on the covariancematrices' structure.

1) The quadratic discriminant function,

g(X) = (X2- S2'X X2 )(X - X1 )S1 (X

+ln s+k; (1)

2) the standard linear DF

g(X) = [X - -(X + X2 )] S l(VI - X-2 );(2)3) the linear DF for independent measurements

g(X) = [X- O2(X + XY2) I _(OX - XV2) (3)

4) a Euclidean distance classifier

g(X) = [X- 2 (OX + X2)] '(X - X2). (4)In formulas (1)-(4) X denotes the p-variate observation

vector to be classified, XI and X2 stand for sample estimatesof population means p1 and P2, SI and S2 are sample esti-mates of population covariance matrices (CM's) El and 2, Sis pooled sample CM S = (SI + S2)/2, and ED is a diagonalmatrix, constructed from diagonal elements of S.The quadratic DF (1) is asymptotically an optimal classifica-

tion rule for classifying into two normal populations, DF (2) isasymptotically optimal for classifying into normal populationswith common CM, etc.In the nonasymptotic case when the learning sample size is

limited the classification rules (1)-(4) are not optimal, andbeing applied to the same pattern recognition problem lead todifferent outcomes. In order to clarify the use of the classifi-cation algorithm in accordance with the sample size, a dimen-sionality, and other characteristics of the problem, the depen-dence of a classification error on the characteristics mentionedabove must be investigated.Before analyzing the relationship between the classification

error and specific characteristics of a pattern recognition prob-lem let us determine several sorts of the probability of mis-classification (PMC).

If the underlying probability density functions are known

to the investigator he may construct an optimal Bayes classi-fier. Its performance (PMC) will be denoted PB and referredto as a Bayes one. When a classifier is designed on the learningsample the PMC will depend on the characteristics of thisparticular sample. Then the PMC may be regarded as a ran-dom variable PN, whose distribution depends on the learningsample size. This PMC will be called a conditional PMC [6].Its expectation EPN over all learning samples of size N1, N2patterns (from the first and second classes) will be regarded asan expected PMC. A theoretical limit P. = lim EPN is calledan asymptotic PMC and is denoted by Poo. N-oThe conditional PMC PN depends both on the type of a

classification rule and on the particular learning sample.Several properties of the probabilities of misclassification

mentioned above may be derived without detailed analyticinvestigations. The Bayes PMC PB depends only on the dis-tribution density functions of the measurements and does notdepend on a learning sample. The asymptotic PMC P0. is thecharacteristics of the classification rule type. If the classifica-tion into two multivariate normal populations is made, theasymptotic PMC of the quadratic DF (1) P,, will coincide withthe Bayes PMC PB; for other discriminant functions theasymptotic probabilities of classification may exceed PB.

It is obvious that PN > PB and one may hope that PN > P.,however, the last inequality always holds only for asymptot-ically optimal classification rules.

If the rule is not an asymptotically optimal one, sometimeswe may observe the case PB < PN < P°.For asymptotic optimal classifiers the expected PMC EPN

and the difference EPN - PO. is diminished when the learningsample size increases.In the experiments with real data the inequality EPN > P°.

usually holds for asymptotically nonoptimal rules, too. How-ever, it is possible to construct a model of true populations, inwhich for some values of NEPN's are less than PO. (e.g., thepopulations consist of clusters with unequal probabilities).Some efforts were made to obtain the quantitative depen-

dence of the classification error EPN on the learning samplesize. John [2] considered the standard linear DF (2) in a casewhen the covariance matrix is known. This is equivalent tothe investigation of the Euclidean distance classifier (4) in thecase of spherically normal populations. John derived anapproximate and an exact formula for the expected PMC. Inthe exact formula EPN was expressed as an infinite sum ofincomplete beta-functions and was practically uncomputable.Analogous double sums for the expected PMC later werederived by Troicky [8] and Moran [9].The standard linear DF (2) was extensively studied by many

investigators. R. Sitgreaves [10] derived the exact formula forEPN in the form of five times infinite sum of the products ofcertain gamma-functions. Her derivation was based on A. H.Bowker's representation of DF (2) in terms of simple statistics[ 1 ] . Unfortunately, the formula of Sitgreaves was practicallyuncomputable. In his unpublished work [5] S. E. Estesreduced this formula to the form suitable for numerical cal-culations. He presented the curves of EPN's dependence onthe ratio N/p for some particular dimensionalities p and PMC'sP0. (he assumed N1 = N2 = N). However, his results were notpublished and remained practically unknown even in hiscountry. On the other hand, Estes' algorithm had some short-

243

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 3, MAY 1980

comings which resulted in low accuracy for some particularvalues of parameters. Simultaneously, several asymptoticexpansions [31, [121, [13] and approximate formulas [14]-[16] for DF (2) were published. The accuracy of theseformulas was not known. Therefore, the problem of accuracyis under consideration until now [17] .The behavior of classification rules (1) and (3) was investi-

gated only by means of simulation [181, [19].In our papers [20]-[25] we derived the formulas for the

expected PMC EPN for DF (1) and (4) in the form of integrals,for the standard linear DF (2)-in the form of the improved S.Estes [5] sum, and for DF (3) EPN was studied by approx-imate formulas and by means of simulation. Quantitative andqualitative relationships between the expected PMC, learningsample size, dimensionality, Mahalonobis distance, and com-plexity of the classification algorithm was obtained in theform of tables and simple asymptotic formulas. The purposeof this publication is to present the results mentioned above toEnglish speaking readers.

II. FORMULAS AND TABLES FOR THEEXPECTED PROBABILITY OF MISCLASSIFICATION

The expected PMC is defimed by

I

EPN =,fPNf(PN) dPN (5)

where f(PN) stands for the probability distribution densityfunction of a random variable PN. Derivation of the densityfunction f(PN) is complicated. An alternate frequently usedexpression is

2EPN = E qiP {(-1)ig(X, ) >OIX E } (6)

i=l

where 0 stands for the random vector of the populationparameter estimates. In the case of the quadratic DF 0 con-sists of vectors Xl, X2 and matrices Sl, S2.As g(X, 0) depends on a random vector, the calculation of

EPN requires multivariate statistical analysis techniques.The discriminant function g(X, 0) is represented as a func-

tion of several scalar independent random variables. Deriva-tion of the expression is presented in papers [20]-[25]. Theprinciple points of the derivation are presented in the Appendix.The formulas for the calculation of the expected probability

of misclassification are rather complex and unsuitable for prac-tical everyday use. It is more convenient to use tabulatedvalues of EPN. Therefore, we present the values of the ratioH = EPNIP°. as a function of learning sample size N = N1 = N2,dimensionality p, and Mahalanobis distance 6 in Table I. Theexpected PMC characterizes the classification performance,while the ratio H = EPNIP°°-we shall call it a learning quan-tity-characterizes the accuracy of determining the discrim-inant function's coefficients. The table contains values of Hfor four classification rules (1)-(4) investigated. The valuesfrom Table I are valid:

1) for rules (1)-(4)-in the case of spherically normal pop-ulations;2) for rules (1)-(3)-if the measurements are independent

and normally distributed;

3) for rules (1), (2)-if the populations are normal with acommon covariance matrix.

It should be noted that computation ofEPN values is rathercomplicated. So the accuracy of our table for rule (1) is nothigh (when 6 = 5.5 and N/p = 1.6-2, the error may reach a fewpercent). Most accurate are the H values for rules (2) and (4),where all the decimal signs are correct.

III. SIMULATION EXPERIMENTS

The values of the learning quantity H = EPN/P presentedin Table I correspondent to an ideal case-spherically normaldistributions. In practical cases the populations are notspherically normal. It is possible to construct theoreticalmodels of populations, for which the value of H would besignificantly greater or smaller than the values presented inTable I. For non-Gaussian data H may be less than 1! Wehope, however, that in practice such cases will be rare. Inorder to estimate the declination of the observed H valuesfrom the tabulated ones, some experiments were made.We used four sets of real data. In each experiment a random

learning sample (of N vectors per class) was drawn from thetotal sample, consisting of Nm vectors from each class. Theparameters of all the DF's investigated were estimated andvectors of the total sample (excluding vectors of the learningsample) were classified. The expected PMC was estimated asthe average of error rates (conditional PMC) in 10-100 runs.The results of experiments with real data are presented in

upper rows of Table II. In lower rows the value of H, cal-culated analytically for the corresponding values of N, p, and6 (for spherical population), are given. The real data used inthis experiment were not specially selected and they differfrom the spherically normal ones rather significantly [24].As seen from Table II, in most cases the difference betweenexperimental and analytical values of H was not great. Anexception is made by data 3 where Nm is comparatively smalland due to the nonnormality of data the asymptotic PMC forthe quadratic DF was six times greater than that for thelinear DF.

IV. DISCUSSIONComplex analytical formulas for the expected PMC EPN as

a function of N, p, and 6 do not show the relationship be-tween these four factors.

In qualitative analysis the asymptotic expressions for EPNare very useful. We have investigated the asymptotics whenthe learning sample size N and dimensionality p tend toinfinity simultaneously. Then

EPN=+( 6I5 (7)

where the coefficient az depends on the type of a classificationrule. The expressions for a. for four classification rules investi-gated are presented in Table III [15]. The fifth DF in thattable is designed for the classification into two normal popula-tions with independent measurements

g(X) (X - 2 )Y DT" 2! (X - X-2) (X - l)1 1 1(A

+Iln ,2 (8)1D11

244

RAUDYS AND PIKELIS: CLASSIFICATION ALGORITHM IN PATTERN RECOGNITION24

TABLE 1THE VALUES OF THE RATIO H = EPN/ P. AS A FUNCTION OF LEARNING SAMPLE SIZE N, DIMENSIONALITY P AND

MAHALONOBIS DISTANCE 45

DF 4 DF 3 DF 2 DFPi

__1.68 2.56 3.76 4.65 5.50 1.68 2.56 3.76 4.65 5.50 1.68 2.56 3.76 4.65 5.50 1.68 2.56 3.76 4.65 5.50 I

1.56 1.73 2.02 2.42 2.99 1.66 2.07 3.49 5.81 11.7 2.00 3.26 8.27 20.6 59.3 2

1.41 1.47 1.64 1.87 2.17 1.46 1.69 2.34 3.39 5.43 1.64 2.22 4.17 8.00 17.9 31.21 1.22 1.30 1.39 1.51 1.21 1.33 1.56 1.87 2.32 1.31 1.50 2.00 2.74 4.07 1.67 2.23 4.05 7.72 17.1 61.08 1.09 1.12 1.15 1.19 1.10 1.13 1.20 1.28 1.37 1.12 1.17 1.30 1.45 1.66 1.25 1.33 1.61 2.24 2.90 15

~1.04 1.04 1.06 1.07 1.09 1.05 1.06 1.09 1.13 1.16 1.06 1.08 1.14 1.20 1.27 1.12 1.14 1.23 1.38 1.63 301.02 1.02 1.03 1.04 1.05 1.03 1.03 1.05 1.06 1.09 1.03 1.04 1.07 1.09 1.13 1.05 1.06 1.10 1.17 1.27 601.01 1.01 11O1 1.01 1.02 1.01 1.01 1.02 1.03 1.03 1.01 1.02 1.03 1.04 1.05 1.02 1.02 1.04 1.06 1.10 150

1.52 1.66 1.91 2.18 2.54 1.60 2.03 3.05 4.61 7.92 2.11 3.64 9.90 25.7 76.1 31.42 1.50 1.66 1.83 2.06 1.52 1.75 2.35 3.22 4.78 1.80 2.65 5.62 11.9 28.9 41.35 1.40 1.51 1.64 1.80 1.41 1.58 2.04 2.54 3.50 1.64 2.21 4.01 7.37 15.4 5

LC\1l.19 1.20 1.24 1.29 1.36 1.22 1.27 1.40 1.54 1.89 1.32 1.51 2.00 2.66 3.78 1.74 2.35 4.16 7.46 15.5 10111.08 1.08 1.09 1.11 1.13 1.10 1,11 1.16 1.22 1.26 1.13 1.18 1.31 1.45 1.64 1.31 ~1.40 1.65 2.00 2.60 25

~1.04 1.04 1.05 1.06 1.07 1.05 1.06 1.08 1.10 1.13 1.06 1.09 1.14 1.20 1.27 1.15 1.18 1.26 1.37 1.55 501.02 1.02 1.02 1,03 1.03 1.02 1.03 1.04 1.05 1.06 1.03 1.04 1.07 1.10 1.13 1.07 1.08 1.12 1.16 1.23 1001.01 1.01 1.01 1.01 1.01 1.01 1.01 1.02 1.02 1.03 1.01 1.02 1.03 1.04 1.05 1.03 1.03 1.04 1.07 1.08 250

1.46 1.57 1.73 1.88 2.08 1.56 1.80 2.35 3.14 4.39 2.04 3.38 8.57 20.9 58.2 51.41 1.48 1.60 1.71 1.86 1.47 1.66 2.11 2.67 3.44 1.85 2.78 6.00 18.8 30.9 6

co 1.33 1.37 1.44 1.51 1.61 1.37 1.52 1.76 2.09 2.50 1.63 2.19 3.87 6.84 13.5 8II 1.18 1.19 1.21 1.24 1.28 1.21 1.25 1.34 1.46 1.65 1.32 1.51 1.97 2.59 3.58 1.80 2.46 4.41 7.26 14.0 16

1.0k8 1.07 1.08 1.09 1.11 1.09 1.10 1.13 1.16 1.21 1.13 1.18 1.31 1.45 1.63 1.38 1.50 1.74 2.06 2.55 401.04 1.04 1,04 1,05 1.05 1.05 1.05 1.06 1.08 1.10 1.07 1.09 1.14 1.20 1.27 1.20 1.23 1.30 1.40 1.55' 801.02 1.02 1.02 1.02 1.03 1.02 1.02 1.03 1.04 1.05 1.03 1.04 1.07 1.10 1.13 1.10 1.11 1.14 1.18 1.23 160

___1.01 1.01 .101 1.01- 1.01 1.-01 1.01 1.01 1.02 1.02 L 1.01 1.02 1.03 1.04 1.05 1 .04 1.04 1.05 1.06 1.08L 400

DF 4 DF 3 !)P 2 DFPi

1 1.68 2.56 3.76 4.65 5.50 1.68 2.56 3.76 4.65 5.50 1.68 2.56 3.76 4.65 5.50 1.68 2.56 3.76 4.65 5.50 N

1.46 1.57 1.70 1.82 1.95 1.53 1.74 2.14 2.68 3.45 2.12 3.64 9.71 24.6 70.6 71.36 1.41 1.48 1.54 1.62 1.42 1.49 1.72 1.98 2.42 1.75 2.50 4.88 9.41 20.2 101.32 1.35 1.39 1.44 1.50 1.36 1.44 1.61 1.82 2.05 1.63 2.18 3.77 6.51 12.4 121.18 1.18 1.19 1.21 1.23 1.20 1.23 1.28 1.35 1.45 1.33 1.51 1.96 2.54 3.46 1.85 2.79 4.56 7.64 14.0 24

c\J1.08 1.07 1.07 1.08 1.09 1.09 1.09 1.11 1.13 1.16 1.13 1.19 1.31 1.45 1.62 1.46 1.62 1.89 2.20 2.66 60

II1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.05 1.06 1.08 1.07 1.09 1.14 1.20 1.27 1.26 1.29 1.37 1.46 1.59 120P.

1.02 1.02 1.02 1.02 1.02 1.02 1.02 1.03 1.03 1.04 1.03 1.04 1.07 1.10 1.13 1.14 1.14 1.17 1.20 1.25 2401.01 1.01 1.01 1.01 1.01I 1.01 1.01 1.01 1.01 1.01 I1.01 1.02 1.03 1.04 1.05 1.05 1.05 1.06 1.07 1.09 .6001.56 1.75 1.94 2.09 2.25 1.63 1.92 2.41 2.89 3.72 81.44 1.53 1.61 1.68 1.76 1.48 1.64 1.87 2.10 2.41 2.06 3.44 8.73 21.1 57.4 ~ 32 for DF 1 121.36 1.41 1.45 1.49 1.54 1.40 1.49 1.63 1.78 1.98 1.78 2.56 5.06 9,76 20.9 161.31 1.30 1.36 1.39 1.42 1.34 1.39 1.49 1.57 1.74 1.63 2.16 3.69 6.22 11.4 2.06 3.24 6.90 13.5 28.5 20*1.18 1.17 1.18 1.18 1.20 1.19 1.21 1.23 1.27 1.32 1.33 1.51 1.95 2.50 3.36 1.96 2.88 5.22 8.80 15.8 401.08 1.07 1.07 1.07 1.08 1.08 1.08 1.09 1.10 1.13 1.14 1.19 1.31 1.44 1.62 1.58 1.82 2.19 2.55 3.05 1001.04 1.04 1.03 1.04 1.04 1.04 1.04 1.05 1.06 1.07 1.07 1.09 1.15 1.20 1.27 1.35 1.42 1.50 1.60 1.72 2001.02 1.02 1.02 1.02 1.02 1.02 1.02 1.02 1.03 1.03 1.03 1.05 1.07 1.10 1.13 1.20 1.20 1.23 1.26 1.30 4001.01 1.01 1.01 1.01 1.01 1.01 1.01 1.01 1.01 1.01 1.01 1.02 1.03 1.04 1.05 1.08 1.08 1.08 1.09 1.10 1000

1.82 2.34 3.09 3.66 4.22 1.84 2.50 3.66 4.85 6.64 81.70 2.C13 2.41 2.65 2.87 1.73 2.14 2.75 3.30 3.93 '180 for DF 1 121.54 1.70 1.84 1.92 1.99 1.56 1.77 2.01 2.21 2.43 201.43 1.50 1.55 1.58 1.61 1.48 1.54 1.65 1.74 1.84 2.05 3.39 8.40 19.7 52.0 30

?~1.30 1.32 1.33 1.34 1.35 1.31 1.34 1.38 1.42 1.46 1.62 2.15 3.61 5.95 10.6 2.21 3.25 7.87 18.3 40.6 50*Il 1.18 1.17 1.16 1.16 1.17 1.18 1.18 1.19 1.20 1.25 1.33 1.51 1.93 2.47 3.27 2.13 3.12 7.10 13.1 25.1 100

1.08 1.07 1.06 1.06, 1.06.1.08 1.07 1.07 1.08 1.08 1.14 1.19 1.31 1.44 1.61 1.81 2.35 3.23 4.03 5.05 2501.04 1.03 1.03 .103 1.03 1.04 1.04 1.04 1.04 1.05 1.07 1.09 1.15 1.20 1.27 1.58 1.78 2.01 2.18 2.37 5001.02 1.02 1.02 1.02 1.02 1.02 1.02 1.02 1.02 1.02 1.04 1.05 1.07 1.10 1.13 1.37 1.42 1.47 1.51 1.56. 10001.01 1.01 1.01 1.01 1.01 I1.ul_1.01 1.01 1.01 1.01 1.01 '1.02 1.03 1.04 1.05. 1.18 1.16 1.18 1.18 1.20 2500

245

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 3, MAY 1980

TABLE IICOMPARISON OF THEORETICAL AND EXPERIMENTAL VALUES OF THE RATIO

X = EPN/PO

Data 1 Data 2 Data 3 Data 4P = 5,Nm = 500 p = 5,Nm = 600 P = 32,Nm = 300 P = 6,Nm = 600

N:o\ Nof 12 20 50 12 20 50 50 100 12 20 50DF

1 1.97 1.53 1.26 3.36 3.58 1.45 4.2 1.54 1.47 1.31 1.272.62 1.68 1.21 3.15 1.86 1.25 28.6 6.87 1.70 1.35 1.15

2 1.44 1.32 1.09 2.01 1.63 1.32 13.6 6.42 1.23 1.20 1.101.67 1.30 1.12 1.70 1.32 1.12 4.51 2.55 1.32 1.13 1.07

3 1.20 1.18 1.07 1.29 1.12 1.09 1.44 1.19 0.99 1.02 1.001.26 1.16 1.07 1.30 1.18 1.08 1.29 1.14 1.20 1.12 1.05

4 1.12 1.09 1.04 1.18 1.14 1.10 1.11 1.00 0.98 1.01 1.001.14 1.09 1.04 1.19 1.10 1.04 1.18 1.08 1.13 1.09 1.04

TABLE IIITHE QUALITATIVE CHARACTERISTICS OF FIVE CLASSIFICATION

RULES COMPARED

Parameters of Number of Estimated Parameters Min. N toPopulations Estimate the Number of

Estimated From Corr. Parameter Coefficients Coefficient inDF Samples Means Variances Coef. Total of DF of DF Expression (7)

p(p+3) (62+pr2+p21 XI, X2, S1, S2 2p 2p p(p - 1) p(p + 3) p 2 1 +X22 (N p)

2 2p p(p -1) p(p+5) + 1 +2pXl' 2, p p 2 2 2 pN, 2 \2N

2p

4 X1,X2 2p 2p 1 p+1 + 6N62~~~~~~~~~~~~~~~~~~~~~~~~4p5 xI)X25 19 2 2p 2p 4p 2 2p+ 1 1N62

where SD1, -D2 stand for sample estimates of diagonal disper-sion matrices.The numerical values of the ratio H = EPN/PO and asymp-

totic formulas for the expected PMC show that the classifica-tion performance EPN and the learning quantity H dependmainly on the asymptotic PMC, ratio Nlp, and the complexityof a classification rule. For small values of Nlp the classifica-tion error increases considerably, e.g., 8.8 times if p = 20,N = 40. Po. = 0.01, and quadratic DF (1) is used. In this casethe increase in the sample size N from 40 to 400 may reducethe classification error from 0.088 to 0.0126, i.e., approx-imately seven times. Since the learning quantity depends onthe DF type, the learning sample size must be determinedindividually for each discriminant function. This is clearlyseen from Fig. 1, where the curves that show the dependenceof the learning sample size N on dimensionality p for fixedasymptotic and expected probabilities of misclassification, arepresented. Fig. 1 and the asymptotic formulas from Table IIIconfirm that there is a direct relationship between N and p:for classification rules (2)-(4) and (8) the relationship is linear,for DF (1)-quadratic (only for large values of p). This con-

clusion was pointed out earlier by A. Deev [13] for thestandard linear DF (2) and experimentally obtained by Pip-berger [26] for the quadratic DF.The values for the learning quantity H, presented in Table I,

the graph, similar to Fig. 1, allow us to estimate the learningsample size, sufficient to achieve the desired accuracy ofestimating the DF coefficients. For example, let the numberof features p = 12 and suppose the asymptotic PMC PO > 0.01.Then from Table I we find the requirement that the relativeincrease in the classification error should not exceed 1.5 times,is fulfilled when we have 120 learning vectors from each classfor quadratic DF (1), 60 for standard linear DF (2) and only12 for Euclidean distance classifier (4).Another problem, caused by finiteness of the learning

sample size, concerns the classification rule type [27]. InFig. 2 the curves that show EPN versus N for two values ofasymptotic PMC PO. are presented. The expected PMC de-creases in an exponential way and the decrease rate dependson the complexity of a classification rule.In the case when the asymptotic PMC's for all classification

rules are the same, simple classification rules are preferable.

246

RAUDYS AND PIKELIS: CLASSIFICATION ALGORITHM IN PATTERN RECOGNITION

Fig. 1. Learning sample size N versus dimensionality P for DF's (1)-(4).

Fig. 2. Expected PMC EPN versus learing sample size N for DF's (1)-(4) and two different values of asymptotic PMCPW and p(2).

If the samples are Gaussian with different nondiagonal co-variance matrices the least asymptotic PMC corresponds to aquadratic DF. Therefore, for a very large learning samplesize this particular classification rule is preferable (curve 1'in Fig. 2). For small learning sample sizes (especially whenN insignificantly exceeds dimensionality p) the expected PMCof the quadratic DF tends to 2 (for equal prior probabilitiesfor the classes). In such situations simpler classification rulesmust be used instead of complex quadratic ones, or one shouldreduce the number of features. Graphically, the effect men-tioned is illustrated by Fig. 2-when the learning sample size issmall then a simple structured DF can do better than the morecomplex structured DF. The effect mentioned causes theproblem of choosing the type of a classification rule in accor-dance with the learning sample size [27]. This problem re-ceived considerable attention in Soviet scientific literature[28] - [31] .

Analysis of the representations of the conditional PMC interms of simple statistics (e.g., see Appendix, part A) shows aninteresting property of PMC: in the asymptotic considered(N -+ oo and p -+ oo) in spite of the fact that the expected PMCremains constant, the variance VPN of the conditional PMCPN decreases. The formulas for VPN have not been derivedyet. For finite values of p, N the variance VPN may be esti-mated by means of simulation (see the Appendix).Numerical values for the ratio H = EPN/P (or the expectedPMC EPN = H * P.) may be useful not only for determiningthe learning sample size and choosing the classification ruletype, but also for determining the optimal number of measure-ments [25]. The results obtained may be used to verify the

accuracy of approximate expressions for misadjustment of anadaptive least-mean-square algorithm derived by Smith [32],asymptotic expansions and approximated expressions for theexpected PMC [3], [121 - [16], e.g., by using Table I wefound that Deev's asymptotic expansion is much better thanthe well-known Okamoto's expansion (see [33]), especiallyfor small values ofp and N.

APPENDIXFORMULAS FOR THE EXPECTED PMC

A. Euclidean Distance Classifier /20], [221The expected PMC EPN may be expressed in the following

way:

EPN = q,P(g(X) <0 X Hl) + q2P(g(X)>0 IXEH2).

(A.1)In formula (A.1) ql, q2 stand for prior probabilities of the

populations, and the probability P(.) is considered with re-spect to random observation X and random parameters of theDF (Xl and X2 in the case of an Euclidean distance classifier).John [34] represented the DF (4) as the difference of twoindependent noncentral chi-square random variables

g(X, Xl, X2 IX H,i) = (-l)+' svN 1/2(XA,2,P - XA2,P)where

X1=N2/4(l - )i )2,6 2 = (i -t2)_(A2 -it2).

In order to calculate (A.1) we used an inversion formula ofImhof [35] for the distribution function of a quadratic formof normal variables

p4tE aIiI sin1(u)P{Z1:aiX2 h<a}0 ="Fn upu du (A.2)

where

0(u) = 2 E[hi arc tg(aiu) + Xk aiu(l +a u2)1] - au/2,

f(u) H {(1 + 2)hi/4} .XeupI1=1 a1u, exPt~2~ 1 +a?u

and expressed the expected PMC in the form of the followingintegral [20]:

I I r0 sin {uk162/(I +U2)}PN 2 II J (I + U2)p/2 exp {uk262(l I+ u2)} ?kI =N /2 2N+l, k2 =N(N+ 1)/(2(2N+ 1)). (A.3)

Formula (A.3) may be used for numerical calculation and tab-ulating EPN in the case of spherically normal population. In ageneral case of normal distributions with the common covari-ance matrix z an analogous expression to (A.3) is derived[20]. Numerical investigations have proved that relativelyhigh accuracy may be obtained in formula (A.3) using themodified parameters

= [(JIi - /2)'(it1 - 2)] 2 tr 2

[(KI1 - IA2)' (IAI 142)]2

247

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 3, MAY 1980

* =(JAI - /A2)FQ11 - A%2)(IAI - J2)'F(IA1 -IA2)

instead ofp and 6.Depending on E and iul - A2 may be p* < p or p* > p;

therefore, the expression of p* explains the influence of dis-persions and correlations on the sensitivity of the Euclideandistance classifier to the learning sample size in an explicit way.Analysis of the conditional PMC shows some interesting

which 6 = Vup -,U2)(/A1 1L2) is constant and the dimen-sionality increases, the learning sample size N, required to keepthe expected PMC EPN constant, increases linearly with p.The variance VPN, however, tends to zero.

B. The Linear DF[22JThis DF has been studied by many investigators. We used a

computation algorithm which is based upon the formula ofSitgreaves [10]:

r (N 1-) e- 2N(N+1)XA/2N+lJ

EP=rN- Pr(P- 1 r(1 j,kl,m,r=o

(-2)1(N;k2)j k+l+m+r

2(N+ ~~+ m+ r

l2N+ 1)2 +-+k

r(P - +k+m)(Nr - +j)+ r(+ r+)( + + ) I

r(P +j k l+m +r r(N- +j+ +m j!k!l!m!r! j

properties of the Euclidean distance classifier. In the case theclassifying into two spherically normal populations the condi-tional PMC is

22 {(l)l i[ - (X1 +X2)] (X1 X2)}

/(X1 - X2)(X1 - X2)

With the help of standard random orthogonal transformationsY = 2 (Xl + X2 - ,Al -u2)G where G is an orthogonal randomp X p matrix such that (Xi - X2) G = ((X1 - X2) (X1 - X2),0,0,.*, 0) it may be shown that vector y is distributedN(O, I) independently of X1 - X2 and the conditional PMC isdistributed as a function

2= { Q( 1 + 6)\N-2+X -1 246(~1+vWj~2 N

(A.4)

where tI, t2, and Xp are independent and t~jN(0, 1).Representation (A.4) was originally obtained by Blago-

veschensky (personal communication, 1970) and may be usedto obtain the asymptotic expansion for EPN and to estimatethe distribution function of the conditional PMC by means ofsimulation. In the latter case m sets of random variables 1i,j

2t2,j, Xp-_,j are generated and each time the conditional PMCPN,i is computed. The distribution function of PN is esti-mated from the sample PN,1, PN,2, * - PN,m Analogousexpressions for classification rules (1)-(3) are obtained inpapers [13], [21], [23].Representation (A.3) expresses the main properties of the

distribution of the conditional PMC. In the series of models in

where X = 6/2.This formula was transformed by Estes [5] to a form, con-

taining only two sums, but then the confluent hypergeometricfunctions I F, (-) appeared as

EPN1 2(N2)F(N 2+2)N2

2 (N pF\( )

0 F( - 2+r)r((2r) r(2)r= r(2 +2-r r(+ r +r )

/0 N2 2 / 3 N22

(2N221)F(r+±,i+r,-2j ' 2J)

N (N) (NX)2r!r(§+2+ i )

r(N + i) r ! (r - i)! rp(+ 1 + r + i)

F1 (I+ P 1++r+i, NX2)}

Estes suggested a doubly recurrent method to compute 1F1 -)values, which needed only four initial values of 1F, (). Thisresulted in a fast, but not exact algorithm for computing thevalues of EPN. The reason of poor exactness (for somep, N, 6) in this algorithm was hidden in error propagation inrecurrent computations (indicated in [5] ). Our investigations

248

RAUDYS AND PIKELIS: CLASSIFICATION ALGORITHM IN PATTERN RECOGNITION

showed that the influence of the above-mentioned errorpropagation became negligible when we used a simple recur-rent method for computing the values of IF,(). So we hadto use R + 1 initial values of IF, instead of four (R dependson p, N, 6, practically R = 10-20). The algorithm that weused for computing EPN may be expressed by the formula

rS I p) t I*)

r P(N t)r! r [ei+2+i

r-j j-r+i (p\

* I 2 a2 221E oa=

K= 2NN

and all Li, Ui, x2 are independent. From (A.7) we see that thedistribution of g(X) depends on p, N, and on a vector param-eter {Ali - 92i/2dil}. The distribution of g(X) remains thesame if we change the order of summation in (A.7). Wesuppose that the components of {Ili - ,U2j/N} are orderedin a reducing manner,

i<j.

Since

/ P (li - b2i) 6VE-= 5

is the Mahalanobis distance between classes, we see that distri-bution of the DF depends on p, N, 6 and on the reduction law(RL) of the components of {Illi - P2i /VUi}.In order to compute the PMC of DF (3) we must find the

probability

P(g(X) <0 X H1).

f1

C. A Linear Classifierfor IndependentMeasurements [211This classifier is constructed according to the assumptions of

normality, equality of CM in both classes, and independenceof measurements. The assumption on the independence ofmeasurements leads to some modifications in the estimation ofCM (only the diagonal elements are estimated), so the esti-mated CM is no longer distributed according to Wishart's law.The result is that standard techniques of random orthogonaltransformations are not helpful.Suppose that the assumptions on normality, equality of CM,

and independence of measurements are valid. Then thediagonal sample CM may be represented as

ED-1 =KOXU2 D-1 = KoD -1/2X-2 fiY1/2

where is the diagonal CM of populations, x-2 is a diagonalrandom matrix, and Ko is a scalar coefficient. The elementsof X-2 are independent and distributed as (X2N 2So the DF is distributed as a sum

g(X) = , Xi,L1Ui(A.7)

where

Li-,N2N Al l'92 I

\2 2N+ 1

Ui~~ ~Fd1 , )

xO are distributed as central x2 with 2N - 2 degrees offreedom,

The straightforward method of finding the distribution func-tion of (A.7) gives very complicated formulas; therefore, weused an approximation method.The statistics (A.7) has a finite number of finite moments.

Our investigation showed that good results are obtained if thedistribution ofg(X) is approximated by the distribution of

L = [(a + t)2 b - (c + 77)2 ] d(X2)-I, (A.9)

where t and 77 are normally N(0, 1) distributed variables, x2has a x2 distribution with n degrees of freedom; #, q, and x2are independent. The coefficients a, b, c, d are positive, n is aninteger, n > 2N - 2. It is seen that

P(L < 0) = P(U<0) where U = (a + t)2 b - (c + r7)2.

While computing probability (A.8) the integer n is chosen insuch a way that it were minimal and the first four cumulantsof U were positive. Then a, b, c, d are found fitting fourcumulants, and probability P(U< 0) is computed. We used an

approximate formula

P(U<0); 1 erfa

+erf ]

2 12~b+2 V2b+~2J

which is based on a2+ c2> 3 almost always.

The accuracy of approximation formulas is usually unknown.We used Monte Carlo techniques to define this accuracy. OurMonte Carlo method is based on expression (A.7). Assumingthat learning (evaluating of $ and Xi) is already made, theresult of recognition may be computed from the analyticexpression

(A.8)

249

Igii p2il Igii 112il-\Id-i :Pa- Nfd-i

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 3, MAY 1980

EPN=q5(8E[g(X)X1,X2,D ]I), XEH1,E[g(X) lxl,)X2, D]

E[g(X) X1,X27,] =Kl -+Zl X +ZAV[g(X) IX1, X2, ] =K2 (M + Z2)'2X2(m + Z2),

Z, -N(O, I), m = g -l/2(81 - I2),

2N- 2K,=NI

/1Xn

x-2 =

1

Xn

2(2N- 2)22 N

1X2Xn

n=2N- 2,

(A.1 0)

Numerical investigations showed that the error of approxima-tion tends to zero as N and/or p increase. We used approx-imation (A.9) as the main method of computation.The main difficulties in tablulating the values of H = EPN/P.

for DF (2) arise due to the vector parameter RL. The in-fluence of RL on H increases if 6 is large, and p and N aresmall. The value of H may change 50 percent when 6 = 6.2,p = 8,N= 8, and only RL is not constant.After the investigation of real data we decided to tabulate

the ratio H for a standard RL, i.e., a linear RL, for which

1/1i - I2ill=lp (, = 1,2, -

If p and/or N are large, then the distribution of (1) tends tothe normal one, and it can be shown that the error of H due toRL does not exceed 10 percent ifN> 2.5 62 + 2.

D. Quadratic DF[23JIn the analysis we may assume , = I, X2 = D (diagonal ma-

trix) and Ml = 0 without loss of generality. Denote A1 =

(N1 - 1)Sj, A2 = (N2 - 1) D-1/2S2 D-1/2 and let G1 and G2be orthogonal p X p random matrices such that

(X - X2) G2G = (62,0°, O,°

where

5 I = (X - TlX)( - X-1),62 =(X-X2)'J9 l(X-X2).

Then DF (1) may be described in the form

(N2 - l)b"S62 - (Nl - l)bl 6 1 +ln +B2I+ KO2 1 ~~~~~IB1II

where

K q- NI I IP Iq2 \N2- 1I

B1 = G1A 1 G', B2 =G2A2G'/bil B12ABi1 -B2

I I/

The matrices A1 and A2 are independent and distributedW(Ni - 1, I). Since the elements of the random matrices GIand G2 are distributed independently of the elements of A1and A2, then from Lemma 1 of A. Bowker [11] it followsthat the elements of B1 and B2 are distributed both accordingto the Wishart law W(N1 - 1, I) and independently of theelements of G1, G2, and, consequently, of651 and 62.The determinant of the matrix

B1 = ( I. )

B21 B'2may be expressed [1]

IBil = Ib', - Bi2 (B22)2B1 * IB222 = (bl1)-l IBi2

where (b! )-1 and IBi2 are distributed independently [23]

b! 1 v p and IB'22 p-XVi-jlj=l

Therefore, the representation for DF (1) follows:

-- ~~~~~N2-p(X-pi2g(X,xl x2,S1,S2lXEH1)= XN2-1 iz x di

2 L (X U)2XN1-1 j=l

2

+Zln 2

1=1 XNI-l(A.ll)

where random variables are distributed independently

N(O, 1), X EHIl,

N(,u2j, dj), X E 112, 12 = (921, ' ** 2p),

Representation (A.1 1) is the key to obtain the representationof the conditional PMC in terms of simple statistics and toobtain an exact formula for the expected probability of mis-classification. The representation of the conditional PMC ofthe first kind, i.e.,

PN P{g(X) <OIX H,Xl,JX2,S 1, S2} (A.12)

is obtained by applying Imhof's formula (A.2) to calculate(A.12). The expected PMC of first kind is found accordingto the following scheme:

250

RAUDYS AND PIKELIS: CLASSIFICATION ALGORITHM IN PATTERN RECOGNITION

EPN=P{g(X,X1JX2,S1,S2)<OIXEll1}

Jf [fJ P{ (xi r'i, Uijxl-p = S, XN2_P =

2p-l XN2-i = wo )IZn 2 oXEHi)<01

*J(wo)dwo ]J(s)J(t) ds dt

where a random variable wo is assumed to be normal [numer-ical calculations for a wide range of p, N, 6 have shown thatthe variance of random variable wo is within 1 percent of thevariance of random variable (A.11)] .Applying Imhof's formula (A.2) to calculate P{g(-) <O} and

after the following integration according to wo we obtain

REFERENCES

[1] T. W. Anderson, An Introduction to Multivariate Statistical Anal-ysis. New York: Wiley, 1958.

[2] S. John, "Errors in discrimination," Ann. Math. Statist., vol. 32,pp. 1122-1144, 1961.

[3] M. Okamoto, "An asymptotic expansion for the distribution ofthe linear discriminant function," Ann. Math. Statist., vol. 34,pp. 1286-1301, 1963.

[4] G. F. Groner and B. Widrow, "Preprocessing of information forpattern recognition," in Summaries of Papers, Int. Conf: Micro-waves, Circuit Theory, and Inform. Theory, Tokyo, Japan, Sept.7-11, 1964, p.3.

[51 S. E. Estes, "Measurement selection for linear discriminant usedin pattern classification," Ph.D. dissertation, Stanford Univ.,Stanford, CA, 1965.

[6] P. A. Lachenbruch and M. R. Mickey, "Estimation of error ratesin discriminant analysis," Technometrics, vol. 10, no. 1, 1968.

[7] G. J. McLachlan, "Estimates of errors of misclassification on thecriterion of asymptotic mean square error," Technometrics, vol.16, no. 2, pp. 255-260, 1974.

[8] V. E. Troicky, "Probabilities of an error in linear discrimination,"in Proc. Moscow Electrotech. Inst. Commun. (in Russian), Mos-cow: Communication, 1968, pp. 106-113.

EPN= 1_ 11Er (S)(Nl-p-2/2)(t)(N2-p-2/2)e-(S+t/2)PN 2 nJO JO NO -r(N2P)p(NP)QN, +N2 - 2p

sin {- ac[ tg(hii U)+hi- ]+ U [lnt+ m +ko]}

eVwU2Un H{12+hp2U2}l4expS£ (hSi,U)2 }1=12 21=111

where

Si = -('ij - l)g2jl/ij, h1ij= a2ijl/ [(1 - 'Yij)(l - zYilzkj)]k=3- i,

I=1 + (1 -_yi)2dj/N2 + yhI/Ns,t21 = (N2 - l)/(tdi), ail = (1 - N,)IS,Yij = (sj + aZ1 + a2j + (-Q1)iV( + + a2) -j

N02jdi NL1j

"' N22 p-1

mw i=1E E"i-1 j=i

2 p-1 °°VW=4> E £ (N -j+2+2k)2.

i=1 j=1 k=O

When the covariance matrices of the populations E1 and 12are different then the individual classification errors of kinds Iand II are unequal. The analysis of asymptotical formula, sim-ilar to formula (7), has shown that if asymptotical PMC ofkind I PI is less than the asymptotical PMC kind II PI thenEJT/PIO>pzEPII/PII. The exact values of the expected PMCof the second kind may be calculated by using formula (A.3),substituting , and ![ with , S-T and T -

.

M. A. Moran, "On the expectation of errors of allocation asso-

ciated with a linear discriminant function," Biometrika, vol. 62,pp, 141-148, Jan. 1975.R. Sitgreaves, "Some results on the distribution of the W-classifi-cation statistic," in Studies in Item Analysis and Prediction.Stanford, CA: Stanford Univ. Press, 1961, pp. 241-26 1.A. H. Bowker, "A representation of Hotelling's T2 and Ander-son's statistic W in terms of simple statistics," in Contributions toProbability and Statistics. Stanford, CA: Stanford Univ. Press,1960, pp. 142-150.A. H. Bowker and R. Sitgreaves, "Asymptotic expansion for thedistribution of the W-classification statistics," in Studies in ItemAnalysis and Predication, H. Solomon, Ed. Stanford, CA: Stan-ford Univ. Press, 1961, pp. 293-311.A. D. Deev, "Asymptotic expansions for distribution of statisticsW, M, Wl in discriminant analysis" (in Russian), in Statist.Methods of Classification, Yu. Blagoveschenskij, Ed. Moscow:Moscow Univ. Press, 1972.P. A. Lachenbruch, "On expected probabilities of misclassifica-tion in discriminant analysis, necessary sample size and a relationwith the multiple correlation coefficient," Biometrics, vol. 24,no. 4, pp. 823-834, 1968.S. Raudys, "On the amount of priori information in designingthe classification algorithm" (in Russian), Proc. Acad. Sci. USSR,Tech. Cybern., no. 4, pp. 168-174, 1972.I. S. Enukov, "Choice of a set of measurements with maximaldiscriminating power in the case of limited learning sample size"(in Russian), in Multivariate Statist. Analysis in Soc.-Econ. Re-search. Moscow: Nauka, 1974, pp. 394-397.A. K. Jain and W. G. Wailer, "On the optimal number of featuresin the classification of multivariate Gaussian data," Dep. Comput.Sci., Michigan State Univ., Tech. Rep. TR 77-04.S. Marks and 0. J. Dunn, "Discriminant functions when covari-ance matrices are unequal," J. Amer. Statist. Ass., vol. 69, pp.555-559, June 1974.

du ds dt, (A.14)

251

2 Ni i 00 I(- ol + F,N.-i 2 K(K + (Ni 1)/2) '

[1411 K=l

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-2, NO. 3, MAY 1980

1191 J. Van Ness and S. Simpson, "On the effects of dimension indiscriminant analysis," Technometrics, vol. 18, pp. 175-187,May 1976.

[20] S. Raudys, "Determination of the realization quantity of therecognizable objects for the formation of the recognition systemsdecision rule" (in Russian), in Automatic Input of the Writtenand Printed Characters into Computers, Proc. 2nd All-UnionConf., A. Nasliunas, Ed. Vilnius, 1969, pp. 71-81.

[211 V. S. Pikelis, "The error of a linear classifiers with independentmeasurements when the learning sample size is small" (in Rus-sian), in Statist. Problems of Control, issue 5. Inst. Math. andCyb. Print, Vilnius, 1973, pp. 69-101.

[22] S. Raudys and V. Pikelis, "Tabulating of the probability of mis-classification for a linear discriminant function" (in Russian), inStatist. Problems of Control, issue 11. Vilnius, 1975, pp. 81-120.

[23] S. Raudys, "The error of classification of a quadratic discrim-inant function" (in Russian), in Statist. Problems of Control,issue 14. Vilnius, 1976, pp. 33-38.

[24] S. Raudys, V. Pikelis, and K. Juskevicius, "Experimental compar-ison of thirteen classification algorithms" (in Russian), in Statist.Problems of Control, issue 11. Vilnius, 1975, pp. 53-63.

[25] S. Raudys, "Limitation of sample size in classification problems"(in Russian), in Statist. Problems of Control. Vilnius, 1975, pp.5-183.

[26] H. V. Pipberger, "Computer analysis of electrocardiograms," inClinical Electrocardiography and Computers, C. A. Caceres andL. S. Dreifus, Eds. New York: Academic, 1970, pp. 109-119.

[27] S. Raudys, "On the problems of sample size in pattern recogni-tion" (in Russian), in Proc. HIAll-Union Conf Statist. Methods inControl Theory. Moscow: Nauka, 1970, pp. 64-67.

[28] -, "On the problem of selecting a classification rule" (in Rus-sian), in Statist. Problems of Control, issue 5. Vilnius, 1973.pp. 46-69.

[29] I. S. Enukov, "Choice of the decision rule in the case of limitedsample size" (in Russian), in Statist. Problems of Control, issue14. Vilnius, 1976, pp. 127-136.

[30] E. V. Troicky, "Investigation of robustness of classification algo-rithms" (in Russian), in Proc. VII All-Union Conf: on CodingTheory and Inform. Transmission, part VI. Moscow: Vilnius,1978, pp. 125-1 30.

[31] M. N. Libenson, "On the question of choosing the type of a dis-crimination function type in pattern recognition" (in Russian),in ProbbkmsofRandom Search, issue 6. Riga: Zinatne, 1978.

[32] F. W. Smith, "Small sample optimality of design techniques forlinear classifier of Gaussian patterns," IEEE Trans. Inform.Theory, vol. IT-18, no. 1, pp. 1 18-126, 1972.

[33] V. S. Pikelis, "Comparison of methods of computing the ex-pected classification error," Automat. Remote Control (USSRJournal), no. 5, pp. 59-63, 1976.

[34] S. John, "The distribution of Wald's classification statistics whenthe distribution matrix is known," Sankhya, Ind. J. Statist., vol.21, pp. 37 1-376, 1969.

[35] J. P. Imhof, "Computing the distribution of quadratic forms innormal variables," Biometrica, vol. 48, no. 3, pp. 419-426, 1961.

Sariunas Raudys, photograph and biography not available at the time ofpublication.

Vitalijus Pikelis, photograph and biography not available at the time ofpublication.

Correspondence

A Least Mean-Squared Error Approachto Syntactic Classification

ALAN J. FILIPSKI

Abstract-The usual approach to numerical training in syntactic pat-tern recognition involves estimation of the production probabilitiesof stochastic context-free grammars. Here we describe a differentapproach, namely that of finding an LMSE discrininant hyperplanebetween sets of class samples in a space of "structural indices" deter-mined by a context-free grammar.

Index Terms-Discriminant grammars, least mean-squared error, sto-chastic grammars, structural indices, syntactic pattern recognition.

Manuscript received March 15, 1979; revised August 29, 1979. Thiswork was supported by the National Science Foundation under GrantMCS 77-22095.The author is with the Department of Mathematics, Arizona State

University, Tempe, AZ 85281.

I. OBJECTIVES

In grammatical pattern recognition it is often useful to at-tach numerical information to each production rule of thegrammar being used. Given a sentence to classify, we firstparse the sentence, i.e., determine which production rules wereused to generate that sentence, and then use some function ofthe associated numerical values to classify the sentence. Anexample of this general approach is the use of stochasticcontext-free grammars in pattern recognition where the nu-merical values are probabilities, the function applied is multi-plication, and classification is done by comparing probabilitiesfrom different parses. In this case, determination of thenumerical values from training samples thus becomes a prob-lem of probability estimation.This paper presents another method using discriminant

grammars which combine the numerical values in an additiveway. A decision is then made by thresholding the sum. In thisformulation, a single grammar is used instead of one per class.It is shown that determination of the numerical values fromtraining data is equivalent to finding discriminant hyperplanes

0162-8828/80/0500-0252$00.75 O 1980 IEEE

252