Comput. & Maths. with Appls. Vol. 12A, No. 2, pp. 197-207, 1986. 0097-4943/86 $3.00 + .00. Printed in Great Britain. © 1986 Pergamon Press Ltd

PARAMETRIC AND KERNEL DENSITY METHODS IN DISCRIMINANT ANALYSIS: ANOTHER COMPARISON

B. J. MURPHY and M. A. MORAN
Department of Statistics, University College, Cork, Ireland

Abstract--Comparisons of parametric and nonparametric approaches to discriminant analysis have been reported in which the kernel density method was surprisingly superior to conventional methods such as the linear discriminant function under conditions of normality. The assumptions underlying these comparisons, particularly those of independence, and their implications for product kernel methods are critically examined. A new comparison is made allowing for correlation. It is found that for independent or modestly positively correlated variables, the kernel method is superior to conventional parametric methods unadjusted for the particular correlation structure. With other correlation structures, however, the kernel method behaves erratically and may give poor classification or poor estimation of odds or both, depending on the underlying parametric configuration.

1. INTRODUCTION

It is well known that if the number of variables in an estimated discriminant function is increased without a compensatory increase in the real separation between the populations concerned, a deterioration in classificatory performance results. An investigation of this effect by Van Ness and Simpson[1], in which the effects of increasing dimension on linear, quadratic and nonparametric discriminant functions were compared under classical normality assumptions, provided some surprising results. It was found that the nonparametric methods were stable with increasing dimension, and that they could outperform the parametric methods with as few as two variables and sample sizes as large as 20. Indeed, the nonparametric method gave a comparable performance to a linear discriminant function with known dispersion matrix.

In this paper the assumptions underlying the above investigation are reexamined. A new comparison is made and the implications of correlation between the variables for the nonparametric approach are investigated. Both classification and estimation of odds are considered.

2. PARAMETRIC METHODS

If the distribution of x in Π_t is p-dimensional normal with mean μ_t and covariance matrix Σ, i.e. N_p(μ_t, Σ), then

$$ f_t(x) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}\,\omega_t(x) \right\}, \qquad (2.1) $$

where

$$ \omega_t(x) = (x - \mu_t)'\,\Sigma^{-1}\,(x - \mu_t). $$

With equal prior probabilities for the two populations, the true log-odds in favor of x from Π_1 is

$$ LO_T = \tfrac{1}{2}\{ -\omega_1(x) + \omega_2(x) \} = (\mu_1 - \mu_2)'\,\Sigma^{-1} \left\{ x - \tfrac{1}{2}(\mu_1 + \mu_2) \right\}, \qquad (2.2) $$

corresponding to the optimal linear discriminant function[2]. With independent random samples x_{ti}, 1 ≤ i ≤ n_t, of size n_t available from Π_t, t = 1, 2, the unknown parameters μ_t and Σ are replaced by the conventional estimators x̄_t and S to give


an estimative density


$$ \hat f_t(x) = (2\pi)^{-p/2}\,|S|^{-1/2} \exp\left\{ -\tfrac{1}{2}\,w_t(x) \right\}, $$

where

$$ w_t(x) = (x - \bar x_t)'\,S^{-1}\,(x - \bar x_t). $$

The corresponding estimated log-odds is

$$ LO_E = \tfrac{1}{2}\{ -w_1(x) + w_2(x) \} = (\bar x_1 - \bar x_2)'\,S^{-1} \left\{ x - \tfrac{1}{2}(\bar x_1 + \bar x_2) \right\}, \qquad (2.3) $$

which, apart from a constant term, is the classical linear discriminant function[3]. Now LO_E has a substantial bias, which may be corrected for (see [4]) to give the unbiased

estimator

$$ LO_U = \frac{n_1 + n_2 - p - 3}{n_1 + n_2 - 2}\, LO_E \;-\; \frac{1}{2}\, p \left( \frac{n_1 - n_2}{n_1 n_2} \right). \qquad (2.4) $$

In the predictive or Bayesian approach[5], conventional noninformative prior distributions are assigned to μ_t and Σ, and a predictive density p_t(x) conditional on x̄_t and S is obtained, where

$$ p_t(x) = \frac{\Gamma\{\tfrac{1}{2}(m + 1)\}}{\Gamma\{\tfrac{1}{2}(m + 1 - p)\}} \left( \frac{c_t}{m\pi} \right)^{p/2} |S|^{-1/2} \left\{ 1 + (c_t/m)\, w_t(x) \right\}^{-\frac{1}{2}(m+1)}, $$

with m = n_1 + n_2 − 2 and c_t = n_t/(n_t + 1), t = 1 and 2. The corresponding predictive estimator of the true log-odds is then

$$ LO_P = \tfrac{1}{2}\, p \ln\left\{ \frac{n_1(n_2 + 1)}{n_2(n_1 + 1)} \right\} \;-\; \tfrac{1}{2}(n_1 + n_2 - 1) \ln\left\{ \frac{1 + (c_1/m)\, w_1(x)}{1 + (c_2/m)\, w_2(x)} \right\}. \qquad (2.5) $$

If observations are allocated to Π_1 or Π_2 on the basis of the sign of (2.3), (2.4) or (2.5), then, for equal sample sizes n_1 = n_2, the outcome will simply depend on which of w_1(x) or w_2(x) is smaller, and all three parametric methods will result in identical classifications.
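The three estimators are simple to compute from the training samples. The following is a minimal sketch in Python/NumPy of (2.3)-(2.5) for a single test vector; the function name and the array layout (one observation per row) are illustrative assumptions rather than anything prescribed in the paper.

```python
import numpy as np

def parametric_log_odds(x, X1, X2):
    """Estimative (2.3), bias-corrected (2.4) and predictive (2.5) log-odds for a
    test vector x, given training samples X1 (n1 x p) and X2 (n2 x p)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    m = n1 + n2 - 2
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled covariance estimator S with divisor n1 + n2 - 2
    S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / m
    Sinv = np.linalg.inv(S)
    w1 = (x - xbar1) @ Sinv @ (x - xbar1)
    w2 = (x - xbar2) @ Sinv @ (x - xbar2)
    LO_E = 0.5 * (w2 - w1)                                            # eq. (2.3)
    LO_U = (m - p - 1) / m * LO_E - 0.5 * p * (n1 - n2) / (n1 * n2)   # eq. (2.4)
    c1, c2 = n1 / (n1 + 1), n2 / (n2 + 1)
    LO_P = (0.5 * p * np.log(n1 * (n2 + 1) / (n2 * (n1 + 1)))         # eq. (2.5)
            - 0.5 * (n1 + n2 - 1) * np.log((1 + c1 / m * w1) / (1 + c2 / m * w2)))
    return LO_E, LO_U, LO_P
```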

3. THE PRODUCT KERNEL METHOD

Recent reviews of nonparametric density estimation have been given by Wegman[6] and Fryer[7]. The nonparametric estimator of a continuous p-dimensional density f, given by the kernel method, is

$$ k(x) = \frac{1}{n} \sum_{i=1}^{n} K_p(x - x_i), $$

where x_i, 1 ≤ i ≤ n, denote a random sample of n observations from f, h(n) > 0 is a smoothing parameter, and K_p(z) is a kernel function which is positive and integrates to 1. Conditions may be imposed on K_p and h which ensure that k is asymptotically unbiased and pointwise consistent in mean square error[8]. Among the various types of kernel functions considered by Cacoullos[8] was

$$ K_p(z) = \prod_{j=1}^{p} K(z_j), $$


or product kernel, giving

$$ k(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{p} K(x_j - x_{ij}). $$

The particular choice of kernel function K does not appear to be critical[9], provided the smoothing parameter h is properly chosen. A common choice of function[10, 11], and that used by Van Ness and Simpson[1], is

$$ K_p(x - x_i) = (2\pi)^{-p/2}\,|hM|^{-1/2} \exp\left\{ -\tfrac{1}{2}\, h^{-1} (x - x_i)'\, M^{-1}\, (x - x_i) \right\}, $$

or a p-dimensional normal density with covariance matrix hM, where M is a diagonal matrix. Thus K_p(z) = N_p(0, hM) is a product of p univariate normal densities. If it is assumed that the n observations have been standardised, one may take M = I, a p × p unit matrix; otherwise, M would be a diagonal matrix of estimated variances, as adopted by Habbema et al.[11].
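As an illustration, the following sketch (an assumed helper, not from the paper) evaluates the product normal kernel estimate with kernel N_p(0, hM) at a point x, with M taken to be diagonal.

```python
import numpy as np

def product_normal_kde(x, X, h, m_diag=None):
    """Product normal kernel density estimate at x: the average of N_p(x_i, hM)
    densities over the training rows of X; m_diag holds the diagonal of M
    (defaults to ones, i.e. M = I for standardised data)."""
    n, p = X.shape
    m_diag = np.ones(p) if m_diag is None else np.asarray(m_diag)
    norm = (2 * np.pi * h) ** (-p / 2) / np.sqrt(np.prod(m_diag))  # (2*pi)^{-p/2} |hM|^{-1/2}
    q = np.sum((x - X) ** 2 / m_diag, axis=1) / h                  # (x - x_i)' M^{-1} (x - x_i) / h
    return norm * np.mean(np.exp(-0.5 * q))
```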

The choice of a value for h, the smoothing parameter, is critical in the estimation process. If h is too small, the estimated density will have sharp peaks at the sample observations and be ragged elsewhere; if h is too large, excessive smoothing will occur resulting in bias. Several methods for choosing h have been suggested[7].

If f were known, h could be chosen to minimise the integrated mean square error

$$ \mathrm{IMSE} = \int \mathrm{MSE}(x)\, dx, $$

where

$$ \mathrm{MSE}(x) = E\left[ \{ k(x) - f(x) \}^2 \right] $$

is the mean square error of k(x) as an estimator of f(x) for fixed x. For f(x) = N_p(μ, Σ) and K_p(z) = N_p(0, hM),

$$ \mathrm{IMSE} = (4\pi)^{-p/2} \left[ |\Sigma|^{-1/2} - 2^{(p+2)/2}\,|hM + 2\Sigma|^{-1/2} + n^{-1}|hM|^{-1/2} + (n - 1)\,|hM + \Sigma|^{-1/2}/n \right]. \qquad (3.1) $$

This expression is valid for all positive definite matrices[9]. In practice, of course, f is not known, but we have found that the cross-validation method of choosing h proposed by Habbema et al.[11] gives values of h which approximate quite well to those obtained by minimising the IMSE.
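Since (3.1) is an explicit function of h, the minimisation can be carried out numerically. A minimal sketch, assuming SciPy is available; for p = 1, n = 10 and Σ = M = 1 it reproduces a minimiser near h = 0.575, the value quoted in Section 4.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def imse(h, n, Sigma, M=None):
    """IMSE (3.1) of the product normal kernel estimator with kernel N_p(0, hM)
    when the true density is N_p(mu, Sigma)."""
    p = Sigma.shape[0]
    M = np.eye(p) if M is None else M
    det = np.linalg.det
    return (4 * np.pi) ** (-p / 2) * (
        det(Sigma) ** -0.5
        - 2 ** ((p + 2) / 2) * det(h * M + 2 * Sigma) ** -0.5
        + det(h * M) ** -0.5 / n
        + (n - 1) * det(h * M + Sigma) ** -0.5 / n)

h_opt = minimize_scalar(imse, bounds=(1e-3, 50), method="bounded",
                        args=(10, np.eye(1))).x   # about 0.575 for p = 1, n = 10
```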

4. STANDARD CANONICAL FORMS AND LOSS OF GENERALITY FOR PRODUCT KERNEL ESTIMATORS

The distribution of x in Π_t has been assumed to be N_p(μ_t, Σ) for t = 1, 2, and the usual canonical form adopted in such cases is

$$ f_1(z) = N_p(0, I), \qquad f_2(z) = N_p(\theta, I), \qquad (4.1) $$

where I is a p × p unit matrix, all p elements of the vector 0 are zero, and all but the first element of the vector θ are zero.

As it was implicitly assumed in Van Ness and Simpson[1] and Van Ness[13] that comparisons involving product kernel density estimators could be based on such canonical forms without loss of generality, a closer examination is worthwhile.

A transformation which produces the canonical forms (4.1) is

$$ y = B(x - \mu_1), \qquad (4.2) $$

$$ z = Ay. \qquad (4.3) $$


The matrix B = D^{-1/2}C', where D is diagonal containing the eigenvalues of Σ and C is orthogonal with columns consisting of the corresponding normalized eigenvectors of Σ. The matrix A is orthogonal with its first row proportional to the elements of the vector B(μ_2 − μ_1). The product kernel density estimator

$$ k(x) = (2\pi)^{-p/2}\,|hI|^{-1/2}\, n^{-1} \sum_{i=1}^{n} \exp\left\{ -\tfrac{1}{2}\, h^{-1} (x - x_i)'(x - x_i) \right\} $$

becomes, under transformation (4.2) above,

$$ k(y) = (2\pi)^{-p/2}\,|hD^{-1}|^{-1/2}\, n^{-1} \sum_{i=1}^{n} \exp\left\{ -\tfrac{1}{2}\, h^{-1} (y - y_i)'\, D\, (y - y_i) \right\}. $$

It thus remains a product kernel density estimator and only involves simple scale adjustments. With the full transformation (4.2) and (4.3), however,

$$ k(z) = (2\pi)^{-p/2}\,|hAD^{-1}A'|^{-1/2}\, n^{-1} \sum_{i=1}^{n} \exp\left\{ -\tfrac{1}{2}\, h^{-1} (z - z_i)'\, ADA'\, (z - z_i) \right\}, $$

corresponding to use of a kernel function K_p(z) = N_p(0, hAD^{-1}A'), which is not a product kernel because AD^{-1}A' is not diagonal in general.
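A small numerical illustration of this point, under an assumed equicorrelated Σ and an arbitrary mean difference (none of these values come from the paper): the transformed kernel covariance has nonzero off-diagonal entries.

```python
import numpy as np

p, rho, h = 3, 0.5, 1.0
Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # equicorrelation matrix
delta = np.array([1.0, 0.5, 0.0])                      # mu2 - mu1 (arbitrary example)

evals, C = np.linalg.eigh(Sigma)                       # Sigma = C D C'
D = np.diag(evals)
B = np.diag(evals ** -0.5) @ C.T                       # B = D^{-1/2} C'

# orthogonal A whose first row is proportional to B(mu2 - mu1)
v = B @ delta
A = np.linalg.qr(np.column_stack([v, np.eye(p)[:, 1:]]))[0].T

kernel_cov = h * A @ np.linalg.inv(D) @ A.T            # h A D^{-1} A'
print(np.round(kernel_cov, 3))                         # off-diagonal entries are nonzero
```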

This could have implications for the studies of Van Ness and Simpson[1] and Van Ness[13], in which the standard canonical forms (4.1) were adopted as a basis for simulation studies. In Van Ness and Simpson[1], five different allocation rules were compared: (i) the linear discriminant function with Σ known, (ii) the linear discriminant function with Σ unknown, (iii) the quadratic discriminant function with Σ_1 ≠ Σ_2, both unknown, (iv) product normal kernels and (v) product Cauchy kernels.

These studies found the product kernels gave superior classificatory performance to the usual linear discriminant rule (ii). This superiority was already noticeable for p = 2 and became more marked as p increased. Indeed, the allocation performance of the product kernel rules in Van Ness and Simpson[1] mirrored that of the linear discriminant function for Σ = I known! This latter phenomenon can be explained by the very large values of the smoothing parameter h used by Van Ness and Simpson[1]. These were obtained by generating additional observations from Π_1 and Π_2 and choosing h to maximize correct allocation of the additional observations. Values that resulted were, for example, h = 2.5 for p = 1 and n = 10 for the normal kernel, and h = 49.0 for p = 20 and n = 20 for the Cauchy kernel. By comparison, the value of h that minimizes the IMSE (3.1) for the normal kernel with p = 1 and n = 10 is only h = 0.575. Specht[10], moreover, has shown that as the common smoothing parameter h → ∞, the classification rule for product normal kernels with a common covariance matrix hI tends to the linear discriminant function with Σ = I known. A similar result can be shown for the product Cauchy kernel. As the studies discussed above implicitly assume Σ = I, their conclusions cannot apply to the case of correlated variables which is likely to obtain in practice. It might also be noted that if independence could be assumed, the parametric methods could be modified appropriately.
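A heuristic sketch of the limiting behaviour cited from Specht[10] (our own expansion, not taken from either paper): for the product normal kernel with M = I, a first-order expansion of each exponential for large h gives

$$ \ln \sum_{i=1}^{n} \exp\{ -\|x - x_{ti}\|^2 / (2h) \} \approx \ln n - \frac{1}{2hn} \sum_{i=1}^{n} \|x - x_{ti}\|^2, $$

so that

$$ LO_K \approx \frac{1}{2hn} \left\{ \sum_{i=1}^{n} \|x - x_{2i}\|^2 - \sum_{i=1}^{n} \|x - x_{1i}\|^2 \right\} = \frac{1}{h} \left[ (\bar x_1 - \bar x_2)' \left\{ x - \tfrac{1}{2}(\bar x_1 + \bar x_2) \right\} + d \right], $$

where d is a constant in x involving only the within-sample sums of squares. For large h, therefore, the sign of LO_K is essentially that of the linear discriminant function with Σ = I.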

5. STUDY OF THE SENSITIVITY OF THE PRODUCT KERNEL METHOD TO CORRELATION BETWEEN THE VARIABLES

In this section we consider how the product kernel method behaves when the variables are correlated and reasonable values of the smoothing parameter are used. We also investigate whether good classification behaviour is accompanied by good estimation of true log-odds. Remme et al.[14], in a simulation comparison of linear, quadratic and product kernel discrimination, noted that the product kernel method may be sensitive to the presence of correlation in the data and advised caution.

We will consider the case of equicorrelated variables where, without loss of generality,


variables may be individually scaled to unit variance and

$$ \Sigma = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}. \qquad (5.1) $$

The transformation (4.2) ensures that we may further assume that f_1(y) = N_p(0, I), f_2(y) = N_p(θ, I) and that the kernel function becomes N_p(0, hD^{-1}), where D is diagonal containing the eigenvalues of Σ, viz. λ_1 = 1 + (p − 1)ρ and λ_i = 1 − ρ otherwise, and θ = D^{-1/2}C'δ, where C may be taken as a Helmert matrix and δ = (μ_2 − μ_1) contains the standardized difference in means between the populations.

The resulting kernel density estimator of log-odds is

$$ LO_K = \ln\{ k_1(y)/k_2(y) \} = \ln \left[ \frac{ \sum_{i=1}^{n} \exp\left\{ -\tfrac{1}{2}\, h^{-1} (y - y_{1i})'\, D\, (y - y_{1i}) \right\} }{ \sum_{i=1}^{n} \exp\left\{ -\tfrac{1}{2}\, h^{-1} (y - y_{2i})'\, D\, (y - y_{2i}) \right\} } \right], $$

where, for the sake of simplicity, the sample sizes are n_1 = n_2 = n. A single smoothing parameter h was chosen to minimize the IMSE (3.1). While this admittedly gives the kernel method an advantage it would not enjoy in practice, it suffices to check if the remarkable superiority of the kernel method reported in the earlier study is maintained. The optimal value of h is a function of both n and p. The values of the parameters n, p and Δ, where Δ² = δ'Σ^{-1}δ, were chosen to be comparable to those used by Van Ness and Simpson[1], viz. n = 12, p = 1, 4, 8 and 16, and Φ(−½Δ) = 0.3, 0.2, 0.1 and 0.05, where Φ(·) denotes the standard normal distribution function.
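For reference, a minimal sketch (in the same hypothetical Python setting as before) of LO_K computed on the original, standardised data scale with kernel N_p(0, hI); a log-sum-exp shift is used purely for numerical stability.

```python
import numpy as np

def kernel_log_odds(x, X1, X2, h):
    """Product-normal-kernel log-odds LO_K = ln k1(x) - ln k2(x), kernel N_p(0, h I)."""
    def log_density(X):
        q = -0.5 * np.sum((x - X) ** 2, axis=1) / h      # exponents for each x_i
        p = X.shape[1]
        return (np.log(np.mean(np.exp(q - q.max()))) + q.max()
                - 0.5 * p * np.log(2 * np.pi * h))
    return log_density(X1) - log_density(X2)
```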

First, the case of two variables gives some insight into the more general case. For p = 2, the parameter configurations considered for δ in relation to Σ are

(i) δ_1 = δ_2 > 0, (ii) δ_1 = −δ_2, (iii) δ_1 > 0, δ_2 = 0.

After the transformation (4.2) we obtain θ = (θ_1, θ_2)' and Σ = I, with corresponding configurations

(i) θ_1 = Δ, θ_2 = 0,

(ii) θ_1 = 0, θ_2 = Δ

and

(iii) θ_1 = Δ√{½(1 − ρ)}, θ_2 = Δ√{½(1 + ρ)}.

The corresponding kernel density estimator of log-odds is

$$ LO_K = \ln \left[ \frac{ \sum_{i=1}^{n} \exp\left( -\tfrac{1}{2} h^{-1} \left[ (y_1 - y_{1i1})^2 (1 + \rho) + (y_2 - y_{1i2})^2 (1 - \rho) \right] \right) }{ \sum_{i=1}^{n} \exp\left( -\tfrac{1}{2} h^{-1} \left[ (y_1 - y^*_{i1} - \theta_1)^2 (1 + \rho) + (y_2 - y^*_{i2} - \theta_2)^2 (1 - \rho) \right] \right) } \right], \qquad (5.2) $$


where y_{1ij} and y^*_{ij} = y_{2ij} − θ_j are independently and identically distributed as N(0, 1) for j = 1, 2.

While the above expression involves the unknown parameters θ_1, θ_2 and ρ, which were introduced by the transformation to canonical form, we stress that on the original scale of the data this corresponds to use of product kernels involving no unknown parameters.

In case (i) above, the numerator and denominator of (5.2) are increasingly dominated by their second terms as ρ → −1, and LO_K tends to zero. In case (ii), the first terms dominate, and LO_K approaches zero as ρ → +1. In case (iii), we can expect inconsistent behaviour as ρ → ±1.

The above parameter configurations when generalised to the case of p > 2 variables are

(i) δ_i = δ for i = 1, 2, . . . , p,

(ii) δ_i = δ for i = 1, 2, . . . , p − 1, δ_p = −(p − 1)δ,

(iii) δ_1 = δ, δ_i = 0 otherwise,

with Σ given by (5.1), or, after transformation so that Σ = I,

(i) θ_1 = Δ, θ_i = 0 otherwise,

(ii) θ_p = Δ, θ_i = 0 otherwise,

(iii) θ_1 = Δ√{(1 − ρ)/(p[1 + (p − 2)ρ])}, θ_i = Δ√{(1 + (p − 1)ρ)/(i(i − 1)[1 + (p − 2)ρ])} for i = 2, 3, . . . , p.

For the models listed above, a simulation study was undertaken to investigate the performance of the kernel method as compared with the parametric methods in regard to classification and estimation of log odds and, in particular, the sensitivity of the product kernel method to the underlying correlation configuration.

As the choices for p were 1, 4, 8 and 16, and −(p − 1)^{-1} < ρ < 1, the values of ρ were taken to be −0.066, 0.2, 0.3, 0.4, 0.5, 0.6 and 0.8. While the first and last values admittedly correspond to extreme cases of virtual singularity in the dispersion matrix Σ, the intermediate values allow the robustness of the product kernel to the presence of correlation to be assessed. For ρ = 0, configurations (i)-(iii) coincide with the situation studied by Van Ness and Simpson[1]. Pairs of samples from a standard multivariate normal population were generated 100 times for each value of p. Corresponding pairs of samples for different values of Δ and ρ were constructed from the original pair. A further 100 observations were generated from the first population to serve as test observations for comparing classification and odds estimation for the different methods.
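The following rough sketch (all names and the fixed h are our own illustrative assumptions; the study itself chose h to minimize (3.1) and used the mean-difference configurations of Section 5 for δ) indicates how such a comparison of misclassification rates can be set up.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_kde(T, X, h):
    """Log product-normal-kernel density, kernel N_p(0, h I), at each row of T."""
    q = -0.5 * ((T[:, None, :] - X[None, :, :]) ** 2).sum(-1) / h
    p = X.shape[1]
    return (np.log(np.exp(q - q.max(axis=1, keepdims=True)).mean(axis=1))
            + q.max(axis=1) - 0.5 * p * np.log(2 * np.pi * h))

def misclassification_rates(p, rho, delta, n=12, n_test=100, n_rep=100, h=0.5):
    """Proportion of test observations from population 1 misclassified by the
    pooled-covariance LDF (sign of (2.3)) and by the product normal kernel rule."""
    Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)
    L = np.linalg.cholesky(Sigma)
    mu2 = np.asarray(delta)                                   # mu1 = 0
    err_ldf = err_ker = 0.0
    for _ in range(n_rep):
        X1 = rng.standard_normal((n, p)) @ L.T
        X2 = rng.standard_normal((n, p)) @ L.T + mu2
        T = rng.standard_normal((n_test, p)) @ L.T            # test obs from pop 1
        xb1, xb2 = X1.mean(0), X2.mean(0)
        S = ((X1 - xb1).T @ (X1 - xb1) + (X2 - xb2).T @ (X2 - xb2)) / (2 * n - 2)
        a = np.linalg.solve(S, xb1 - xb2)
        ldf = (T - 0.5 * (xb1 + xb2)) @ a                     # sign of (2.3)
        lok = log_kde(T, X1, h) - log_kde(T, X2, h)           # sign of LO_K
        err_ldf += np.mean(ldf < 0)
        err_ker += np.mean(lok < 0)
    return err_ldf / n_rep, err_ker / n_rep
```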

6. DISCUSSION OF RESULTS

The overall proportions of observations misclassified are presented in Table 1. For equal sample sizes, the parametric methods give identical classifications and a single result is presented. Similarly, only one result is presented for the kernel method when p = 1, as different values of ρ are not applicable. The results in Table 1 confirm the superiority of the kernel method for high dimension and uncorrelated variables obtained by Van Ness and Simpson[1]. However, the superiority is not as marked as in their study, and, contrary to their results, the classification performance of the kernel method was inferior to the linear discriminant function or parametric method in all cases for p ≤ 8. Moreover, when a reasonable value for the smoothing parameter is used, the rates of misclassification no longer follow the pattern of a linear discriminant function for known dispersion matrix.

The sensitivity of the kernel method to the presence of correlation in the variables may be judged from the results presented in Table 1 for cases (i)-(iii) of the parameter configurations.

[Table 1. Overall proportions of observations misclassified by the parametric and product kernel methods under correlation configurations (i)-(iii).]

[Table 2. Means of different estimators of log odds for various cases of correlation structure (sample sizes n_1 = n_2 = 12). Rows correspond to the distances Δ = 1.049, 1.683, 2.563, 3.290 (probabilities of misclassification .30, .20, .10, .05) and numbers of variables p = 1, 4, 8, 16; columns give the parametric estimators LO(E), LO(U), LO(P), then LO(K) for all cases with ρ = 0, and LO(K) for cases (i), (ii) and (iii) at ρ = −.066, .2, .4 and .8.]


[Table 3. Unconditional mean square errors of the estimators of log odds for the various correlation structures; entries for the predictive and kernel methods were determined from the simulation study.]


In case (i), positive correlation is helpful while negative correlation is not. In the admittedly pathological case of negative correlation and large p, when the dispersion matrix is nearly singular, the kernel method collapses completely. In case (ii) the deterioration of the kernel with increasing positive correlation is exhibited as planned, and we note that inferiority to the parametric procedure sets in for quite modest values of ρ. In case (iii), deterioration in the performance of the kernel method shows up at both ends of the correlation scale, although it can still outperform the parametric methods if the populations are sufficiently well separated.

The comparison of the parametric methods and the kernel method as estimators of log-odds gives some insight into their classificatory performance. Table 2 gives expected log-odds, with expectations taken over repeated samples and repeated test observations. Figures for the parametric methods are exact and are based on analytical results in Moran and Murphy[4] and Murphy[12]; those for the kernel method are based on simulation. Corresponding unconditional mean squares are presented in Table 3. The figures for both the predictive and kernel methods are now determined from the simulation study. Comparison of the three parametric methods shows clearly the superiority of the predictive method. The linear discriminant function LO_E should never be used to estimate log odds without a correction for bias. Even with this correction the resulting estimator LO_U is still inferior to the predictive LO_P, particularly when the number of variables is large relative to the sample size. The primary interest then, in relation to estimation of odds, focuses on the predictive and kernel estimators. It will be seen from Table 2 that both the parametric and the kernel methods underestimate the true odds when ρ = 0. This case of independent variables is, however, especially favourable to the product kernel method, and the associated mean square errors of LO_K in Table 3, in line with the classificatory performance, are smaller than those for the predictive method, especially as the number of variables increases. For correlated variables, results for the predictive method are unchanged, being independent of ρ, but those for the kernel vary with the degree of correlation. In the case of the parametric configuration (i), the beneficial effect of increasing positive correlation on classification by the kernel method is reflected in a progressive overestimation of true odds. This overestimation can be considerable, as may be seen from the means in Table 2 and the sharply increasing mean square errors for LO_K in Table 3. In the cases of the structures (ii) and (iii), the kernel method underestimates log odds to a greater extent than does the predictive method, and, in terms of mean square error, LO_K is progressively inferior to LO_P as ρ increases. This is consistent with the corresponding classificatory performance. The pathological case of ρ = −0.066 is interesting in that the mean square error of the kernel estimator is relatively modest by comparison with other values of ρ, but underestimation is so serious that bias is a very substantial component of the mean square error. As a result, the kernel estimator of log odds is very liable to have the wrong sign and result in very poor classification.

7. CONCLUSIONS

For independent or moderately positively correlated variables with normal distributions, the product kernel method for discriminant analysis is superior to the conventional parametric methods with regard to classification and estimation of odds. In the presence of other patterns of correlation, the kernel method can behave erratically and may give poor classification or poor estimation of odds, or both, depending on the underlying parametric configuration.

REFERENCES

1. J. Van Ness and C. Simpson, On the effects of dimension in discriminant analysis. Technometrics 18, 175-187 (1976).
2. B. L. Welch, Note on discriminant functions. Biometrika 31, 218-220 (1939).
3. R. A. Fisher, The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179-188 (1936).
4. M. A. Moran and B. J. Murphy, A closer look at two alternative methods of statistical discrimination. Appl. Stat. 28, 223-230 (1979).
5. J. Aitchison, J. D. F. Habbema and J. W. Kay, A critical comparison of two methods of statistical discrimination. Appl. Stat. 26, 15-25 (1977).
6. E. J. Wegman, Nonparametric probability density estimation I: A summary of available methods. Technometrics 14, 533-546 (1972).
7. M. J. Fryer, A review of some non-parametric methods of density estimation. J. Inst. Math. Appl. 20, 335-354 (1977).


8. T. Cacoullos, Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 179-189 (1966).
9. V. A. Epanechnikov, Nonparametric estimation of a multivariate probability density. Theory Prob. Appl. U.S.S.R. 14, 153-158 (1969).
10. D. F. Specht, Generation of polynomial discriminant functions for pattern recognition. IEEE Trans. Comput. EC-16, 308-319 (1967).
11. J. D. F. Habbema, J. Hermans and K. Van Den Broek, A stepwise discriminant analysis program using density estimation, in Compstat 1974 (Edited by G. Bruckmann), pp. 101-110. Physica Verlag, Vienna (1974).
12. B. J. Murphy, On classical and other methods of discriminant analysis and estimation of log-odds. Ph.D. Thesis, University College, Cork (1981).
13. J. Van Ness, On the effects of dimension in discriminant analysis for unequal covariance populations. Technometrics 21, 119-127 (1979).
14. J. Remme, J. D. F. Habbema and J. Hermans, A simulative comparison of linear, quadratic and kernel discrimination. J. Stat. Comput. Simul. 11, 87-106 (1980).
15. S. Das Gupta, Some aspects of discrimination function coefficients. Sankhya Ser. A 30, 387-400 (1968).