Biometrics on the Lake IBS Australian Regional Conference 2009 Taupo, New Zealand, 29 Nov - 3 Dec

Comparison of the performance of QDF with that of the discriminant function (AEDC) based on absolute deviation from the mean

S. Ganesalingam*, S. Ganesh* and A. Nanthakumar#

*Institute of Fundamental Sciences, Massey University, New Zealand

#Department of Mathematics, SUNY Oswego, USA

IBS2009Taupo, NZ


Abstract

The estimation of error rates is of vital importance in classification problems, as it is used to choose the best discriminant function, i.e. the one with the minimum misclassification error.

Consider the problem of statistical discrimination involving two multivariate normal distributions with equal means but different covariance matrices. Traditionally, a quadratic discriminant function (QDF) is used to separate two such populations. Ganesalingam and Ganesh (2004) introduced a linear discriminant function called the ‘Absolute Euclidean Distance Classifier (AEDC)’ and compared its performance with that of the QDF on simulated data in terms of their associated misclassification error rates. In this paper, approximate analytical expressions for the overall error rates associated with the AEDC and the QDF are derived and computed for various covariance structures in a simulation exercise, which serves as a benchmark for comparison.

A further approximation introduced in this paper reduces the amount of computation involved. It also provides a closed-form expression for the tail areas of most symmetric distributions, which is useful in many practical situations such as the computation of misclassification errors in discriminant analysis.

Introduction

The choice of a discriminant function is mainly determined by the associated error rates. Hence the estimation of error rates is of vital importance in classification problems.

Hand (1986) quoted Glick (1978) on the importance of error-rate estimation:

“The task of estimating the probabilities of correct classification confronts the statistician simultaneously with difficult distribution theory, questions intervening sample size, and dimension, problems of bias, variance, robustness, and computation costs. But, coping with such conflicting concerns (at least in my experience) enhances understanding of many aspects of statistical classification and stimulates insight into general methodology of estimation”.

Introduction…

Consider the problem of statistical discrimination involving two multivariate normal populations Π1 and Π2 with mean vectors µ1 and µ2 and covariance matrices Σ1 and Σ2 respectively.

Further assume, without loss of generality, that Σ1 > Σ2, i.e. Π1 has a larger covariance structure than Π2.

These parameters are not generally known. The discriminant function which would normally be used in such a situation is the ‘quadratic discriminant function (QDF)’, which allocates an object with observation vector x to Π1 if

\[
\ln\frac{|\Sigma_2|}{|\Sigma_1|} - (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) \ge 0, \tag{1}
\]

otherwise it is allocated to Π2 (see, for example, Morrison (1990)). In the above allocation rule and throughout this paper, we assume equal priors and equal costs of misclassification.

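The quadratic rule can be sketched in a few lines for the bivariate case. This is a minimal illustration assuming equal priors and equal misclassification costs: the score is the log-determinant ratio plus the difference of the two quadratic forms, and a non-negative score allocates x to Π1. The function names (`inverse_2x2`, `quad_form`, `qdf_score`) and the example matrices are hypothetical and are not taken from the talk.

```python
import math


def inverse_2x2(S):
    """Return (inverse, determinant) of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def quad_form(u, M):
    """Compute u^T M u for a 2-vector u and a 2x2 matrix M."""
    return (u[0] * (M[0][0] * u[0] + M[0][1] * u[1])
            + u[1] * (M[1][0] * u[0] + M[1][1] * u[1]))


def qdf_score(x, mu1, S1, mu2, S2):
    """Quadratic discriminant score; allocate x to population 1 when it is >= 0."""
    S1_inv, D1 = inverse_2x2(S1)
    S2_inv, D2 = inverse_2x2(S2)
    d1 = [x[0] - mu1[0], x[1] - mu1[1]]
    d2 = [x[0] - mu2[0], x[1] - mu2[1]]
    return math.log(D2 / D1) - quad_form(d1, S1_inv) + quad_form(d2, S2_inv)


# Illustrative (made-up) inputs: equal zero means, population 1 more dispersed.
S1 = [[4.0, 1.0], [1.0, 3.0]]
S2 = [[1.0, 0.2], [0.2, 1.0]]
zero = [0.0, 0.0]
near = qdf_score([0.1, 0.1], zero, S1, zero, S2)   # point close to the common mean
far = qdf_score([4.0, -3.0], zero, S1, zero, S2)   # point far from the common mean
```

With equal means the rule depends only on how far x lies from the common centre: points near the centre get a negative score (the tighter population Π2), points far away get a positive score (the more dispersed Π1).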

Introduction…

However, if Σ1 = Σ2 = Σ, then the object with observation vector x is allocated (using the well-known ‘linear discriminant function (LDF)’) to population Π1 if

\[
\left[x - \tfrac{1}{2}(\mu_1 + \mu_2)\right]^T \Sigma^{-1} (\mu_1 - \mu_2) \ge 0, \tag{2}
\]

otherwise it is allocated to Π2. The ‘Euclidean distance classifier (EDC)’ ignores the covariance matrices and allocates an individual with observation vector x according to the following rule: allocate the observation vector x to Π1 if

\[
\left[x - \tfrac{1}{2}(\mu_1 + \mu_2)\right]^T (\mu_1 - \mu_2) \ge 0, \tag{3}
\]

otherwise it is allocated to Π2.

Introduction…

It has been shown that the EDC may perform better than the linear discriminant function under certain circumstances.

Note that, in their original forms, neither the EDC nor the LDF can be used when µ1 = µ2, since both rules depend on the data only through µ1 − µ2.

We thus consider the ‘Absolute Euclidean Distance Classifier (AEDC)’, whereby the absolute values of the components of the observation vector X are used in the EDC.

The expectation is that it may do well, particularly in high-dimensional settings, since it is also a form of regularisation.

In practice Σ1 ≠ Σ2, and in such a situation the main alternatives are to use the QDF on the raw data or the AEDC based on the absolute values of the deviations of the observations from the mean.

(See Ganesalingam and Ganesh (2004) for comparisons of QDF and AEDC in discriminating two bivariate normal populations, and Ganesalingam et al. (2006) for two normal populations with more than 2 variables.)


Motivation: Case Study…

Here, we wish to explore the estimation of error rates using different methods and see how they compare by means of a real-life case study.

The data set comes from an anthropological study undertaken at the University of Hamburg, Germany, and is reported in Flury (1997). The data consist of 89 pairs of male twins.

Of the 89 pairs, 40 are dizygotic and 49 are monozygotic. There are six variables for each pair of twins: stature, hip width and chest circumference for each of the two brothers. Taking the difference between the first and the second twin, we used only the two variables difference in hip width and difference in chest circumference, and treated this as a two-dimensional classification problem.

Motivation: Case Study…

Let Σ1 and Σ2 be the respective covariance matrices of the dizygotic and monozygotic populations. We may utilise the estimates from the given data:

\[
\hat{\Sigma}_1 = \begin{pmatrix} 23.2256 & 5.3114 \\ 5.3114 & 15.3272 \end{pmatrix}
\quad\text{and}\quad
\hat{\Sigma}_2 = \begin{pmatrix} 0.8154 & 0.3967 \\ 0.3967 & 5.9682 \end{pmatrix}.
\]

As expected, the estimates of the means of the monozygotic and dizygotic populations are close to zero:

\[
\hat{\mu}_1 = \begin{pmatrix} 0.155 \\ 0.675 \end{pmatrix}
\quad\text{and}\quad
\hat{\mu}_2 = \begin{pmatrix} 0.1 \\ 0.632 \end{pmatrix}.
\]

This is understandable because, by nature, twins are bound to have similar values for each of the variables in the original study; hence the differences are expected to be zero or near zero, and thus the means of the differences are zero or close to zero. This is typically the situation in which the linear discriminant function fails and we resort to the QDF or the AEDC.

Introduction…

In this talk, our attention is focused on:

The analytical computation of the actual misclassification error rates associated with AEDC and QDF in a two-dimensional situation (p = 2), discriminating two normally distributed populations (with equal means and unequal covariance matrices).

A ‘numerical-integrated’ approach to computing these actual error rates.

A ‘triangular distribution’ based approximation to these error rates.

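Probability density function of Y=|X|

Let us consider the bivariate normal observation vector $x = (x_1\ x_2)^T$, and say this vector has a probability density function g(x) with mean zero and variance-covariance matrix

\[
\Sigma_x = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}.
\]

If $y = (y_1, y_2)^T = (|x_1|, |x_2|)^T$, with $|x_i|$ denoting the absolute value of $x_i$, then

\[
f(y) = g(x_1, x_2) + g(-x_1, x_2) + g(x_1, -x_2) + g(-x_1, -x_2) = 2g(x_1, x_2) + 2g(-x_1, x_2).
\]

The mean vector $\mu_y$ and covariance matrix $\Sigma_y$ of Y can be shown to be

\[
\mu_y = \sqrt{\tfrac{2}{\pi}}\,\bigl(\sqrt{\sigma_{11}},\ \sqrt{\sigma_{22}}\bigr)^T
\quad\text{and}\quad
\Sigma_y = \begin{pmatrix} \sigma_{11}\bigl(1 - \tfrac{2}{\pi}\bigr) & \mathrm{cov}(y_1, y_2) \\ \mathrm{cov}(y_1, y_2) & \sigma_{22}\bigl(1 - \tfrac{2}{\pi}\bigr) \end{pmatrix}, \tag{4}
\]

where $\mathrm{cov}(y_1, y_2)$ is a function of $\sigma_{11}$, $\sigma_{22}$ and $\sigma_{12}$.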

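The moments of Y = |X| can be checked by simulation. The sketch below assumes a zero-mean bivariate normal X; the means and variances are the folded-normal values σ√(2/π) and σ²(1 − 2/π). For the covariance it uses the standard textbook expression for E|X1 X2| of a bivariate normal, which is an assumption brought in here and is not read off the slide. Function names are hypothetical.

```python
import math
import random


def abs_normal_moments(s11, s22, s12):
    """Means, variances and covariance of (|X1|, |X2|) for zero-mean normal X."""
    mean = (math.sqrt(2.0 * s11 / math.pi), math.sqrt(2.0 * s22 / math.pi))
    var = (s11 * (1.0 - 2.0 / math.pi), s22 * (1.0 - 2.0 / math.pi))
    # Standard result: E|X1 X2| = (2/pi) s1 s2 (sqrt(1 - rho^2) + rho asin(rho)).
    rho = s12 / math.sqrt(s11 * s22)
    e_abs_prod = (2.0 / math.pi) * math.sqrt(s11 * s22) * (
        math.sqrt(1.0 - rho * rho) + rho * math.asin(rho))
    return mean, var, e_abs_prod - mean[0] * mean[1]


def simulate_abs_moments(s11, s22, s12, n=200_000, seed=2009):
    """Monte Carlo estimates of the same quantities."""
    rng = random.Random(seed)
    a = math.sqrt(s11)            # x1 = a*z1
    b = s12 / a                   # x2 = b*z1 + c*z2 has variance s22, cov s12
    c = math.sqrt(s22 - b * b)
    s1 = s2 = q1 = q2 = p = 0.0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y1, y2 = abs(a * z1), abs(b * z1 + c * z2)
        s1 += y1
        s2 += y2
        q1 += y1 * y1
        q2 += y2 * y2
        p += y1 * y2
    m1, m2 = s1 / n, s2 / n
    return (m1, m2), (q1 / n - m1 * m1, q2 / n - m2 * m2), p / n - m1 * m2


# Inputs: the dizygotic covariance estimates quoted in the case study.
exact = abs_normal_moments(23.2256, 15.3272, 5.3114)
approx = simulate_abs_moments(23.2256, 15.3272, 5.3114)
```

Comparing `exact` with `approx` entry by entry gives a quick sanity check on the closed-form moments before they are used to build the classifier.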

Discrimination using absolute values

Now, we give the Euclidean distance classifier (EDC) based on the absolute values of the original observation vector for bivariate normal data, i.e. the AEDC.

Recall that the EDC allocates an individual observation vector x to population 1 according to the following rule (also given as (3)): allocate to population 1 if

\[
\left[x - \tfrac{1}{2}(\mu_1 + \mu_2)\right]^T (\mu_1 - \mu_2) \ge 0. \tag{3}
\]

Under the assumption of equal means, and using the absolute values Y = |X|, this rule takes the form: allocate a two-dimensional observation vector x to population 1 if

\[
\bigl(\mu_1^{(1)} - \mu_1^{(2)}\bigr) y_1 + \bigl(\mu_2^{(1)} - \mu_2^{(2)}\bigr) y_2
- \tfrac{1}{2}\Bigl\{\bigl(\mu_1^{(1)}\bigr)^2 - \bigl(\mu_1^{(2)}\bigr)^2\Bigr\}
- \tfrac{1}{2}\Bigl\{\bigl(\mu_2^{(1)}\bigr)^2 - \bigl(\mu_2^{(2)}\bigr)^2\Bigr\} \ge 0, \tag{5}
\]

where $\mu_i^{(k)}$ is the mean of the ith component of the observation vector y in the kth population, for i = 1, 2 and k = 1, 2.

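Discrimination using absolute values…

So, the classifier AEDC reads as: allocate the observation vector x (or y) to population 1 if (using (4))

\[
\sqrt{\tfrac{2}{\pi}}\Bigl(\sqrt{\sigma_{11}^{(1)}} - \sqrt{\sigma_{11}^{(2)}}\Bigr)
\Bigl[y_1 - \tfrac{1}{2}\sqrt{\tfrac{2}{\pi}}\Bigl(\sqrt{\sigma_{11}^{(1)}} + \sqrt{\sigma_{11}^{(2)}}\Bigr)\Bigr]
+ \sqrt{\tfrac{2}{\pi}}\Bigl(\sqrt{\sigma_{22}^{(1)}} - \sqrt{\sigma_{22}^{(2)}}\Bigr)
\Bigl[y_2 - \tfrac{1}{2}\sqrt{\tfrac{2}{\pi}}\Bigl(\sqrt{\sigma_{22}^{(1)}} + \sqrt{\sigma_{22}^{(2)}}\Bigr)\Bigr] \ge 0,
\]

else to population 2. Here $\sigma_{jj}^{(k)}$ is the variance of $X_j$ in population k, k = 1, 2.

This means: allocate an observation x to population 1 if

\[
\Bigl(\sqrt{\sigma_{11}^{(1)}} - \sqrt{\sigma_{11}^{(2)}}\Bigr) y_1 + \Bigl(\sqrt{\sigma_{22}^{(1)}} - \sqrt{\sigma_{22}^{(2)}}\Bigr) y_2
\ge \frac{1}{\sqrt{2\pi}}\Bigl(\sigma_{11}^{(1)} - \sigma_{11}^{(2)} + \sigma_{22}^{(1)} - \sigma_{22}^{(2)}\Bigr), \tag{6}
\]

otherwise to population 2.

When expressed in terms of Y, (6) takes the form (using (4))

\[
\bigl(\bar{Y}_{11} - \bar{Y}_{21}\bigr) y_1 + \bigl(\bar{Y}_{12} - \bar{Y}_{22}\bigr) y_2
\ge \tfrac{1}{2}\bigl(\bar{Y}_{11}^2 - \bar{Y}_{21}^2 + \bar{Y}_{12}^2 - \bar{Y}_{22}^2\bigr),
\]

where $\bar{Y}_{kj}$ is the mean of the jth component of Y in Πk (k, j = 1, 2).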

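Rule (6) is linear in the absolute values, so it reduces to three coefficients. The sketch below computes them for the covariance estimates quoted in the case study and allocates a point, assuming both population means are zero (so that x itself is the deviation from the mean). The names `aedc_coefficients` and `aedc_allocate` are hypothetical.

```python
import math


def aedc_coefficients(S1, S2):
    """Coefficients (c1, c2, c3) of the AEDC boundary c1*|x1| + c2*|x2| = c3."""
    c1 = math.sqrt(S1[0][0]) - math.sqrt(S2[0][0])
    c2 = math.sqrt(S1[1][1]) - math.sqrt(S2[1][1])
    c3 = ((S1[0][0] - S2[0][0]) + (S1[1][1] - S2[1][1])) / math.sqrt(2.0 * math.pi)
    return c1, c2, c3


def aedc_allocate(x, S1, S2):
    """Return 1 or 2: the population chosen for x, assuming zero means."""
    c1, c2, c3 = aedc_coefficients(S1, S2)
    return 1 if c1 * abs(x[0]) + c2 * abs(x[1]) >= c3 else 2


# Covariance estimates quoted in the case study (1 = dizygotic, 2 = monozygotic).
S1_HAT = [[23.2256, 5.3114], [5.3114, 15.3272]]
S2_HAT = [[0.8154, 0.3967], [0.3967, 5.9682]]
c1, c2, c3 = aedc_coefficients(S1_HAT, S2_HAT)
small_difference = aedc_allocate([0.5, 1.0], S1_HAT, S2_HAT)
large_difference = aedc_allocate([4.0, -3.0], S1_HAT, S2_HAT)
```

Only the diagonal elements of the covariance matrices enter the rule, which is why no matrix inversion is needed.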

Analytical Expression… (AEDC)

(for the misclassification error rates associated with AEDC)

Here, we attempt to give, for the bivariate case, an analytical expression for the actual overall misclassification error rate. The derivation is as follows.

Let $P_{ij}$ be the probability of misclassifying an observation from Πi into Πj (i, j = 1, 2).

Thus we have $P_{12} = \Pr[c_1 y_1 + c_2 y_2 \le c_3 \mid y \in \Pi_1 \ \&\ y_1, y_2 \ge 0]$, which in terms of the original x's reads

\[
P_{12} = \mathrm{Prob}\left[
\begin{array}{l}
\{c_1 x_1 + c_2 x_2 \le c_3 \mid x_1 \ge 0;\ x_2 \ge 0\} \ \text{or}\ \{c_2 x_2 - c_1 x_1 \le c_3 \mid x_1 \le 0;\ x_2 \ge 0\} \ \text{or} \\
\{c_1 x_1 - c_2 x_2 \le c_3 \mid x_1 \ge 0;\ x_2 \le 0\} \ \text{or}\ \{-c_1 x_1 - c_2 x_2 \le c_3 \mid x_1 \le 0;\ x_2 \le 0\}
\end{array}
\right], \tag{7}
\]

where

\[
c_1 = \sqrt{\sigma_{11}^{(1)}} - \sqrt{\sigma_{11}^{(2)}},\quad
c_2 = \sqrt{\sigma_{22}^{(1)}} - \sqrt{\sigma_{22}^{(2)}},\quad
c_3 = \frac{1}{\sqrt{2\pi}}\Bigl[\bigl(\sigma_{11}^{(1)} - \sigma_{11}^{(2)}\bigr) + \bigl(\sigma_{22}^{(1)} - \sigma_{22}^{(2)}\bigr)\Bigr].
\]

Note that the inequalities in (7) can be easily identified as together defining a parallelogram, which we will call ‘region A’.


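Thus we have,

\[
P_{12} = \iint_A f(x_1, x_2)\,dx_1\,dx_2
= \frac{1}{2\pi\sqrt{D_1}} \iint_A \exp\Bigl\{-\tfrac{1}{2}\Bigl[\bigl(\gamma_{22}^{(1)} x_2 + \gamma_{12}^{(1)} x_1\bigr)^2 + \bigl(\gamma_{11}^{(1)} x_1\bigr)^2\Bigr]\Bigr\}\,dx_1\,dx_2, \tag{8}
\]

where $\gamma_{ij}^{(k)}$ are the elements of the upper triangular matrix Γ such that $\Gamma\,\Gamma^T = \Sigma_1^{-1}$, the Cholesky decomposition of the matrix $\Sigma_1^{-1}$, and are given by

\[
\gamma_{11}^{(1)} = \frac{1}{\sqrt{\sigma_{11}^{(1)}}};\quad
\gamma_{12}^{(1)} = \frac{-\sigma_{12}^{(1)}}{\sqrt{\sigma_{11}^{(1)} D_1}};\quad
\gamma_{22}^{(1)} = \sqrt{\frac{\sigma_{11}^{(1)}}{D_1}},
\]

and $D_1 = \det(\Sigma_1) = \sigma_{11}^{(1)} \sigma_{22}^{(1)} - \bigl(\sigma_{12}^{(1)}\bigr)^2$.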

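The role of the triangular factor is to split the quadratic form in the exponent of the normal density into two squares. A small check of that identity follows, under the assumption that the elements are γ11 = 1/√σ11, γ12 = −σ12/√(σ11 D) and γ22 = √(σ11/D) (the element formulas, including the sign of γ12, are reconstructed here and should be treated as an assumption). Function names are hypothetical.

```python
import math


def gamma_elements(S):
    """Elements (g11, g12, g22) of upper-triangular G with G G^T = inverse(S)."""
    s11, s12, s22 = S[0][0], S[0][1], S[1][1]
    D = s11 * s22 - s12 * s12
    g11 = 1.0 / math.sqrt(s11)
    g12 = -s12 / math.sqrt(s11 * D)
    g22 = math.sqrt(s11 / D)
    return g11, g12, g22


def exponent_via_gamma(x1, x2, S):
    """Quadratic form written as the two squares used in the integrand."""
    g11, g12, g22 = gamma_elements(S)
    return (g22 * x2 + g12 * x1) ** 2 + (g11 * x1) ** 2


def exponent_direct(x1, x2, S):
    """The same quadratic form x^T inverse(S) x, computed from the 2x2 inverse."""
    s11, s12, s22 = S[0][0], S[0][1], S[1][1]
    D = s11 * s22 - s12 * s12
    return (s22 * x1 * x1 - 2.0 * s12 * x1 * x2 + s11 * x2 * x2) / D


S1_HAT = [[23.2256, 5.3114], [5.3114, 15.3272]]
points = [(1.0, 2.0), (-3.0, 0.5), (4.0, -7.0)]
gaps = [abs(exponent_via_gamma(a, b, S1_HAT) - exponent_direct(a, b, S1_HAT))
        for a, b in points]
```

Because the first square involves x2 only through γ22·x2 + γ12·x1, the inner integral over x2 becomes a difference of two standard normal distribution functions.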

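Using the symmetry of the region A (of integration), we may re-write (8) as

\[
P_{12} = \frac{1}{\pi\sqrt{D_1}} \int_{x_1 = 0}^{c_3/c_1} \int_{x_2 = \underline{x}_2}^{\bar{x}_2}
\exp\Bigl\{-\tfrac{1}{2}\Bigl[\bigl(\gamma_{22}^{(1)} x_2 + \gamma_{12}^{(1)} x_1\bigr)^2 + \bigl(\gamma_{11}^{(1)} x_1\bigr)^2\Bigr]\Bigr\}\,dx_2\,dx_1.
\]

Thus,

\[
P_{12} = \frac{1}{\pi\sqrt{D_1}} \int_{x_1 = 0}^{c_3/c_1}
\left[\int_{\underline{x}_2}^{\bar{x}_2} \exp\Bigl\{-\tfrac{1}{2}\bigl(\gamma_{22}^{(1)} x_2 + \gamma_{12}^{(1)} x_1\bigr)^2\Bigr\}\,dx_2\right]
\exp\Bigl\{-\tfrac{1}{2}\bigl(\gamma_{11}^{(1)} x_1\bigr)^2\Bigr\}\,dx_1,
\]

which can be easily shown to be

\[
P_{12} = \frac{\sqrt{2}}{\gamma_{22}^{(1)}\sqrt{\pi D_1}} \int_{x_1 = 0}^{c_3/c_1}
\Bigl[\Phi\bigl(\gamma_{22}^{(1)} \bar{x}_2 + \gamma_{12}^{(1)} x_1\bigr) - \Phi\bigl(\gamma_{22}^{(1)} \underline{x}_2 + \gamma_{12}^{(1)} x_1\bigr)\Bigr]
\exp\Bigl\{-\tfrac{1}{2}\bigl(\gamma_{11}^{(1)} x_1\bigr)^2\Bigr\}\,dx_1, \tag{9}
\]

where

\[
0 \le x_1 \le \frac{c_3}{c_1},\quad
\underline{x}_2 = -\frac{c_3}{c_2} + \frac{c_1}{c_2} x_1,\quad
\bar{x}_2 = \frac{c_3}{c_2} - \frac{c_1}{c_2} x_1,
\]

and Φ(.) represents the cumulative distribution function of the N(0,1) distribution.

The misclassification error rate $P_{21}$ can be defined in a similar manner, replacing $\sigma_{ij}^{(1)}$ by $\sigma_{ij}^{(2)}$ and $D_1$ by $D_2$ (and $\gamma_{ij}^{(1)}$ by $\gamma_{ij}^{(2)}$).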

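Analytical Expression… (QDF)

We shall consider the case of equal means, µ1 = µ2 = µ = (µ1 µ2)T, say, and derive expressions for P12 and P21.

Under this scenario, the QDF will allocate observation x to population 1 if (derived from (1))

\[
\mathrm{QDF} = (x - \mu)^T \bigl(\Sigma_2^{-1} - \Sigma_1^{-1}\bigr)(x - \mu) + \ln\frac{|\Sigma_2|}{|\Sigma_1|} \ge 0.
\]

Using the notation used so far, we may write (for the vector x = (x1 x2)T)

\[
\mathrm{QDF} = \alpha_1 \bigl(x_1 - \mu_1 + \alpha_2 (x_2 - \mu_2)\bigr)^2 + \alpha_3 (x_2 - \mu_2)^2 + \alpha_4, \tag{10}
\]

where $\alpha_4 = \ln\dfrac{D_2}{D_1}$ and $\alpha_1$, $\alpha_2$, $\alpha_3$ are given in (11) (see next slide).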


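\[
\alpha_1 = \frac{\sigma_{22}^{(2)}}{D_2} - \frac{\sigma_{22}^{(1)}}{D_1},\quad
\alpha_2 = \frac{1}{\alpha_1}\left(\frac{\sigma_{12}^{(1)}}{D_1} - \frac{\sigma_{12}^{(2)}}{D_2}\right)
\ \&\
\alpha_3 = \frac{\sigma_{11}^{(2)}}{D_2} - \frac{\sigma_{11}^{(1)}}{D_1} - \alpha_1 \alpha_2^2. \tag{11}
\]

Consider the case of ‘given x2 and x ∈ Π2’. The QDF can be written as (say, $Q_C = \mathrm{QDF} \mid x_2, \Pi_2$)

\[
Q_C = \alpha_1 (x_1 - \mu_1 + \beta_1)^2 + \beta_2 = Y^2 + \beta_2, \ \text{say},
\]

where

\[
\beta_1 = \alpha_2 (x_2 - \mu_2),\quad
\beta_2 = \alpha_3 (x_2 - \mu_2)^2 + \alpha_4
\ \&\
Y = \sqrt{\alpha_1}\,(x_1 - \mu_1 + \beta_1).
\]

We need to derive a distribution for $Q_C$ in order to evaluate the error rates when applying the model to classify an observation. So, first consider the distribution of Y and then that of $Q_C$.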

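The completed-square form of the QDF can be checked against the matrix form numerically. In this sketch the four coefficients are written `a1` to `a4` (hypothetical names standing for the coefficients of (10) and (11), whose exact symbols are an assumption here), and `u1`, `u2` denote the deviations of x from the common mean.

```python
import math


def determinant(S):
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]


def qdf_coefficients(S1, S2):
    """Coefficients of a1*(u1 + a2*u2)^2 + a3*u2^2 + a4 for the equal-mean QDF."""
    D1, D2 = determinant(S1), determinant(S2)
    a1 = S2[1][1] / D2 - S1[1][1] / D1
    a2 = (S1[0][1] / D1 - S2[0][1] / D2) / a1
    a3 = S2[0][0] / D2 - S1[0][0] / D1 - a1 * a2 * a2
    a4 = math.log(D2 / D1)
    return a1, a2, a3, a4


def qdf_completed_square(u1, u2, S1, S2):
    a1, a2, a3, a4 = qdf_coefficients(S1, S2)
    return a1 * (u1 + a2 * u2) ** 2 + a3 * u2 * u2 + a4


def qdf_matrix_form(u1, u2, S1, S2):
    """u^T (inv(S2) - inv(S1)) u + ln(det S2 / det S1), written out for 2x2."""
    D1, D2 = determinant(S1), determinant(S2)
    m11 = S2[1][1] / D2 - S1[1][1] / D1
    m12 = -(S2[0][1] / D2 - S1[0][1] / D1)
    m22 = S2[0][0] / D2 - S1[0][0] / D1
    return m11 * u1 * u1 + 2.0 * m12 * u1 * u2 + m22 * u2 * u2 + math.log(D2 / D1)


S1_HAT = [[23.2256, 5.3114], [5.3114, 15.3272]]
S2_HAT = [[0.8154, 0.3967], [0.3967, 5.9682]]
coeffs = qdf_coefficients(S1_HAT, S2_HAT)
points = [(0.0, 0.0), (1.5, -2.0), (-4.0, 3.0)]
gaps = [abs(qdf_completed_square(a, b, S1_HAT, S2_HAT)
            - qdf_matrix_form(a, b, S1_HAT, S2_HAT)) for a, b in points]
```

Writing the QDF this way isolates x1 in a single square, which is what allows the conditional distribution given x2 to be handled as a shifted, scaled non-central chi-square.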


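For x ∈ Π2 (for convenience, we shall first consider P21),

\[
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \sigma_{11}^{(2)} & \sigma_{12}^{(2)} \\ \sigma_{21}^{(2)} & \sigma_{22}^{(2)} \end{pmatrix}\right)
\ \Rightarrow\
X_1 \mid X_2 = x_2 \sim N\bigl(\tilde{\mu}_1, \tilde{\sigma}_1^2\bigr),
\]

\[
\text{where}\quad \tilde{\mu}_1 = \mu_1 + \frac{\sigma_{12}^{(2)}}{\sigma_{22}^{(2)}}(x_2 - \mu_2)
\ \&\
\tilde{\sigma}_1^2 = \sigma_{11}^{(2)} - \frac{\bigl(\sigma_{12}^{(2)}\bigr)^2}{\sigma_{22}^{(2)}}.
\]

So, $E(Y \mid x_2, \Pi_2)$ & $V(Y \mid x_2, \Pi_2)$ can be written as

\[
\mu_Y = E_{x_2, \Pi_2}(Y) = \sqrt{\alpha_1}\,\bigl[E_{x_2, \Pi_2}(x_1) - \mu_1 + \beta_1\bigr]
= \sqrt{\alpha_1}\left[\frac{\sigma_{12}^{(2)}}{\sigma_{22}^{(2)}}(x_2 - \mu_2) + \beta_1\right],
\]

\[
\sigma_Y^2 = V_{x_2, \Pi_2}(Y) = \alpha_1\,V_{x_2, \Pi_2}(x_1)
= \alpha_1\left[\sigma_{11}^{(2)} - \frac{\bigl(\sigma_{12}^{(2)}\bigr)^2}{\sigma_{22}^{(2)}}\right].
\]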


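Hence, $Y \mid x_2, \Pi_2 \sim N(\mu_Y, \sigma_Y^2)$, so that

\[
\frac{Y}{\sigma_Y} \sim N\left(\frac{\mu_Y}{\sigma_Y}, 1\right)
\quad\text{and}\quad
\left(\frac{Y}{\sigma_Y}\right)^2 \sim \chi_1^2(\lambda), \ \text{say, with}\ \lambda = \frac{\mu_Y^2}{\sigma_Y^2},
\]

i.e. chi-square with 1 d.f. and non-centrality parameter λ.

Note that

\[
\frac{Q_C - \beta_2}{\sigma_Y^2} = \left(\frac{Y}{\sigma_Y}\right)^2 = V, \ \text{say, since}\ Q_C = Y^2 + \beta_2,
\]

and

\[
f_V(v) = 2^{-3/2}\,\pi^{-1/2}\,v^{-1/2}\,e^{-\frac{1}{2}(v + \lambda)}\left(e^{\sqrt{\lambda v}} + e^{-\sqrt{\lambda v}}\right), \quad v > 0.
\]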


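Hence, the density of $Q_C$ (i.e. QDF given x2 & x ∈ Π2) can be shown to be

\[
f_{Q_C}(q_c) = \frac{2^{-3/2}\,\pi^{-1/2}}{\sigma_Y^2}
\left(\frac{q_c - \beta_2}{\sigma_Y^2}\right)^{-1/2}
\exp\left\{-\frac{1}{2}\left(\frac{q_c - \beta_2}{\sigma_Y^2} + \frac{\mu_Y^2}{\sigma_Y^2}\right)\right\}
\left[\exp\left\{\frac{\mu_Y}{\sigma_Y}\sqrt{\frac{q_c - \beta_2}{\sigma_Y^2}}\right\}
+ \exp\left\{-\frac{\mu_Y}{\sigma_Y}\sqrt{\frac{q_c - \beta_2}{\sigma_Y^2}}\right\}\right],
\quad q_c > \beta_2.
\]

The unconditional (w.r.t. x2) density of QDF (when x ∈ Π2) can be obtained via

\[
f_Q(q) = \int_{-\infty}^{\infty} f_{Q_C}(q_c)\, f_{X_2}(x_2)\,dx_2
\]

(i.e. integrate over X2).

Note that

\[
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \sigma_{11}^{(2)} & \sigma_{12}^{(2)} \\ \sigma_{21}^{(2)} & \sigma_{22}^{(2)} \end{pmatrix}\right)
\ \Rightarrow\
X_2 \sim N\bigl(\mu_2, \sigma_{22}^{(2)}\bigr) \quad (\text{in } \Pi_2).
\]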


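Hence, P21 (the QDF misclassification error when x ∈ Π2) can be obtained via

\[
P_{21} = P(Q \ge 0 \mid x \in \Pi_2) = \int_0^{\infty} f_Q(q)\,dq
= \int_0^{\infty} \int_{-\infty}^{\infty} f_{Q_C}(q_c)\, f_{X_2}(x_2)\,dx_2\,dq_c,
\]

where $f_{Q_C}(q_c)$ is the conditional density given on the previous slide and

\[
f_{X_2}(x_2) = \bigl(2\pi\sigma_{22}^{(2)}\bigr)^{-1/2} \exp\left\{-\frac{(x_2 - \mu_2)^2}{2\sigma_{22}^{(2)}}\right\}.
\]

Note: $Q_C = \mathrm{QDF} \mid x_2, \Pi_2$.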


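By letting $u = \dfrac{q_c - \beta_2}{\sigma_Y^2}$ & $u_0 = \dfrac{-\beta_2}{\sigma_Y^2}$,

\[
P_{21} = \frac{2^{-3/2}\,\pi^{-1/2}}{\sqrt{2\pi\sigma_{22}^{(2)}}}
\int_{-\infty}^{\infty} \exp\left\{-\frac{(x_2 - \mu_2)^2}{2\sigma_{22}^{(2)}}\right\}
\int_{u_0}^{\infty} u^{-1/2} \exp\left\{-\frac{1}{2}\left(u + \frac{\mu_Y^2}{\sigma_Y^2}\right)\right\}
\left[\exp\left(\sqrt{u}\,\frac{\mu_Y}{\sigma_Y}\right) + \exp\left(-\sqrt{u}\,\frac{\mu_Y}{\sigma_Y}\right)\right] du\,dx_2,
\]

where the inner integral is

\[
\int_{u_0}^{\infty} u^{-1/2} \exp\left\{-\frac{1}{2}\left(\sqrt{u} - \frac{\mu_Y}{\sigma_Y}\right)^2\right\} du
+ \int_{u_0}^{\infty} u^{-1/2} \exp\left\{-\frac{1}{2}\left(\sqrt{u} + \frac{\mu_Y}{\sigma_Y}\right)^2\right\} du
= 2\sqrt{2\pi}\left[2 - \Phi\left(\sqrt{u_0} - \frac{\mu_Y}{\sigma_Y}\right) - \Phi\left(\sqrt{u_0} + \frac{\mu_Y}{\sigma_Y}\right)\right].
\]

Note that $u_0$ & $\mu_Y$ are functions of x2 only.

Note that this requires $\beta_2 < 0$ (so that $u_0 > 0$).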


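So,

\[
P_{21} = \bigl(2\pi\sigma_{22}^{(2)}\bigr)^{-1/2}
\int_{-\infty}^{\infty} \exp\left\{-\frac{(x_2 - \mu_2)^2}{2\sigma_{22}^{(2)}}\right\}
\left[2 - \Phi\left(\sqrt{u_0} - \frac{\mu_Y}{\sigma_Y}\right) - \Phi\left(\sqrt{u_0} + \frac{\mu_Y}{\sigma_Y}\right)\right] dx_2, \tag{12}
\]

where the bracket is taken as 1 wherever $\beta_2 \ge 0$ (the QDF is then non-negative for every x1).

Here,

\[
\mu_Y = \sqrt{\alpha_1}\left[\frac{\sigma_{12}^{(2)}}{\sigma_{22}^{(2)}}(x_2 - \mu_2) + \beta_1\right],\quad
\sigma_Y^2 = \alpha_1\left[\sigma_{11}^{(2)} - \frac{\bigl(\sigma_{12}^{(2)}\bigr)^2}{\sigma_{22}^{(2)}}\right]
\ \&\
u_0 = \frac{-\beta_2}{\sigma_Y^2},
\]

\[
\beta_1 = \alpha_2 (x_2 - \mu_2) \ \&\ \beta_2 = \alpha_3 (x_2 - \mu_2)^2 + \alpha_4,
\]

\[
\alpha_1 = \frac{\sigma_{22}^{(2)}}{D_2} - \frac{\sigma_{22}^{(1)}}{D_1},\quad
\alpha_2 = \frac{1}{\alpha_1}\left(\frac{\sigma_{12}^{(1)}}{D_1} - \frac{\sigma_{12}^{(2)}}{D_2}\right),\quad
\alpha_3 = \frac{\sigma_{11}^{(2)}}{D_2} - \frac{\sigma_{11}^{(1)}}{D_1} - \alpha_1 \alpha_2^2
\ \&\
\alpha_4 = \ln\frac{D_2}{D_1}.
\]

The expression for $(1 - P_{12})$ can be obtained in a similar manner, replacing $\sigma_{ij}^{(2)}$ by $\sigma_{ij}^{(1)}$ in P21, $\mu_Y$ and $\sigma_Y$ only.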

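The one-dimensional integral for the QDF error rate can be evaluated and cross-checked by simulation. The sketch below assumes zero means in both populations, uses the completed-square coefficients (written `a1` to `a4`, hypothetical names), conditions on x2, and integrates the conditional probability P(QDF ≥ 0 | x2) against the N(0, σ22) density with Simpson's rule. It is a consistency check of the formula against Monte Carlo, not an attempt to reproduce the figures reported in the case study, which may rest on different inputs or conventions.

```python
import math
import random

S1_HAT = [[23.2256, 5.3114], [5.3114, 15.3272]]
S2_HAT = [[0.8154, 0.3967], [0.3967, 5.9682]]


def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def qdf_coefficients(S1, S2):
    D1 = S1[0][0] * S1[1][1] - S1[0][1] ** 2
    D2 = S2[0][0] * S2[1][1] - S2[0][1] ** 2
    a1 = S2[1][1] / D2 - S1[1][1] / D1
    a2 = (S1[0][1] / D1 - S2[0][1] / D2) / a1
    a3 = S2[0][0] / D2 - S1[0][0] / D1 - a1 * a2 * a2
    return a1, a2, a3, math.log(D2 / D1)


def qdf_p21(S1, S2, n=4000, width=10.0):
    """P(QDF >= 0 | population 2) by Simpson's rule over x2 (n must be even)."""
    a1, a2, a3, a4 = qdf_coefficients(S1, S2)
    s11, s12, s22 = S2[0][0], S2[0][1], S2[1][1]
    sd_y = math.sqrt(a1 * (s11 - s12 * s12 / s22))

    def integrand(x2):
        beta2 = a3 * x2 * x2 + a4
        if beta2 >= 0.0:
            prob = 1.0                      # QDF >= 0 whatever x1 is
        else:
            delta = math.sqrt(a1) * (s12 / s22 * x2 + a2 * x2) / sd_y
            t = math.sqrt(-beta2) / sd_y    # sqrt(u0)
            prob = 2.0 - phi(t - delta) - phi(t + delta)
        density = math.exp(-0.5 * x2 * x2 / s22) / math.sqrt(2.0 * math.pi * s22)
        return density * prob

    lim = width * math.sqrt(s22)            # finite stand-in for the infinite range
    h = 2.0 * lim / n
    total = integrand(-lim) + integrand(lim)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(-lim + i * h)
    return total * h / 3.0


def qdf_p21_monte_carlo(S1, S2, n=100_000, seed=2009):
    """Fraction of simulated population-2 points that the QDF sends to population 1."""
    a1, a2, a3, a4 = qdf_coefficients(S1, S2)
    rng = random.Random(seed)
    a = math.sqrt(S2[0][0])
    b = S2[0][1] / a
    c = math.sqrt(S2[1][1] - b * b)
    hits = 0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1, x2 = a * z1, b * z1 + c * z2
        if a1 * (x1 + a2 * x2) ** 2 + a3 * x2 * x2 + a4 >= 0.0:
            hits += 1
    return hits / n


p21_integral = qdf_p21(S1_HAT, S2_HAT)
p21_simulated = qdf_p21_monte_carlo(S1_HAT, S2_HAT)
```

The integration limits of ±10 standard deviations of x2 are an arbitrary choice made for this sketch; the integrand is negligible well before that point.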

Using Triangular Distribution Approximation

Here, instead of evaluating the integral in (9) for AEDC (and that in (12) for QDF) as they are, an expression is developed as an approximation to the integral.

The process is based on the idea of approximating the normal distribution by the well-known ‘triangular’ distribution.

There is considerable literature involving the use of the triangular density in applications. The reader is referred to a paper by Scherer et al. (2003) for a complete description of this approximation, which is used extensively in ‘risk modelling’.

In its basic form, the triangular distribution approximation to the normal distribution works as follows.

Triangular Approximation…

The triangular distribution is completely characterized by three parameters: the minimum value (denoted by a), the maximum value (say, b) and the mode (say, c). We may denote a triangular distribution with these parameters using the notation ‘Tri(a, b, c)’.

If X ~ N(µ, σ), with mean µ and standard deviation σ, then it may be approximated by a symmetric triangular distribution (or tine) for which a = µ − w, b = µ + w and c = (a+b)/2 = µ, where w = σ√6 ≈ σ√(2π).

An example is shown in Figure 1.

Figure 1: A normal density with µ = 100 & σ = 20 and the associated approximating triangular density function.

Triangular Approximation… (AEDC)

The distribution function for a triangular distribution Tri(a, b, c) is given by

\[
F_X(x) =
\begin{cases}
0, & \text{if } x \le a \\[4pt]
\dfrac{(x - a)^2}{(b - a)(c - a)}, & \text{if } a < x \le c \\[8pt]
1 - \dfrac{(b - x)^2}{(b - a)(b - c)}, & \text{if } c < x \le b \\[8pt]
1, & \text{if } x > b
\end{cases} \tag{13}
\]

Using the distribution function in (13) to approximate the distribution function of N(0,1), with parameter values c = 0, a = −√(2π) & b = √(2π), we may approximate P12 in (9) as follows.

First, consider $\Phi\bigl(\gamma_{22}^{(1)} \bar{x}_2 + \gamma_{12}^{(1)} x_1\bigr) - \Phi\bigl(\gamma_{22}^{(1)} \underline{x}_2 + \gamma_{12}^{(1)} x_1\bigr) = \Phi(z_1) - \Phi(z_2)$, say.

We may approximate this by $F_X(z_1) - F_X(z_2)$, where $F_X(x)$ is given by (13).

We also need to examine the various conditions, for example $z_2 \le a$, $c < z_1 \le b$ etc., within the constraint that $0 \le x_1 \le c_3/c_1$, in order to evaluate $F_X(z_1) - F_X(z_2)$ appropriately.

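The piecewise distribution function (13) and its use as a stand-in for Φ can be sketched as follows. The half-width √(2π) for the standard normal is the value quoted above; the grid on which the two functions are compared is an arbitrary choice for this illustration, and the names `tri_cdf` and `phi` are hypothetical.

```python
import math


def tri_cdf(x, a, b, c):
    """Distribution function of Tri(a, b, c), assuming a < c < b."""
    if x <= a:
        return 0.0
    if x <= c:
        return (x - a) ** 2 / ((b - a) * (c - a))
    if x <= b:
        return 1.0 - (b - x) ** 2 / ((b - a) * (b - c))
    return 1.0


def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


W = math.sqrt(2.0 * math.pi)                     # half-width used for N(0, 1)
grid = [i / 100.0 for i in range(-400, 401)]     # z from -4 to 4 in steps of 0.01
max_gap = max(abs(tri_cdf(z, -W, W, 0.0) - phi(z)) for z in grid)


def approx_phi_difference(z1, z2):
    """Triangular approximation to phi(z1) - phi(z2)."""
    return tri_cdf(z1, -W, W, 0.0) - tri_cdf(z2, -W, W, 0.0)
```

Because `tri_cdf` is piecewise quadratic, the difference `approx_phi_difference(z1, z2)` is a polynomial in x1 on each sub-interval, which is what makes a closed-form integral possible once the cases for z1 and z2 are separated.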

After some algebra (!), we may show (9) for AEDC...

Triangular Approximation… (AEDC)

(14)

where are defined as 1 1 11 2 3, &


Here,

with c1, c2 and c3 and as defined before...

Triangular Approximation… (AEDC)(1)

(1)1 3 3221 22

2 2

c c2 2 ;

2 c c

(1)(1)1 322 1

2 222 2

cc2 ;

c c

22 2(1) (1)1 1

3 22 122

c12 c

The approximation formula for P21 can be obtained in a similar manner, replacing the population-1 quantities (superscript (1)) by their population-2 counterparts (superscript (2)) and D1 by D2.

We shall refer to these error rates as ‘triangular-approximated’ error rates.

Note here that the computation of P12 or P21 does not involve inversion of covariance matrices.

Triangular Approximation… (QDF)

To be completed…!

Using Numerical Integration

The AEDC error rates P12 (given by (9)) and P21 can be evaluated by numerical integration.

The QDF error rates P21 (given by (12)) and P12 can be evaluated by numerical integration.

The R software can be utilised.

In the AEDC case, we have a finite interval of integration, so a globally adaptive interval subdivision can be used; as with all numerical integration routines, the integral is evaluated on a finite set of points.

In the QDF case, we have an infinite interval of integration. So an ‘approximate’ interval subdivision may be used, and the integral evaluated on a finite set of points (using very large negative and very large positive limits).

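For the finite-interval AEDC case, the integration can be sketched with a fixed-step Simpson rule in place of the adaptive routine mentioned above (the talk used R; Python and Simpson's rule are substitutions made for this illustration). The sketch assumes zero means, c1 > 0 and c2 > 0, and the Cholesky-type elements γ11 = 1/√σ11, γ12 = −σ12/√(σ11 D1), γ22 = √(σ11/D1); a Monte Carlo estimate of the same probability is included as a cross-check. The value returned need not coincide with the figures reported in the case study, which may rest on different inputs or conventions.

```python
import math
import random

S1_HAT = [[23.2256, 5.3114], [5.3114, 15.3272]]
S2_HAT = [[0.8154, 0.3967], [0.3967, 5.9682]]


def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def aedc_coefficients(S1, S2):
    c1 = math.sqrt(S1[0][0]) - math.sqrt(S2[0][0])
    c2 = math.sqrt(S1[1][1]) - math.sqrt(S2[1][1])
    c3 = ((S1[0][0] - S2[0][0]) + (S1[1][1] - S2[1][1])) / math.sqrt(2.0 * math.pi)
    return c1, c2, c3


def aedc_p12(S1, S2, n=2000):
    """P(c1|x1| + c2|x2| <= c3 | population 1) by Simpson's rule (n must be even)."""
    s11, s12, s22 = S1[0][0], S1[0][1], S1[1][1]
    D1 = s11 * s22 - s12 * s12
    g11 = 1.0 / math.sqrt(s11)
    g12 = -s12 / math.sqrt(s11 * D1)
    g22 = math.sqrt(s11 / D1)
    c1, c2, c3 = aedc_coefficients(S1, S2)
    upper = c3 / c1

    def integrand(x1):
        x2_hi = (c3 - c1 * x1) / c2          # upper edge of region A; lower edge is -x2_hi
        z1 = g22 * x2_hi + g12 * x1
        z2 = -g22 * x2_hi + g12 * x1
        return (phi(z1) - phi(z2)) * math.exp(-0.5 * (g11 * x1) ** 2)

    h = upper / n
    total = integrand(0.0) + integrand(upper)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return math.sqrt(2.0) / (g22 * math.sqrt(math.pi * D1)) * total * h / 3.0


def aedc_p12_monte_carlo(S1, S2, n=100_000, seed=2009):
    """Fraction of simulated population-1 points falling inside region A."""
    c1, c2, c3 = aedc_coefficients(S1, S2)
    rng = random.Random(seed)
    a = math.sqrt(S1[0][0])
    b = S1[0][1] / a
    c = math.sqrt(S1[1][1] - b * b)
    hits = 0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if c1 * abs(a * z1) + c2 * abs(b * z1 + c * z2) <= c3:
            hits += 1
    return hits / n


p12_integral = aedc_p12(S1_HAT, S2_HAT)
p12_simulated = aedc_p12_monte_carlo(S1_HAT, S2_HAT)
```

A fixed grid is adequate here because the integrand is smooth on [0, c3/c1]; an adaptive routine would mainly matter for the QDF integral, whose integrand has a kink where the conditional probability reaches 1.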

Case Study… (Discussions)

This data consists of 89 pairs of male twins. Of the 89 pairs, 40 are dizygotic (Π1) and 49 are monozygotic (Π2).

The ‘overall’ error rates can be computed as

POverall = (40/89)*P12 + (49/89)*P21

The ‘numerical-integrated’ (NI), ‘cross-validated’ (CV) and ‘triangular-approximated’ (TR) overall error rates associated with AEDC are

POverall(NI) = 0.3023, POverall(CV) = 0.2247, POverall(TR) = 0.2959

The ‘numerical-integrated’ and ‘cross-validated’ overall error rates associated with QDF are

POverall(NI) = 0.3386, POverall(CV) = 0.2697

The P12 and P21 values are

AEDC: P12(NI) = 0.2367, P21(NI) = 0.3679

AEDC: P12(CV) = 0.4000, P21(CV) = 0.0816

AEDC: P12(TR) = 0.2316, P21(TR) = 0.3484

QDF: P12(NI) = 0.3049, P21(NI) = 0.3661

QDF: P12(CV) = 0.3750, P21(CV) = 0.1837

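The weighting in the overall error rate can be made explicit with a short calculation on the reported P12 and P21 values. The dictionary keys and the function name are hypothetical. With the weights 40/89 and 49/89, the calculation returns the quoted overall figures for AEDC (CV), AEDC (TR), QDF (NI) and QDF (CV) to four decimals. For the AEDC numerically integrated pair the same weighting gives about 0.3089, whereas the quoted figure, 0.3023, equals the unweighted mean of the two rates; which weighting was intended there is not stated.

```python
N1, N2 = 40, 49   # dizygotic and monozygotic pairs

# (P12, P21) as reported in the case study.
RATES = {
    "AEDC NI": (0.2367, 0.3679),
    "AEDC CV": (0.4000, 0.0816),
    "AEDC TR": (0.2316, 0.3484),
    "QDF NI": (0.3049, 0.3661),
    "QDF CV": (0.3750, 0.1837),
}


def overall_error(p12, p21, n1=N1, n2=N2):
    """Overall error rate weighted by the sample sizes of the two populations."""
    return (n1 * p12 + n2 * p21) / (n1 + n2)


overall = {name: round(overall_error(p12, p21), 4)
           for name, (p12, p21) in RATES.items()}
unweighted_aedc_ni = round(sum(RATES["AEDC NI"]) / 2.0, 4)
```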

Conclusions…

For the case study of twins data considered:

The ‘triangular-approximated’ overall error rate is very similar to, though lower than, the ‘numerical-integrated’ (actual) error rate for the AEDC approach.

The overall actual error rate (numerically integrated) associated with QDF is higher (by about 3.5 percentage points) than that associated with AEDC.

The cross-validated (leave-one-out) estimates of the overall error rates are lower than the above actual error rates in both the AEDC and QDF cases.

Conclusions…

We have studied the behaviour of the AEDC approach compared to the traditional QDF approach in the context of two variables for separating two populations. We

used analytical expressions for the expected error rates associated with AEDC and QDF;

used a ‘triangular approximation’ to derive the formula for the classification error rates in closed form for AEDC. (A similar approach to QDF is possible.)

In fact, the approximate formula presented here for AEDC is an extension of the formula given by Lachenbruch (1975) for the one-variable situation.

The major attraction of the ‘triangular-approximated’ approach is that the expected error rate can be derived in closed form in terms of the elements of the given covariance matrices, as opposed to relying on computer software to carry out the numerical integration, usually on a finite number of partitions.

Conclusions…

The main competitor of the AEDC approach is the well-known QDF, which is traditionally used for discriminating two populations with distinct covariance matrices.

The use of QDF is acceptable as long as the covariance matrices are non-singular. But in real-life problems, in particular in high dimensions, the variables are often correlated and hence the covariance matrix exhibits singularity.

This was the main reason for the inferior performance of the QDF when compared to AEDC in the higher dimensions, as observed by Ganesalingam et al. (2006).

AEDC, on the other hand, ignores the covariances completely and is more user friendly in terms of error rate computation.

Therefore, we recommend the use of AEDC in two-population discrimination problems with equal means but different covariance matrices.

A large-scale simulation study is still needed.

References…

Ganesalingam, S., Ganesh, S. and Nanthakumar, A. (2008) ‘Approximation for error rates associated with the discriminant function based on absolute deviation from the mean’, Journal of Statistics and Management Systems, 11(5), 861-881.

Ganesalingam, S., Nanthakumar, A. and Ganesh, S. (2006) ‘A comparison of the quadratic discriminant function with discriminant function based on the absolute deviation from the mean’, Journal of Statistics and Management Systems, 9(2), 441-457.

Ganesalingam, S. and Ganesh, S. (2004) ‘Statistical discrimination based on absolute deviation from the mean’, Journal of Statistics and Management Systems, 7(1), 25-40.

Glick, N. (1978) ‘Additive estimators for probabilities of correct classification’, Pattern Recognition, 10, 211-222.

Hand, D.J. (1986) ‘Recent advances in error rate estimation’, Pattern Recognition Letters, 4, 335-346.

Lachenbruch, P.A. (1975) ‘Zero-mean difference discrimination and the absolute linear discriminant function’, Biometrika, 62(2), 397-401.

R software (2009) http://www.r-project.org/.

Scherer, W.T., Pomeroy, T.A. and Fuller, D.N. (2003) ‘The triangular density to approximate the normal density: decision rules-of-thumb’, Reliability Engineering & System Safety, 82(3), 331-341.