Discrimination Rules in Practice


    The ML rule is used if the distribution of the data is known up to parameters.

Suppose, for example, that the data come from multivariate normal distributions $N_p(\mu_j, \Sigma)$. If we have $J$ groups with $n_j$ observations in each group, we use $\bar{x}_j$ to estimate $\mu_j$, and $S_j$ to estimate $\Sigma$. The common covariance may be estimated by

$$S_u = \sum_{j=1}^{J} n_j \frac{S_j}{n - J}, \qquad (12.9)$$

with $n = \sum_{j=1}^{J} n_j$. Thus the empirical version of the ML rule of Theorem 12.2 is to allocate a new observation $x$ to $\Pi_j$ such that $j$ minimizes

$$(x - \bar{x}_i)^\top S_u^{-1} (x - \bar{x}_i) \quad \text{for } i \in \{1, \ldots, J\}.$$
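As a minimal sketch of this empirical ML rule (assuming each group is stored as a NumPy array of shape $(n_j, p)$; the helper names `pooled_covariance` and `ml_allocate` and the simulated data are illustrative, not from the text):

```python
import numpy as np

def pooled_covariance(groups):
    """S_u = sum_j n_j * S_j / (n - J), with S_j the per-group covariance (divisor n_j)."""
    n = sum(len(X) for X in groups)
    J = len(groups)
    return sum(len(X) * np.cov(X, rowvar=False, bias=True) for X in groups) / (n - J)

def ml_allocate(x, groups):
    """Allocate x to the group whose mean minimizes (x - xbar_i)' S_u^{-1} (x - xbar_i)."""
    S_u_inv = np.linalg.inv(pooled_covariance(groups))
    dists = [(x - X.mean(axis=0)) @ S_u_inv @ (x - X.mean(axis=0)) for X in groups]
    return int(np.argmin(dists))  # 0-based index of the allocated group

# Illustrative data: two bivariate normal samples with a common covariance
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], [[1, .3], [.3, 1]], size=50)
X2 = rng.multivariate_normal([2, 1], [[1, .3], [.3, 1]], size=40)
print(ml_allocate(np.array([1.8, 0.9]), [X1, X2]))  # expected: 1 (the second group)
```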

Estimation of the probabilities of misclassification

Misclassification probabilities are given by (12.7) and can be estimated by replacing the unknown parameters by their corresponding estimators. For the ML rule for two normal populations we obtain

$$\hat{p}_{12} = \hat{p}_{21} = \Phi\!\left(-\frac{\hat{\delta}}{2}\right),$$

where $\hat{\delta}^2 = (\bar{x}_1 - \bar{x}_2)^\top S_u^{-1} (\bar{x}_1 - \bar{x}_2)$ is the estimator for $\delta^2$.

The probabilities of misclassification may also be estimated by the re-substitution method. We reclassify each original observation $x_i$, $i = 1, \ldots, n$, into $\Pi_1, \ldots, \Pi_J$ according to the chosen rule. Then, denoting by $n_{ij}$ the number of individuals coming from $\Pi_j$ which have been classified into $\Pi_i$, we obtain $\hat{p}_{ij} = n_{ij}/n_j$, an estimator of $p_{ij}$. Clearly, this method leads to estimators of $p_{ij}$ that are too optimistic, but it provides a rough measure of the quality of the discriminant rule. The matrix $(\hat{p}_{ij})$ is called the confusion matrix in Johnson and Wichern (1998).
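Both estimators can be sketched in a few lines. The following continues the previous sketch (reusing `np`, `pooled_covariance`, `ml_allocate`, `X1` and `X2`); the plug-in estimate uses $\Phi(-\hat{\delta}/2)$ via SciPy's normal cdf, and the confusion matrix is built by re-substitution. The function names are again illustrative.

```python
from scipy.stats import norm

def plugin_misclassification(X1, X2):
    """Plug-in estimate p12_hat = p21_hat = Phi(-delta_hat / 2) for two normal groups."""
    d = X1.mean(axis=0) - X2.mean(axis=0)
    delta2 = d @ np.linalg.inv(pooled_covariance([X1, X2])) @ d
    return norm.cdf(-np.sqrt(delta2) / 2)

def resubstitution_confusion(groups):
    """Confusion matrix (p_hat_ij): row i = allocated group, column j = true group."""
    J = len(groups)
    P = np.zeros((J, J))
    for j, X in enumerate(groups):
        for x in X:
            P[ml_allocate(x, groups), j] += 1
        P[:, j] /= len(X)          # p_hat_ij = n_ij / n_j
    return P

print(plugin_misclassification(X1, X2))
print(resubstitution_confusion([X1, X2]))
```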

    Fisher's linear discrimination function

Another approach stems from R. A. Fisher. His idea was to base the discriminant rule on a projection $a^\top x$ such that a good separation is achieved. This LDA projection method is called Fisher's linear discrimination function. If $Y = X a$ denotes a linear combination of observations, then the total sum of squares of $y$, $\sum_{i=1}^{n} (y_i - \bar{y})^2$, is equal to

$$Y^\top H Y = a^\top X^\top H X a = a^\top T a \qquad (12.11)$$

with the centering matrix $H = I_n - n^{-1} 1_n 1_n^\top$ and $T = X^\top H X$.

Suppose we have samples $X_j$, $j = 1, \ldots, J$, from $J$ populations. Fisher's suggestion was to find the linear combination $a^\top x$ which maximizes the ratio of the between-group sum of squares to the within-group sum of squares. The within-group sum of squares is given by

$$\sum_{j=1}^{J} Y_j^\top H_j Y_j = \sum_{j=1}^{J} a^\top X_j^\top H_j X_j a = a^\top W a, \qquad (12.12)$$

where $Y_j$ denotes the $j$-th sub-matrix of $Y$ corresponding to observations of group $j$ and $H_j$ denotes the $(n_j \times n_j)$ centering matrix. The within-group sum of squares measures the sum of variations within each group.

The between-group sum of squares is

$$\sum_{j=1}^{J} n_j (\bar{y}_j - \bar{y})^2 = \sum_{j=1}^{J} n_j \{a^\top (\bar{x}_j - \bar{x})\}^2 = a^\top B a, \qquad (12.13)$$

where $\bar{y}_j$ and $\bar{x}_j$ denote the means of $Y_j$ and $X_j$, and $\bar{y}$ and $\bar{x}$ denote the sample means of $Y$ and $X$. The between-group sum of squares measures the variation of the means across groups.

The total sum of squares (12.11) is the sum of the within-group sum of squares and the between-group sum of squares, i.e.,

$$a^\top T a = a^\top W a + a^\top B a.$$

Fisher's idea was to select a projection vector $a$ that maximizes the ratio

$$\frac{a^\top B a}{a^\top W a}. \qquad (12.14)$$
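As a quick numerical check of the decomposition $a^\top T a = a^\top W a + a^\top B a$ behind the ratio (12.14), the following sketch builds $T$, $W$ and $B$ directly from the centering-matrix definitions (simulated data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(m, 1.0, size=(n, 3)) for m, n in [(0.0, 30), (1.5, 25), (3.0, 20)]]
X = np.vstack(groups)                       # stacked data matrix (n x p)
n = len(X)

H = np.eye(n) - np.ones((n, n)) / n         # centering matrix H = I_n - n^{-1} 1 1'
T = X.T @ H @ X                             # total sum of squares, T = X' H X (12.11)

W = sum(Xj.T @ (np.eye(len(Xj)) - np.ones((len(Xj), len(Xj))) / len(Xj)) @ Xj
        for Xj in groups)                   # within-group sum of squares (12.12)

xbar = X.mean(axis=0)
B = sum(len(Xj) * np.outer(Xj.mean(axis=0) - xbar, Xj.mean(axis=0) - xbar)
        for Xj in groups)                   # between-group sum of squares (12.13)

print(np.allclose(T, W + B))                # True: T = W + B, hence a'Ta = a'Wa + a'Ba
```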

The solution is found by applying Theorem 2.5.

THEOREM 12.4 The vector $a$ that maximizes (12.14) is the eigenvector of $W^{-1} B$ that corresponds to the largest eigenvalue.

Now a discrimination rule is easy to obtain: classify $x$ into group $j$ where $a^\top \bar{x}_j$ is closest to $a^\top x$, i.e.,

$$x \to \Pi_j \quad \text{where } j = \arg\min_i |a^\top (x - \bar{x}_i)|.$$

When $J = 2$ groups, the discriminant rule is easy to compute. Suppose that group 1 has $n_1$ elements and group 2 has $n_2$ elements. In this case

$$B = \frac{n_1 n_2}{n}\, d d^\top,$$

where $d = (\bar{x}_1 - \bar{x}_2)$. $W^{-1} B$ has only one eigenvalue, which equals

$$\text{tr}(W^{-1} B) = \frac{n_1 n_2}{n}\, d^\top W^{-1} d,$$

and the corresponding eigenvector is $a = W^{-1} d$. The corresponding discriminant rule is

$$x \to \Pi_1 \ \text{if } a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} > 0, \qquad x \to \Pi_2 \ \text{if } a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} \le 0. \qquad (12.15)$$
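A minimal sketch of the two-group rule (12.15), assuming two NumPy samples and using the within-group sum-of-squares matrix $W$ for $a = W^{-1} d$ (the function name and data are illustrative):

```python
import numpy as np

def fisher_two_group_rule(X1, X2):
    """Return a classifier implementing rule (12.15): a = W^{-1} d with d = xbar1 - xbar2."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    center = lambda X: X - X.mean(axis=0)
    W = center(X1).T @ center(X1) + center(X2).T @ center(X2)   # within-group SS matrix
    a = np.linalg.solve(W, xbar1 - xbar2)                        # a = W^{-1} d
    midpoint = (xbar1 + xbar2) / 2
    return lambda x: 1 if a @ (x - midpoint) > 0 else 2          # group label 1 or 2

# Illustrative use with two simulated groups
rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=40)
X2 = rng.multivariate_normal([2, 2], np.eye(2), size=40)
classify = fisher_two_group_rule(X1, X2)
print(classify(np.array([0.2, -0.1])), classify(np.array([1.9, 2.3])))  # expected: 1 then 2
```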

The Fisher LDA is closely related to projection pursuit (Chapter 18) since the statistical technique is based on a one-dimensional index $a^\top x$.

Classification Rules


To develop a classification rule for classifying an observation $y$ into one or the other population in the two-group case requires some new notation. First, we let $f_1(y)$ and $f_2(y)$ represent the probability density functions (pdfs) associated with the random vector $Y$ for populations $\pi_1$ and $\pi_2$, respectively. We let $p_1$ and $p_2$ be the prior probabilities that $y$ is a member of $\pi_1$ and $\pi_2$, respectively, where $p_1 + p_2 = 1$. And we let $c_1 = C(2 \mid 1)$ and $c_2 = C(1 \mid 2)$ represent the costs of misclassifying an observation from $\pi_1$ into $\pi_2$, and from $\pi_2$ into $\pi_1$, respectively.

Then, assuming the pdfs $f_1(y)$ and $f_2(y)$ are known, the total probability of misclassification (TPM) is equal to $p_1$ times the probability of assigning an observation to $\pi_2$ given that it is from $\pi_1$, $P(2 \mid 1)$, plus $p_2$ times the probability that an observation is classified into $\pi_1$ given that it is from $\pi_2$, $P(1 \mid 2)$. Hence,

$$\text{TPM} = p_1 P(2 \mid 1) + p_2 P(1 \mid 2). \qquad (7.2.14)$$

The optimal error rate (OER) is the error rate that minimizes the TPM. Taking costs into account, the average or expected cost of misclassification is defined as

$$\text{ECM} = p_1 P(2 \mid 1)\, C(2 \mid 1) + p_2 P(1 \mid 2)\, C(1 \mid 2). \qquad (7.2.15)$$

A reasonable classification rule is to make the ECM as small as possible. In practice, costs of misclassification are usually unknown.

To assign an observation $y$ to $\pi_1$ or $\pi_2$, Fisher (1936) employed his LDF. To apply the rule, he assumed that $\Sigma_1 = \Sigma_2 = \Sigma$ and, because he did not assume any pdf, Fisher's rule does not require normality. He also assumed that $p_1 = p_2$ and that $C(1 \mid 2) = C(2 \mid 1)$.

Using (7.2.3), we see that $D^2 > 0$, so that $L_1 - L_2 > 0$ and $L_1 > L_2$. Hence, if

$$L = a_s^\top y = (\bar{y}_1 - \bar{y}_2)^\top S^{-1} y > \frac{L_1 + L_2}{2}, \qquad (7.2.16)$$

the observation $y$ is assigned to $\pi_1$; otherwise, $y$ is assigned to $\pi_2$.
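The following sketch puts (7.2.14)–(7.2.16) together: it implements Fisher's LDF rule with a pooled sample covariance $S$, estimates $P(2 \mid 1)$ and $P(1 \mid 2)$ by re-substitution, and evaluates TPM and ECM under equal priors and unit costs (all names and data are illustrative assumptions, not from the text):

```python
import numpy as np

def ldf_rule(Y1, Y2):
    """Fisher's LDF rule (7.2.16): assign y to pi_1 if (ybar1-ybar2)' S^{-1} y > (L1+L2)/2."""
    ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)
    n1, n2 = len(Y1), len(Y2)
    S = ((n1 - 1) * np.cov(Y1, rowvar=False) + (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S, ybar1 - ybar2)
    L1, L2 = a @ ybar1, a @ ybar2
    return lambda y: 1 if a @ y > (L1 + L2) / 2 else 2

rng = np.random.default_rng(3)
Y1 = rng.multivariate_normal([0, 0], np.eye(2), size=60)
Y2 = rng.multivariate_normal([1.5, 1.0], np.eye(2), size=60)
rule = ldf_rule(Y1, Y2)

# Re-substitution estimates of P(2|1) and P(1|2), then TPM (7.2.14) and ECM (7.2.15)
p21 = np.mean([rule(y) == 2 for y in Y1])
p12 = np.mean([rule(y) == 1 for y in Y2])
p1 = p2 = 0.5                      # equal priors, as Fisher assumed
c21 = c12 = 1.0                    # equal (unit) misclassification costs
print("TPM:", p1 * p21 + p2 * p12)
print("ECM:", p1 * p21 * c21 + p2 * p12 * c12)
```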

    Factor Analysis

    When there are many variables in a research design, it is often helpful to reduce

    the variables to a smaller set of factors. This is an independence technique, in

    which there is no dependent variable. Rather, the researcher is looking for the

    underlying structure of the data matrix. Ideally, the independent variables are

    normal and continuous, with at least 3 to 5 variables loading onto a factor. The

    sample size should be over 50 observations, with over 5 observations per

    variable.


Multicollinearity among the variables is generally desirable here, as the correlations are key to data reduction. Kaiser's Measure of Sampling Adequacy (MSA) measures the degree to which every variable can be predicted by all the other variables.

An overall MSA of .80 or higher is very good, while a value under .50 is deemed poor. There are two main factor analysis methods: common factor analysis, which extracts factors based on the variance shared by the variables, and principal component analysis, which extracts factors based on the total variance of the variables. Common factor analysis is used to look for the latent (underlying) factors, whereas principal component analysis is used to find the smallest number of components that explain the most variance.

The first factor extracted explains the most variance. Typically, factors are extracted as long as their eigenvalues are greater than 1.0, or until the scree test visually indicates how many factors to extract. The factor loadings are the correlations between the factor and the variables; typically a loading of .4 or higher is required to attribute a specific variable to a factor. An orthogonal rotation assumes no correlation between the factors, whereas an oblique rotation is used when some relationship is believed to exist.
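As a rough illustration of these conventions (a sketch only: it uses a principal-component extraction from the correlation matrix on simulated data, applies the Kaiser eigenvalue-greater-than-1 criterion, and flags loadings of .4 or higher; it is not a full common factor analysis, and all names are illustrative):

```python
import numpy as np

# Simulated data: 6 observed variables driven by 2 latent factors plus noise
rng = np.random.default_rng(4)
n = 200
latent = rng.normal(size=(n, 2))
loadings_true = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
X = latent @ loadings_true.T + 0.5 * rng.normal(size=(n, 6))

R = np.corrcoef(X, rowvar=False)                  # correlation matrix of the variables
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                 # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = int(np.sum(eigvals > 1.0))                    # Kaiser criterion: keep eigenvalues > 1.0
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])  # loadings = variable-factor correlations

print("factors retained:", k)
print("variables attributed to each factor (|loading| >= .4):")
for j in range(k):
    print(f"  factor {j + 1}:", np.where(np.abs(loadings[:, j]) >= 0.4)[0])
```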