DISCRIMINATION RULES IN PRACTICE
The ML rule is used if the distribution of the data is known up to parameters. Suppose for example that the data come from multivariate normal distributions $N_p(\mu_j, \Sigma)$. If we have $J$ groups with $n_j$ observations in each group, we use $\bar{x}_j$ to estimate $\mu_j$, and $S_j$ to estimate $\Sigma$. The common covariance may be estimated by
$$S_u = \sum_{j=1}^{J} \frac{n_j S_j}{n - J}, \qquad (12.9)$$
with $n = \sum_{j=1}^{J} n_j$. Thus the empirical version of the ML rule of Theorem 12.2 is to allocate a new observation $x$ to $\Pi_j$ such that $j$ minimizes
$$(x - \bar{x}_i)^\top S_u^{-1} (x - \bar{x}_i) \quad \text{for } i \in \{1, \ldots, J\}.$$
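A minimal NumPy sketch of this empirical ML rule (the function name and data layout are illustrative assumptions, not from the text): pool the group covariances as in (12.9), then allocate $x$ to the group whose mean is closest in the $S_u^{-1}$ metric.

```python
import numpy as np

def ml_allocate(x, groups):
    """Allocate x to the group j minimizing (x - xbar_j)' Su^{-1} (x - xbar_j).

    groups: list of (n_j x p) arrays, one array per population.
    """
    n = sum(len(X) for X in groups)
    J = len(groups)
    means = [X.mean(axis=0) for X in groups]
    # Pooled covariance Su = sum_j n_j S_j / (n - J) as in (12.9),
    # with S_j the empirical (1/n_j) within-group covariance.
    Su = sum(len(X) * np.cov(X, rowvar=False, bias=True) for X in groups) / (n - J)
    Su_inv = np.linalg.inv(Su)
    d2 = [(x - m) @ Su_inv @ (x - m) for m in means]
    return int(np.argmin(d2))  # index of the allocated group
```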
Estimation of the probabilities of misclassification

Misclassification probabilities are given by (12.7) and can be estimated by replacing the unknown parameters by their corresponding estimators. For the ML rule for two normal populations we obtain
$$\hat{p}_{12} = \hat{p}_{21} = \Phi\left(-\frac{\hat{\delta}}{2}\right),$$
where $\hat{\delta}^2 = (\bar{x}_1 - \bar{x}_2)^\top S_u^{-1} (\bar{x}_1 - \bar{x}_2)$ is the estimator for $\delta^2$.
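A small SciPy sketch of this estimator (the function name and inputs are mine; $\bar{x}_1$, $\bar{x}_2$, and $S_u$ are assumed already computed):

```python
import numpy as np
from scipy.stats import norm

def p_hat_misclass(xbar1, xbar2, Su):
    """Estimate p12_hat = p21_hat = Phi(-delta_hat / 2) for two normal groups."""
    # delta_hat^2: squared Mahalanobis distance between the group means
    delta2 = (xbar1 - xbar2) @ np.linalg.inv(Su) @ (xbar1 - xbar2)
    return norm.cdf(-0.5 * np.sqrt(delta2))
```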
The probabilities of misclassification may also be estimated by the re-substitution method. We reclassify each original observation $x_i$, $i = 1, \ldots, n$, into $\Pi_1, \ldots, \Pi_J$ according to the chosen rule. Then, denoting by $n_{ij}$ the number of individuals coming from $\Pi_j$ which have been classified into $\Pi_i$, we have $\hat{p}_{ij} = n_{ij}/n_j$, an estimator of $p_{ij}$. Clearly, this method leads to estimators of $p_{ij}$ that are too optimistic, but it provides a rough measure of the quality of the discriminant rule. The matrix $(\hat{p}_{ij})$ is called the confusion matrix in Johnson and Wichern (1998).
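A hedged sketch of the re-substitution estimate (`allocate` is any allocation rule with the signature of the `ml_allocate` sketch above; names are mine):

```python
import numpy as np

def confusion_matrix(groups, allocate):
    """Reclassify every original observation and tabulate the proportions.

    P[j, i] is p_hat_{ij} = n_ij / n_j: the share of group j's observations
    that the rule classifies into group i.
    """
    J = len(groups)
    P = np.zeros((J, J))
    for j, X in enumerate(groups):
        for x in X:
            P[j, allocate(x, groups)] += 1
        P[j] /= len(X)  # divide each count n_ij by n_j
    return P
```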
Fisher's linear discrimination function
Another approach stems from R. A. Fisher. His idea was to base the discriminant rule on a projection $a^\top x$ such that a good separation was achieved. This LDA projection method is called Fisher's linear discrimination function. If $Y = Xa$ denotes a linear combination of observations, then the total sum of squares of $y$, $\sum_{i=1}^{n}(y_i - \bar{y})^2$, is equal to
$$Y^\top H Y = a^\top X^\top H X a = a^\top T a \qquad (12.11)$$
with the centering matrix $H = I_n - n^{-1} 1_n 1_n^\top$ and $T = X^\top H X$.

Suppose we have samples $X_j$, $j = 1, \ldots, J$, from $J$ populations. Fisher's suggestion was to find the linear combination $a^\top x$ which maximizes the ratio of the between-group sum of squares to the within-group sum of squares. The within-group sum of squares is given by
$$\sum_{j=1}^{J} Y_j^\top H_j Y_j = \sum_{j=1}^{J} a^\top X_j^\top H_j X_j a = a^\top W a, \qquad (12.12)$$
where $Y_j$ denotes the $j$-th sub-matrix of $Y$ corresponding to observations of group $j$ and $H_j$ denotes the $(n_j \times n_j)$ centering matrix. The within-group sum of squares measures the sum of variations within each group.

The between-group sum of squares is
$$\sum_{j=1}^{J} n_j (\bar{y}_j - \bar{y})^2 = \sum_{j=1}^{J} n_j \{a^\top (\bar{x}_j - \bar{x})\}^2 = a^\top B a, \qquad (12.13)$$
where $\bar{y}_j$ and $\bar{x}_j$ denote the means of $Y_j$ and $X_j$ and $\bar{y}$ and $\bar{x}$ denote the sample means of $Y$ and $X$. The between-group sum of squares measures the variation of the means across groups.

The total sum of squares (12.11) is the sum of the within-group sum of squares and the between-group sum of squares, i.e.,
$$a^\top T a = a^\top W a + a^\top B a.$$
Fisher's idea was to select a projection vector $a$ that maximizes the ratio
$$\frac{a^\top B a}{a^\top W a}. \qquad (12.14)$$
The solution is found by applying Theorem 2.5.

THEOREM 12.4 The vector $a$ that maximizes (12.14) is the eigenvector of $W^{-1} B$ that corresponds to the largest eigenvalue.

Now a discrimination rule is easy to obtain: classify $x$ into group $j$ where $a^\top \bar{x}_j$ is closest to $a^\top x$, i.e.,
$$x \to \Pi_j \quad \text{where } j = \arg\min_i |a^\top (x - \bar{x}_i)|.$$
When $J = 2$ groups, the discriminant rule is easy to compute. Suppose that group 1 has $n_1$ elements and group 2 has $n_2$ elements. In this case
$$B = \left(\frac{n_1 n_2}{n}\right) d d^\top,$$
where $d = (\bar{x}_1 - \bar{x}_2)$. $W^{-1} B$ has only one eigenvalue, which equals
$$\operatorname{tr}(W^{-1} B) = \left(\frac{n_1 n_2}{n}\right) d^\top W^{-1} d,$$
and the corresponding eigenvector is $a = W^{-1} d$. The corresponding discriminant rule is
$$x \to \Pi_1 \text{ if } a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} > 0, \qquad x \to \Pi_2 \text{ if } a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} \leq 0. \qquad (12.15)$$
The Fisher LDA is closely related to projection pursuit (Chapter 18) since this statistical technique is based on a one-dimensional index $a^\top x$.
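The two-group rule (12.15) translates directly into code; the following is a minimal sketch (variable names are mine, not from the text):

```python
import numpy as np

def fisher_two_group(X1, X2):
    """Fisher's rule for J = 2: a = W^{-1} d, classify by the sign in (12.15)."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-group sum of squares W = sum_j X_j' H_j X_j
    # (scatter of each group about its own mean)
    W = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in (X1, X2))
    d = xbar1 - xbar2
    a = np.linalg.solve(W, d)        # a = W^{-1} d, eigenvector of W^{-1} B
    mid = 0.5 * (xbar1 + xbar2)
    def classify(x):
        return 1 if a @ (x - mid) > 0 else 2   # rule (12.15)
    return a, classify
```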
c. Classification Rules
To develop a classification rule for classifying an observation $y$ into one or the other population in the two-group case requires some new notation. First, we let $f_1(y)$ and $f_2(y)$ represent the probability density functions (pdfs) associated with the random vector $Y$ for populations $\pi_1$ and $\pi_2$, respectively. We let $p_1$ and $p_2$ be the prior probabilities that $y$ is a member of $\pi_1$ and $\pi_2$, respectively, where $p_1 + p_2 = 1$. And we let $c_1 = C(2 \mid 1)$ and $c_2 = C(1 \mid 2)$ represent the cost of misclassifying an observation from $\pi_1$ into $\pi_2$, and from $\pi_2$ into $\pi_1$, respectively. Then, assuming the pdfs $f_1(y)$ and $f_2(y)$ are known, the total probability of misclassification (TPM) is equal to $p_1$ times the probability of assigning an observation to $\pi_2$ given that it is from $\pi_1$, $P(2 \mid 1)$, plus $p_2$ times the probability that an observation is classified into $\pi_1$ given that it is from $\pi_2$, $P(1 \mid 2)$. Hence,
$$TPM = p_1 P(2 \mid 1) + p_2 P(1 \mid 2). \qquad (7.2.14)$$
The optimal error rate (OER) is the error rate that minimizes the TPM. Taking costs into account, the average or expected cost of misclassification is defined as
$$ECM = p_1 P(2 \mid 1) C(2 \mid 1) + p_2 P(1 \mid 2) C(1 \mid 2). \qquad (7.2.15)$$
A reasonable classification rule is to make the ECM as small as possible. In practice, costs of misclassification are usually unknown.

To assign an observation $y$ to $\pi_1$ or $\pi_2$, Fisher (1936) employed his LDF. To apply the rule, he assumed that $\Sigma_1 = \Sigma_2 = \Sigma$, and because he did not assume any pdf, Fisher's rule does not require normality. He also assumed that $p_1 = p_2$ and that $C(1 \mid 2) = C(2 \mid 1)$.

Using (7.2.3), we see that $D^2 > 0$, so that $L_1 - L_2 > 0$ and $L_1 > L_2$, where $L_i = (\bar{y}_1 - \bar{y}_2)^\top S^{-1} \bar{y}_i$ is the LDF evaluated at the $i$th sample mean. Hence, $y$ is assigned to $\pi_1$ if
$$L = \hat{a}_s^\top y = (\bar{y}_1 - \bar{y}_2)^\top S^{-1} y > \frac{L_1 + L_2}{2}, \qquad (7.2.16)$$
and to $\pi_2$ otherwise.
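As a quick numeric illustration of (7.2.14) and (7.2.15), both are simple weighted sums; the probabilities, priors, and costs below are made-up values, not from the text:

```python
# Assumed illustrative values.
p1, p2 = 0.6, 0.4                     # prior probabilities, p1 + p2 = 1
P2_given_1, P1_given_2 = 0.10, 0.15   # misclassification probabilities
C2_given_1, C1_given_2 = 5.0, 1.0     # misclassification costs

TPM = p1 * P2_given_1 + p2 * P1_given_2                             # (7.2.14)
ECM = p1 * P2_given_1 * C2_given_1 + p2 * P1_given_2 * C1_given_2   # (7.2.15)
print(TPM, ECM)  # 0.12, 0.36
```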
Factor Analysis
When there are many variables in a research design, it is often helpful to reduce
the variables to a smaller set of factors. This is an independence technique, in
which there is no dependent variable. Rather, the researcher is looking for the
underlying structure of the data matrix. Ideally, the independent variables are
normal and continuous, with at least 3 to 5 variables loading onto a factor. The
sample size should be over 50 observations, with over 5 observations per
variable.
Multicollinearity between the variables is generally desirable here, as the correlations are key to data reduction. Kaiser's Measure of Sampling Adequacy (MSA) is a measure of the degree to which every variable can be predicted by all the other variables.
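One common way to compute an overall MSA is the Kaiser–Meyer–Olkin statistic; the following is a sketch under the assumption that the standard KMO formula based on partial correlations is intended:

```python
import numpy as np

def overall_msa(R):
    """Overall KMO/MSA from a correlation matrix R: the sum of squared
    correlations divided by that sum plus the sum of squared partial
    correlations (off-diagonal terms only)."""
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    partial = -Rinv / np.outer(d, d)          # partial correlation matrix
    off = ~np.eye(len(R), dtype=bool)         # mask for off-diagonal entries
    r2 = (R[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    return r2 / (r2 + a2)
```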
An overall MSA of .80 or higher is very good, while a measure under .50 is deemed poor. There are two main factor analysis methods: common factor analysis, which extracts factors based on the variance shared by the variables, and principal component analysis, which extracts factors based on the total variance of the variables. Common factor analysis is used to look for the latent (underlying) factors, whereas principal component analysis is used to find the smallest number of factors that explain the most variance.
The first factor extracted explains the most variance. Typically, factors are
extracted as long as the eigenvalues are greater than 1.0 or the scree test visually
indicates how many factors to extract. The factor loadings are the correlations
between the factor and the variables. Typically a factor loading of .4 or higher is
required to attribute a specific variable to a factor. An orthogonal rotation assumes no correlation between the factors, whereas an oblique rotation is used when some relationship is believed to exist between them.
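A minimal sketch of principal-component extraction with the eigenvalue-greater-than-1 rule described above (the data matrix `X` and all names are illustrative; the rotation step is omitted):

```python
import numpy as np

def extract_factors(X):
    """Extract factors from the correlation matrix; keep eigenvalues > 1."""
    R = np.corrcoef(X, rowvar=False)           # correlation matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = int((eigvals > 1.0).sum())             # Kaiser criterion: eigenvalues > 1
    # Loadings: correlations between the retained factors and the variables
    loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])
    return eigvals, loadings
```

An orthogonal (e.g., varimax) rotation would then be applied to `loadings` before interpreting which variables load above the .4 threshold on each factor.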