Discrimination Rules in Practice


    The ML rule is used if the distribution of the data is known up to parameters.

Suppose, for example, that the data come from multivariate normal distributions $N_p(\mu_j, \Sigma)$. If we have $J$ groups with $n_j$ observations in each group, we use $\bar{x}_j$ to estimate $\mu_j$, and $S_j$ to estimate $\Sigma$. The common covariance may be estimated by

$$S_u = \sum_{j=1}^{J} n_j \frac{S_j}{n - J}, \qquad (12.9)$$

with $n = \sum_{j=1}^{J} n_j$. Thus the empirical version of the ML rule of Theorem 12.2 is to allocate a new observation $x$ to $\Pi_j$ such that $j$ minimizes

$$(x - \bar{x}_i)^\top S_u^{-1} (x - \bar{x}_i) \quad \text{for } i \in \{1, \ldots, J\}.$$
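As a minimal sketch of this empirical ML rule (assuming each group is stored as a NumPy array of shape $(n_j, p)$; the helper names `pooled_covariance` and `ml_allocate` and the simulated data are illustrative, not from the text):

```python
import numpy as np

def pooled_covariance(groups):
    """S_u = sum_j n_j * S_j / (n - J), with S_j the per-group covariance (divisor n_j)."""
    n = sum(len(X) for X in groups)
    J = len(groups)
    return sum(len(X) * np.cov(X, rowvar=False, bias=True) for X in groups) / (n - J)

def ml_allocate(x, groups):
    """Allocate x to the group whose mean minimizes (x - xbar_i)' S_u^{-1} (x - xbar_i)."""
    S_u_inv = np.linalg.inv(pooled_covariance(groups))
    dists = [(x - X.mean(axis=0)) @ S_u_inv @ (x - X.mean(axis=0)) for X in groups]
    return int(np.argmin(dists))  # 0-based index of the allocated group

# Illustrative data: two bivariate normal samples with a common covariance
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], [[1, .3], [.3, 1]], size=50)
X2 = rng.multivariate_normal([2, 1], [[1, .3], [.3, 1]], size=40)
print(ml_allocate(np.array([1.8, 0.9]), [X1, X2]))  # expected: 1 (the second group)
```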

Estimation of the probabilities of misclassification

Misclassification probabilities are given by (12.7) and can be estimated by replacing the unknown parameters by their corresponding estimators. For the ML rule for two normal populations we obtain

$$\hat{p}_{12} = \hat{p}_{21} = \Phi\!\left(-\frac{\hat{\delta}}{2}\right),$$

where $\hat{\delta}^2 = (\bar{x}_1 - \bar{x}_2)^\top S_u^{-1} (\bar{x}_1 - \bar{x}_2)$ is the estimator for $\delta^2$.

The probabilities of misclassification may also be estimated by the re-substitution method. We reclassify each original observation $x_i$, $i = 1, \ldots, n$, into $\Pi_1, \ldots, \Pi_J$ according to the chosen rule. Then, denoting by $n_{ij}$ the number of individuals coming from $\Pi_j$ which have been classified into $\Pi_i$, we obtain $\hat{p}_{ij} = n_{ij}/n_j$, an estimator of $p_{ij}$. Clearly, this method leads to estimators of $p_{ij}$ that are too optimistic, but it provides a rough measure of the quality of the discriminant rule. The matrix $(\hat{p}_{ij})$ is called the confusion matrix in Johnson and Wichern (1998).
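Both estimators can be sketched in a few lines. The following continues the previous sketch (reusing `np`, `pooled_covariance`, `ml_allocate`, `X1` and `X2`); the plug-in estimate uses $\Phi(-\hat{\delta}/2)$ via SciPy's normal cdf, and the confusion matrix is built by re-substitution. The function names are again illustrative.

```python
from scipy.stats import norm

def plugin_misclassification(X1, X2):
    """Plug-in estimate p12_hat = p21_hat = Phi(-delta_hat / 2) for two normal groups."""
    d = X1.mean(axis=0) - X2.mean(axis=0)
    delta2 = d @ np.linalg.inv(pooled_covariance([X1, X2])) @ d
    return norm.cdf(-np.sqrt(delta2) / 2)

def resubstitution_confusion(groups):
    """Confusion matrix (p_hat_ij): row i = allocated group, column j = true group."""
    J = len(groups)
    P = np.zeros((J, J))
    for j, X in enumerate(groups):
        for x in X:
            P[ml_allocate(x, groups), j] += 1
        P[:, j] /= len(X)          # p_hat_ij = n_ij / n_j
    return P

print(plugin_misclassification(X1, X2))
print(resubstitution_confusion([X1, X2]))
```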

    Fisher's linear discrimination function

Another approach stems from R. A. Fisher. His idea was to base the discriminant rule on a projection $a^\top x$ such that a good separation is achieved. This LDA projection method is called Fisher's linear discrimination function. If $Y = X a$ denotes a linear combination of observations, then the total sum of squares of $y$, $\sum_{i=1}^{n} (y_i - \bar{y})^2$, is equal to

$$Y^\top H Y = a^\top X^\top H X a = a^\top T a \qquad (12.11)$$

with the centering matrix $H = I_n - n^{-1} 1_n 1_n^\top$ and $T = X^\top H X$.

Suppose we have samples $X_j$, $j = 1, \ldots, J$, from $J$ populations. Fisher's suggestion was to find the linear combination $a^\top x$ which maximizes the ratio of the between-group sum of squares to the within-group sum of squares. The within-group sum of squares is given by

$$\sum_{j=1}^{J} Y_j^\top H_j Y_j = \sum_{j=1}^{J} a^\top X_j^\top H_j X_j a = a^\top W a, \qquad (12.12)$$

where $Y_j$ denotes the $j$-th sub-matrix of $Y$ corresponding to observations of group $j$ and $H_j$ denotes the $(n_j \times n_j)$ centering matrix. The within-group sum of squares measures the sum of variations within each group.

The between-group sum of squares is

$$\sum_{j=1}^{J} n_j (\bar{y}_j - \bar{y})^2 = \sum_{j=1}^{J} n_j \{a^\top (\bar{x}_j - \bar{x})\}^2 = a^\top B a, \qquad (12.13)$$

where $\bar{y}_j$ and $\bar{x}_j$ denote the means of $Y_j$ and $X_j$, and $\bar{y}$ and $\bar{x}$ denote the sample means of $Y$ and $X$. The between-group sum of squares measures the variation of the means across groups.

The total sum of squares (12.11) is the sum of the within-group sum of squares and the between-group sum of squares, i.e.,

$$a^\top T a = a^\top W a + a^\top B a.$$

Fisher's idea was to select a projection vector $a$ that maximizes the ratio

$$\frac{a^\top B a}{a^\top W a}. \qquad (12.14)$$
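As a quick numerical check of the decomposition $a^\top T a = a^\top W a + a^\top B a$ behind the ratio (12.14), the following sketch builds $T$, $W$ and $B$ directly from the centering-matrix definitions (simulated data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(m, 1.0, size=(n, 3)) for m, n in [(0.0, 30), (1.5, 25), (3.0, 20)]]
X = np.vstack(groups)                       # stacked data matrix (n x p)
n = len(X)

H = np.eye(n) - np.ones((n, n)) / n         # centering matrix H = I_n - n^{-1} 1 1'
T = X.T @ H @ X                             # total sum of squares, T = X' H X (12.11)

W = sum(Xj.T @ (np.eye(len(Xj)) - np.ones((len(Xj), len(Xj))) / len(Xj)) @ Xj
        for Xj in groups)                   # within-group sum of squares (12.12)

xbar = X.mean(axis=0)
B = sum(len(Xj) * np.outer(Xj.mean(axis=0) - xbar, Xj.mean(axis=0) - xbar)
        for Xj in groups)                   # between-group sum of squares (12.13)

print(np.allclose(T, W + B))                # True: T = W + B, hence a'Ta = a'Wa + a'Ba
```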

The solution is found by applying Theorem 2.5.

THEOREM 12.4 The vector $a$ that maximizes (12.14) is the eigenvector of $W^{-1} B$ that corresponds to the largest eigenvalue.

Now a discrimination rule is easy to obtain: classify $x$ into group $j$ where $a^\top \bar{x}_j$ is closest to $a^\top x$, i.e.,

$$x \to \Pi_j \quad \text{where } j = \arg\min_i |a^\top (x - \bar{x}_i)|.$$

When $J = 2$ groups, the discriminant rule is easy to compute. Suppose that group 1 has $n_1$ elements and group 2 has $n_2$ elements. In this case

$$B = \frac{n_1 n_2}{n}\, d d^\top,$$

where $d = (\bar{x}_1 - \bar{x}_2)$. $W^{-1} B$ has only one eigenvalue, which equals

$$\text{tr}(W^{-1} B) = \frac{n_1 n_2}{n}\, d^\top W^{-1} d,$$

and the corresponding eigenvector is $a = W^{-1} d$. The corresponding discriminant rule is

$$x \to \Pi_1 \ \text{if } a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} > 0, \qquad x \to \Pi_2 \ \text{if } a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} \le 0. \qquad (12.15)$$
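A minimal sketch of the two-group rule (12.15), assuming two NumPy samples and using the within-group sum-of-squares matrix $W$ for $a = W^{-1} d$ (the function name and data are illustrative):

```python
import numpy as np

def fisher_two_group_rule(X1, X2):
    """Return a classifier implementing rule (12.15): a = W^{-1} d with d = xbar1 - xbar2."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    center = lambda X: X - X.mean(axis=0)
    W = center(X1).T @ center(X1) + center(X2).T @ center(X2)   # within-group SS matrix
    a = np.linalg.solve(W, xbar1 - xbar2)                        # a = W^{-1} d
    midpoint = (xbar1 + xbar2) / 2
    return lambda x: 1 if a @ (x - midpoint) > 0 else 2          # group label 1 or 2

# Illustrative use with two simulated groups
rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=40)
X2 = rng.multivariate_normal([2, 2], np.eye(2), size=40)
classify = fisher_two_group_rule(X1, X2)
print(classify(np.array([0.2, -0.1])), classify(np.array([1.9, 2.3])))  # expected: 1 then 2
```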

The Fisher LDA is closely related to projection pursuit (Chapter 18) since the statistical technique is based on a one-dimensional index $a^\top x$.

Classification Rules


To develop a classification rule for classifying an observation $y$ into one or the other population in the two-group case requires some new notation. First, we let $f_1(y)$ and $f_2(y)$ represent the probability density functions (pdfs) associated with the random vector $Y$ for populations $\pi_1$ and $\pi_2$, respectively. We let $p_1$ and $p_2$ be the prior probabilities that $y$ is a member of $\pi_1$ and $\pi_2$, respectively, where $p_1 + p_2 = 1$. And we let $c_1 = C(2 \mid 1)$ and $c_2 = C(1 \mid 2)$ represent the costs of misclassifying an observation from $\pi_1$ into $\pi_2$, and from $\pi_2$ into $\pi_1$, respectively.

Then, assuming the pdfs $f_1(y)$ and $f_2(y)$ are known, the total probability of misclassification (TPM) is equal to $p_1$ times the probability of assigning an observation to $\pi_2$ given that it is from $\pi_1$, $P(2 \mid 1)$, plus $p_2$ times the probability that an observation is classified into $\pi_1$ given that it is from $\pi_2$, $P(1 \mid 2)$. Hence,

$$\text{TPM} = p_1 P(2 \mid 1) + p_2 P(1 \mid 2). \qquad (7.2.14)$$

The optimal error rate (OER) is the error rate that minimizes the TPM. Taking costs into account, the average or expected cost of misclassification is defined as

$$\text{ECM} = p_1 P(2 \mid 1)\, C(2 \mid 1) + p_2 P(1 \mid 2)\, C(1 \mid 2). \qquad (7.2.15)$$

A reasonable classification rule is to make the ECM as small as possible. In practice, costs of misclassification are usually unknown.

To assign an observation $y$ to $\pi_1$ or $\pi_2$, Fisher (1936) employed his LDF. To apply the rule, he assumed that $\Sigma_1 = \Sigma_2 = \Sigma$ and, because he did not assume any pdf, Fisher's rule does not require normality. He also assumed that $p_1 = p_2$ and that $C(1 \mid 2) = C(2 \mid 1)$.

Using (7.2.3), we see that $D^2 > 0$, so that $L_1 - L_2 > 0$ and $L_1 > L_2$. Hence, if

$$L = a_s^\top y = (\bar{y}_1 - \bar{y}_2)^\top S^{-1} y > \frac{L_1 + L_2}{2}, \qquad (7.2.16)$$

the observation $y$ is assigned to $\pi_1$; otherwise, $y$ is assigned to $\pi_2$.
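The following sketch puts (7.2.14)–(7.2.16) together: it implements Fisher's LDF rule with a pooled sample covariance $S$, estimates $P(2 \mid 1)$ and $P(1 \mid 2)$ by re-substitution, and evaluates TPM and ECM under equal priors and unit costs (all names and data are illustrative assumptions, not from the text):

```python
import numpy as np

def ldf_rule(Y1, Y2):
    """Fisher's LDF rule (7.2.16): assign y to pi_1 if (ybar1-ybar2)' S^{-1} y > (L1+L2)/2."""
    ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)
    n1, n2 = len(Y1), len(Y2)
    S = ((n1 - 1) * np.cov(Y1, rowvar=False) + (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S, ybar1 - ybar2)
    L1, L2 = a @ ybar1, a @ ybar2
    return lambda y: 1 if a @ y > (L1 + L2) / 2 else 2

rng = np.random.default_rng(3)
Y1 = rng.multivariate_normal([0, 0], np.eye(2), size=60)
Y2 = rng.multivariate_normal([1.5, 1.0], np.eye(2), size=60)
rule = ldf_rule(Y1, Y2)

# Re-substitution estimates of P(2|1) and P(1|2), then TPM (7.2.14) and ECM (7.2.15)
p21 = np.mean([rule(y) == 2 for y in Y1])
p12 = np.mean([rule(y) == 1 for y in Y2])
p1 = p2 = 0.5                      # equal priors, as Fisher assumed
c21 = c12 = 1.0                    # equal (unit) misclassification costs
print("TPM:", p1 * p21 + p2 * p12)
print("ECM:", p1 * p21 * c21 + p2 * p12 * c12)
```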

    Factor Analysis

    When there are many variables in a research design, it is often helpful to reduce

    the variables to a smaller set of factors. This is an independence technique, in

    which there is no dependent variable. Rather, the researcher is looking for the

    underlying structure of the data matrix. Ideally, the independent variables are

    normal and continuous, with at least 3 to 5 variables loading onto a factor. The

    sample size should be over 50 observations, with over 5 observations per

    variable.


Multicollinearity among the variables is generally desirable here, as the correlations are key to data reduction. Kaiser's Measure of Sampling Adequacy (MSA) measures the degree to which every variable can be predicted by all the other variables.

An overall MSA of .80 or higher is very good, while a value under .50 is deemed poor. There are two main factor analysis methods: common factor analysis, which extracts factors based on the variance shared by the variables, and principal component analysis, which extracts factors based on the total variance of the variables. Common factor analysis is used to look for the latent (underlying) factors, whereas principal component analysis is used to find the smallest number of components that explain the most variance.

The first factor extracted explains the most variance. Typically, factors are extracted as long as their eigenvalues are greater than 1.0, or until the scree test visually indicates how many factors to extract. The factor loadings are the correlations between the factor and the variables; typically a loading of .4 or higher is required to attribute a specific variable to a factor. An orthogonal rotation assumes no correlation between the factors, whereas an oblique rotation is used when some relationship is believed to exist.
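As a rough illustration of these conventions (a sketch only: it uses a principal-component extraction from the correlation matrix on simulated data, applies the Kaiser eigenvalue-greater-than-1 criterion, and flags loadings of .4 or higher; it is not a full common factor analysis, and all names are illustrative):

```python
import numpy as np

# Simulated data: 6 observed variables driven by 2 latent factors plus noise
rng = np.random.default_rng(4)
n = 200
latent = rng.normal(size=(n, 2))
loadings_true = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
X = latent @ loadings_true.T + 0.5 * rng.normal(size=(n, 6))

R = np.corrcoef(X, rowvar=False)                  # correlation matrix of the variables
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                 # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = int(np.sum(eigvals > 1.0))                    # Kaiser criterion: keep eigenvalues > 1.0
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])  # loadings = variable-factor correlations

print("factors retained:", k)
print("variables attributed to each factor (|loading| >= .4):")
for j in range(k):
    print(f"  factor {j + 1}:", np.where(np.abs(loadings[:, j]) >= 0.4)[0])
```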