
Discrimination and Classification

Nathaniel E. Helwig

Assistant Professor of Psychology and Statistics, University of Minnesota (Twin Cities)

Updated 14-Mar-2017


Copyright

Copyright © 2017 by Nathaniel E. Helwig


Outline of Notes

1) Classifying Two Populations
   Overview of Problem
   Cost of Misclassification

2) Two Multivariate Normals
   Equal Covariance
   Unequal Covariance

3) Evaluating Classifications
   Misclassification Measures
   Quality in LDA

4) Classifying g ≥ 2 Populations
   Overview of Problem
   Cost of Misclassification
   Discriminant Analysis

5) Iris Data Example
   Data Overview
   LDA Example
   QDA Example


Purpose of Discrimination and Classification

Discrimination attempts to separate distinct sets of objects, and classification attempts to allocate new objects to predefined groups.

There are two typical goals of discrimination and classification:
  1. Data description: find “discriminants” that best separate groups
  2. Data allocation: put new objects in groups via the “discriminants”

Note that goal 1 is discrimination, and goal 2 is classification/allocation.


Classifying Two Populations


The Two Population Classification Problem

Let X = (X1, . . . , Xp)′ denote a random vector, and let
  f1(x) denote the probability density function (pdf) for population π1
  f2(x) denote the probability density function (pdf) for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.


Visualizing a Classification Rule

Let Ω denote the sample space, i.e., all possible values of x, and
  R1 ⊂ Ω is the subset of Ω for which we classify x as π1
  R2 = Ω − R1 is the subset of Ω for which we classify x as π2

Figure: Figure 11.2 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.

Probability of Misclassification

The conditional probability P(2|1) of classifying an object as π2 when the object really belongs to π1 is given by

P(2|1) = P(X ∈ R2 | π1) = ∫_{R2} f1(x) dx

The conditional probability P(1|2) of classifying an object as π1 when the object really belongs to π2 is given by

P(1|2) = P(X ∈ R1 | π2) = ∫_{R1} f2(x) dx
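These two integrals are just the mass that each density places on the “wrong” region. A minimal sketch in R, assuming two hypothetical univariate normal populations and the region R2 = {x : x ≥ 1} (all values below are illustrative, not from the notes):

f1 <- function(x) dnorm(x, mean = 0, sd = 1)    # density for pi_1 (illustrative)
f2 <- function(x) dnorm(x, mean = 2, sd = 1)    # density for pi_2 (illustrative)
P21 <- integrate(f1, lower = 1, upper = Inf)$value    # P(2|1): mass of f1 over R2
P12 <- integrate(f2, lower = -Inf, upper = 1)$value   # P(1|2): mass of f2 over R1
c(P21, P12)    # both equal pnorm(-1) here, about 0.159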


Visualizing the Probability of Misclassification

Figure: Figure 11.3 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 1 variable.


Incorporating Prior Probabilities

Let p1 and p2 denote the prior probabilities that an object belongs to π1 and π2, respectively, with the constraint that p1 + p2 = 1.

The overall probabilities of the four outcomes have the form

P(correctly classify as π1) = P(X ∈ R1|π1)P(π1) = P(1|1)p1

P(correctly classify as π2) = P(X ∈ R2|π2)P(π2) = P(2|2)p2

P(misclassify π1 as π2) = P(X ∈ R2|π1)P(π1) = P(2|1)p1

P(misclassify π2 as π1) = P(X ∈ R1|π2)P(π2) = P(1|2)p2


Classification Table and Misclassification Costs

In many real world cases, costs of misclassification are not equal:
  π1 and π2 are diseased and healthy
  π1 and π2 are guilty and not guilty
  π1 and π2 are buy and not buy stock

We can make a cost matrix to tabulate our misclassification costs:

                  Classify as:
                  π1         π2
Truth:   π1       0          c(2|1)
         π2       c(1|2)     0

The expected cost of misclassification (ECM) is defined as

ECM = c(2|1)P(2|1)p1 + c(1|2)P(1|2)p2


Classification Rule (Region) Minimizing ECM

The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : f1(x)/f2(x) ≥ ( c(1|2)/c(2|1) ) ( p2/p1 )
R2 : f1(x)/f2(x) < ( c(1|2)/c(2|1) ) ( p2/p1 )

If c(1|2) = c(2|1), then we are classifying via posterior probabilities.

If c(1|2) = c(2|1) and p1 = p2, then the classification rule reduces to

R1 : f1(x)/f2(x) ≥ 1
R2 : f1(x)/f2(x) < 1
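A minimal sketch of this rule in R, using hypothetical univariate normal densities for f1 and f2 (the densities, costs, and priors below are illustrative, not from the notes):

f1 <- function(x) dnorm(x, mean = 0, sd = 1)    # density under pi_1 (illustrative)
f2 <- function(x) dnorm(x, mean = 2, sd = 1)    # density under pi_2 (illustrative)
classify_ecm <- function(x, c12 = 1, c21 = 1, p1 = 0.5, p2 = 0.5) {
  # allocate to pi_1 when the density ratio meets the cost/prior threshold
  ifelse(f1(x) / f2(x) >= (c12 / c21) * (p2 / p1), "pi_1", "pi_2")
}
classify_ecm(c(-1, 0.5, 1.5, 3))    # "pi_1" "pi_1" "pi_2" "pi_2"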


Classification with Two Multivariate Normal Populations


MVN Two Population Classification Problem

Let X = (X1, . . . , Xp)′ denote a random vector, and let
  f1(x) ∼ N(µ1, Σ) denote the pdf for population π1
  f2(x) ∼ N(µ2, Σ) denote the pdf for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.


Classification Rule Minimizing ECM

The multivariate normal densities have the form

fk(x) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −(1/2)(x − µk)′ Σ⁻¹ (x − µk) }

for k ∈ {1, 2}, which implies that

f* = f1(x)/f2(x)
   = exp{ −(1/2)(x − µ1)′ Σ⁻¹ (x − µ1) + (1/2)(x − µ2)′ Σ⁻¹ (x − µ2) }
   = exp{ (µ1 − µ2)′ Σ⁻¹ x − (1/2)(µ1 − µ2)′ Σ⁻¹ (µ1 + µ2) }

The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : log(f*) ≥ log[ ( c(1|2)/c(2|1) ) ( p2/p1 ) ]
R2 : log(f*) < log[ ( c(1|2)/c(2|1) ) ( p2/p1 ) ]


Classification Rule in Practice

The rule on the previous slide depends on the population parameters µ1, µ2, and Σ, which are often unknown in practice.

Given n1 independent observations from π1 and n2 independent observations from π2, we can estimate the needed parameters:

µ̂1 = x̄1 = (1/n1) ∑_{i=1}^{n1} xi(1)   and   µ̂2 = x̄2 = (1/n2) ∑_{i=1}^{n2} xi(2)

Σ̂ = Sp = [ ∑_{i=1}^{n1} (xi(1) − x̄1)(xi(1) − x̄1)′ + ∑_{i=1}^{n2} (xi(2) − x̄2)(xi(2) − x̄2)′ ] / (n1 + n2 − 2)

The estimated classification rule replaces f* with its sample estimate:

f̂* = exp{ (x̄1 − x̄2)′ Sp⁻¹ x − (1/2)(x̄1 − x̄2)′ Sp⁻¹ (x̄1 + x̄2) }


Classification Rule in Practice (continued)

If ν = ( c(1|2)/c(2|1) ) ( p2/p1 ) = 1, then the rule becomes

R1 : y ≥ m
R2 : y < m

where

y = a′x   and   m = (1/2)(ȳ1 + ȳ2)

with a′ = (x̄1 − x̄2)′ Sp⁻¹, ȳ1 = a′x̄1, and ȳ2 = a′x̄2.

Scale of a is not uniquely determined, so normalize a using either:
  1. a* = a/‖a‖ (unit length)
  2. a* = a/a1 (first element 1)
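A minimal sketch of this sample rule computed by hand, using two of the iris species (from the example later in the notes) as π1 and π2; the choice of species is purely illustrative:

X1 <- as.matrix(iris[iris$Species == "versicolor", 1:4])   # sample from pi_1
X2 <- as.matrix(iris[iris$Species == "virginica", 1:4])    # sample from pi_2
n1 <- nrow(X1); n2 <- nrow(X2)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance
a <- solve(Sp, xbar1 - xbar2)                # a = Sp^{-1} (xbar1 - xbar2)
m <- 0.5 * sum(a * (xbar1 + xbar2))          # midpoint of the two projected means
y2 <- as.numeric(X2 %*% a)                   # discriminant scores for the pi_2 sample
mean(y2 >= m)                                # proportion of pi_2 cases classified as pi_1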


Fisher’s Linear Discriminant Function

R. A. Fisher arrived at the decision rule on the previous slide using an entirely different argument.

Fisher considered finding the linear combination Y = a′X that best separates the groups:

separation = |ȳ1 − ȳ2| / sy

where
  ȳ1 is the mean of the Y scores for the observations from π1
  ȳ2 is the mean of the Y scores for the observations from π2
  sy² = [ ∑_{i=1}^{n1} (yi(1) − ȳ1)² + ∑_{i=1}^{n2} (yi(2) − ȳ2)² ] / (n1 + n2 − 2) is the pooled variance


Fisher’s Linear Discriminant Function (continued)

Setting a′ = (x̄1 − x̄2)′ Sp⁻¹ maximizes the separation

separation² = (ȳ1 − ȳ2)² / sy²
            = (a′x̄1 − a′x̄2)² / (a′ Sp a)
            = (a′d)² / (a′ Sp a)
            = d′ Sp⁻¹ d
            = D²

over all possible a vectors, where d = x̄1 − x̄2.


Visualizing Fisher’s Linear Discriminant Function

Figure: Figure 11.5 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.


MVN Two Population Classification Problem (Σ1 ≠ Σ2)

Let X = (X1, . . . , Xp)′ denote a random vector, and let
  f1(x) ∼ N(µ1, Σ1) denote the pdf for population π1
  f2(x) ∼ N(µ2, Σ2) denote the pdf for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.


Classification Rule Minimizing ECM (Σ1 ≠ Σ2)

The multivariate normal densities have the form

fk(x) = (2π)^(−p/2) |Σk|^(−1/2) exp{ −(1/2)(x − µk)′ Σk⁻¹ (x − µk) }

for k ∈ {1, 2}, which implies that

f* = f1(x)/f2(x)
   = ( |Σ1|/|Σ2| )^(−1/2) exp{ −(1/2)(x − µ1)′ Σ1⁻¹ (x − µ1) + (1/2)(x − µ2)′ Σ2⁻¹ (x − µ2) }

The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : log(f*) ≥ log[ ( c(1|2)/c(2|1) ) ( p2/p1 ) ]
R2 : log(f*) < log[ ( c(1|2)/c(2|1) ) ( p2/p1 ) ]


Classification Rule in Practice (Σ1 ≠ Σ2)

The rule on the previous slide depends on the population parameters µ1, µ2, Σ1, and Σ2, which are often unknown in practice.

Given n1 independent observations from π1 and n2 independent observations from π2, we can estimate the needed parameters:

µ̂1 = x̄1 = (1/n1) ∑_{i=1}^{n1} xi(1)   and   Σ̂1 = S1 = (1/(n1 − 1)) ∑_{i=1}^{n1} (xi(1) − x̄1)(xi(1) − x̄1)′

µ̂2 = x̄2 = (1/n2) ∑_{i=1}^{n2} xi(2)   and   Σ̂2 = S2 = (1/(n2 − 1)) ∑_{i=1}^{n2} (xi(2) − x̄2)(xi(2) − x̄2)′

The estimated classification rule replaces f* with its sample estimate:

f̂* = ( |S1|/|S2| )^(−1/2) exp{ −(1/2)(x − x̄1)′ S1⁻¹ (x − x̄1) + (1/2)(x − x̄2)′ S2⁻¹ (x − x̄2) }


Classification Rule in Practice (Σ1 ≠ Σ2), continued

Note that we can write

log(f̂*) = log[ ( |S1|/|S2| )^(−1/2) e^{ −(1/2)(x − x̄1)′ S1⁻¹ (x − x̄1) + (1/2)(x − x̄2)′ S2⁻¹ (x − x̄2) } ] = y − m

where

y = −(1/2) x′( S1⁻¹ − S2⁻¹ )x + ( x̄1′ S1⁻¹ − x̄2′ S2⁻¹ )x

m = (1/2) log( |S1|/|S2| ) + (1/2)( x̄1′ S1⁻¹ x̄1 − x̄2′ S2⁻¹ x̄2 )

y is a quadratic function of x, so this is a quadratic classification rule.
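A minimal sketch of this quadratic rule by hand, again using two of the iris species as π1 and π2 (illustrative): compute log(f̂*) for a new x and classify to π1 when it is at least log(ν), here with ν = 1.

X1 <- as.matrix(iris[iris$Species == "versicolor", 1:4])   # sample from pi_1
X2 <- as.matrix(iris[iris$Species == "virginica", 1:4])    # sample from pi_2
S1 <- cov(X1); S2 <- cov(X2)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
log_fstar <- function(x) {
  # -(1/2)log(|S1|/|S2|) - (1/2)(x-xbar1)'S1^{-1}(x-xbar1) + (1/2)(x-xbar2)'S2^{-1}(x-xbar2)
  -0.5 * log(det(S1) / det(S2)) -
    0.5 * sum((x - xbar1) * solve(S1, x - xbar1)) +
    0.5 * sum((x - xbar2) * solve(S2, x - xbar2))
}
x_new <- as.numeric(iris[51, 1:4])                 # a versicolor flower
ifelse(log_fstar(x_new) >= 0, "pi_1", "pi_2")      # log(nu) = 0 when nu = 1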


Caution: Quadratic Classification of Non-Normal Data

Figure: Figure 11.6 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 1 variable.

Evaluating Classification Functions


Quantifying the Quality of a Classification Rule

To determine if a classification rule is “good” we can examine the error rates, i.e., misclassification probabilities.

The population parameters are unknown in practice, so we focus on approaches that can estimate the error rates from the observed data.

We want our classification rule to cross-validate to new data, so we consider cross-validation procedures.


Total Probability of Misclassification

The Total Probability of Misclassification (TPM) is defined as

TPM(R1, R2) = p1 ∫_{R2} f1(x) dx + p2 ∫_{R1} f2(x) dx

for any classification rule (region) that partitions Ω = R1 ∪ R2.

The Optimum Error Rate (OER) is the minimum possible value of TPM,

OER = min_{R1,R2} TPM(R1, R2) subject to Ω = R1 ∪ R2,

which is obtained when R1 : f1(x)/f2(x) ≥ p2/p1 and R2 : f1(x)/f2(x) < p2/p1.

If c(1|2) = c(2|1), minimizing TPM is the same as minimizing ECM.


Actual Error Rate

The error rates on the previous slide require knowledge of the (typically unknown) parameters that define the densities f1(·) and f2(·).

Example: For LDA, calculating OER requires µ1, µ2, and Σ.

The Actual Error Rate (AER) is defined using the sample estimates

AER(R̂1, R̂2) = p1 ∫_{R̂2} f1(x) dx + p2 ∫_{R̂1} f2(x) dx

where R̂1 and R̂2 denote regions estimated from samples of sizes n1 and n2.


Apparent Error Rate

The Apparent Error Rate (APER) is an optimistic estimate of the AER: it estimates the AER using the observed (training) sample of data.

The confusion matrix for a sample of data is

                  Classified as:
                  π1        π2
Truth:   π1       nC1       nM1       n1
         π2       nM2       nC2       n2

where
  nCk is the number correctly classified in population k ∈ {1, 2}
  nM1 = n1 − nC1 is the number from π1 that are misclassified
  nM2 = n2 − nC2 is the number from π2 that are misclassified


Apparent Error Rate (continued)

Given a sample of data with confusion matrix

                  Classified as:
                  π1        π2
Truth:   π1       nC1       nM1       n1
         π2       nM2       nC2       n2

the APER is calculated as

APER = (nM1 + nM2) / (n1 + n2)

which is the total proportion of misclassified sample observations.


Leave-One-Out (Ordinary) Cross-Validation

Lachenbruch proposed a better approach to estimate the AER:

1. Population 1 (for i = 1, . . . , n1)
   (a) Hold out the i-th observation from π1 and build classification rule
   (b) Use classification rule from Step 1(a) to classify the i-th observation

2. Population 2 (for i = 1, . . . , n2)
   (a) Hold out the i-th observation from π2 and build classification rule
   (b) Use classification rule from Step 2(a) to classify the i-th observation

An (almost) unbiased estimate of the expected AER is given by

E(AER) = (n*M1 + n*M2) / (n1 + n2)

where n*M1 and n*M2 are the number of misclassified observations using the above “leave-one-out” procedure.
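A minimal sketch of the leave-one-out procedure, using MASS::lda and two of the iris species as the two populations (the species choice and the equal priors are illustrative):

library(MASS)
dat <- droplevels(subset(iris, Species != "setosa"))   # two populations: versicolor, virginica
n <- nrow(dat)
miss <- logical(n)
for (i in 1:n) {
  # hold out observation i, rebuild the rule, then classify the held-out case
  fit <- lda(Species ~ ., data = dat[-i, ], prior = c(0.5, 0.5))
  miss[i] <- predict(fit, newdata = dat[i, ])$class != dat$Species[i]
}
mean(miss)    # (n*_M1 + n*_M2) / (n1 + n2)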


Revisiting Linear Discriminant Analysis

Let X = (X1, . . . , Xp)′ denote a random vector, and let
  f1(x) ∼ N(µ1, Σ) denote the pdf for population π1
  f2(x) ∼ N(µ2, Σ) denote the pdf for population π2

Reminder: assuming that ( c(1|2)/c(2|1) ) ( p2/p1 ) = 1, the classification rule is

R1 : Y ≥ m
R2 : Y < m

where

Y = a′X   and   m = (1/2)(µY1 + µY2)

with a′ = (µ1 − µ2)′ Σ⁻¹, µY1 = a′µ1, and µY2 = a′µ2.


Revisiting Linear Discriminant Analysis (continued)

Y = a′X = (µ1 − µ2)′Σ⁻¹X is a linear function of X, so . . .
  µY1 = a′µ1 = (µ1 − µ2)′Σ⁻¹µ1
  µY2 = a′µ2 = (µ1 − µ2)′Σ⁻¹µ2
  σY² = a′Σa = (µ1 − µ2)′Σ⁻¹(µ1 − µ2) = ∆²

And since X is multivariate normal, we have that

Y ∼ N(µY1, ∆²) if from π1
Y ∼ N(µY2, ∆²) if from π2

i.e., Y is univariate normal with a population-dependent mean.


Visualizing Misclassification in LDA

Figure: Figure 11.7 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern).


Calculating Misclassification in LDA (classify π1 as π2)

Defining m = (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2), we have that

P(misclassify π1 as π2) = P(X ∈ R2 | π1) = P(2|1)
  = P(Y < m)
  = P( (Y − µY1)/σY < (m − (µ1 − µ2)′Σ⁻¹µ1)/∆ )
  = P( Z < −(1/2)∆ )
  = Φ(−∆/2)

where Φ(·) denotes the CDF of the standard normal distribution.


Calculating Misclassification in LDA (classify π2 as π1)

Defining m = (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2), we have that

P(misclassify π2 as π1) = P(X ∈ R1 | π2) = P(1|2)
  = P(Y ≥ m)
  = P( (Y − µY2)/σY ≥ (m − (µ1 − µ2)′Σ⁻¹µ2)/∆ )
  = P( Z ≥ (1/2)∆ )
  = 1 − Φ(∆/2) = Φ(−∆/2)

where Φ(·) denotes the CDF of the standard normal distribution.


Optimum Error Rate for Linear Discriminant Analysis

For the LDA classification rule, we have that

OER = min_{R1,R2} TPM(R1, R2)
    = (1/2) P(misclassify π1 as π2) + (1/2) P(misclassify π2 as π1)
    = (1/2) Φ(−∆/2) + (1/2) [1 − Φ(∆/2)]
    = Φ(−∆/2)

so the OER is a function of the effect size

∆ = √( (µ1 − µ2)′ Σ⁻¹ (µ1 − µ2) )

which is a distance measure between µ1 and µ2.
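A minimal sketch of estimating the OER by plugging sample means and the pooled covariance into ∆, using two iris species as the two populations (illustrative, and it assumes p1 = p2 = 1/2):

X1 <- as.matrix(iris[iris$Species == "versicolor", 1:4])
X2 <- as.matrix(iris[iris$Species == "virginica", 1:4])
n1 <- nrow(X1); n2 <- nrow(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance
d  <- colMeans(X1) - colMeans(X2)
Delta <- sqrt(sum(d * solve(Sp, d)))   # plug-in estimate of sqrt((mu1-mu2)' Sigma^{-1} (mu1-mu2))
pnorm(-Delta / 2)                      # estimated optimum error rate Phi(-Delta/2)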


Classifying g ≥ 2 Populations


The g Population Classification Problem

Let X = (X1, . . . , Xp)′ denote a random vector and let fk(x) denote the probability density function (pdf) for population πk, for k ∈ {1, . . . , g}.

Problem: Given a realization X = x, we want to assign x to a πk.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1, π2, . . ., or πg.


Classification Rule with g ≥ 2 Populations

Let Ω denote the sample space, i.e., all possible values of x, and
  R1 ⊂ Ω is the subset of Ω for which we classify x as π1
  R2 ⊂ Ω is the subset of Ω for which we classify x as π2
  ...
  Rg ⊂ Ω is the subset of Ω for which we classify x as πg

Ω = R1 ∪ R2 ∪ · · · ∪ Rg and Rk ∩ Rℓ = ∅ for all k ≠ ℓ.
  The classification rule partitions the sample space
  The classification regions are mutually exclusive


Visualizing a Classification Rule: g = 3 Populations

Figure: Figure 11.10 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.


Probability and Cost of Misclassification

The conditional probability P(ℓ|k) of classifying an object as πℓ when the object really belongs to πk is given by

P(ℓ|k) = P(X ∈ Rℓ | πk) = ∫_{Rℓ} fk(x) dx

for all k ≠ ℓ with k, ℓ ∈ {1, . . . , g}.

Note that P(k|k) = 1 − ∑_{ℓ≠k} P(ℓ|k) by definition.

Let c(ℓ|k) denote the cost of allocating an object to πℓ when the object really belongs to πk, and let pk denote the prior probability of πk.


Expected Cost of Misclassification (revisited)

The conditional expected cost of misclassifying an object from πk is

ECM(k) = ∑_{ℓ≠k} P(ℓ|k) c(ℓ|k)

Incorporating the prior probabilities, the overall ECM is given by

ECM = ∑_{k=1}^{g} pk ECM(k) = ∑_{k=1}^{g} pk ∑_{ℓ≠k} P(ℓ|k) c(ℓ|k)


Minimum ECM Classification Rule

The classification regions R1, R2, . . . , Rg that minimize the ECM are defined by allocating X = x to the population πk that minimizes

∑_{ℓ≠k} pℓ fℓ(x) c(k|ℓ)

To understand the logic of the classification rule, suppose that we have equal costs, i.e., c(ℓ|k) = c(k|ℓ) = 1 for all k, ℓ ∈ {1, . . . , g}:
  We allocate x to the population πk that minimizes ∑_{ℓ≠k} pℓ fℓ(x)
  Minimizing ∑_{ℓ≠k} pℓ fℓ(x) is the same as maximizing pk fk(x)
  Allocate x to population πk if pk fk(x) > pℓ fℓ(x) for all ℓ ≠ k
  This is equivalent to maximizing the posterior probability P(πk | x)
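A minimal sketch of the equal-cost rule (allocate to the population with the largest pk fk(x)), using hypothetical univariate normal populations and priors (all values below are illustrative):

mus   <- c(0, 2, 5)          # population means (illustrative)
sds   <- c(1, 1, 2)          # population standard deviations (illustrative)
prior <- c(0.5, 0.3, 0.2)    # prior probabilities p_k (illustrative)
classify_min_ecm <- function(x) {
  pf <- sapply(1:3, function(k) prior[k] * dnorm(x, mus[k], sds[k]))   # p_k f_k(x)
  which.max(pf)              # index of the population maximizing p_k f_k(x)
}
sapply(c(-1, 1.5, 6), classify_min_ecm)    # allocate each value of x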


Overview of Fisher’s Approach

Fisher developed his discriminant analysis for g > 2 populations.

Idea: find a small number of linear combinations (e.g., a1′x, a2′x, a3′x) that best separate the groups.

Offers a simple and useful procedure for classification, which also provides nice visualizations.

Plot the linear combinations to visualize the discriminants


Assumptions of Fisher’s Discriminant Analysis

Let X = (X1, . . . , Xp)′ denote a random vector and let fk(x) ∼ (µk, Σ) denote the pdf for population πk.
  Note the homogeneity of covariance matrix assumption
  Do not need the multivariate normality assumption

Let µ̄ = (1/g) ∑_{k=1}^{g} µk denote the mean of the combined populations, and

Bµ = ∑_{k=1}^{g} (µk − µ̄)(µk − µ̄)′

denote the “Between” sum-of-squares and crossproducts (SSCP) matrix.


Properties of a Linear Combination

Define new variable Y = a′X which has properties

E(Y | πk) = a′E(X | πk) = a′µk
V(Y | πk) = a′V(X | πk)a = a′Σa

and note that the overall mean of Y has the form

µ̄Y = (1/g) ∑_{k=1}^{g} µYk = (1/g) ∑_{k=1}^{g} a′µk = a′µ̄


Between versus Within Group Variability

Form the ratio of the between group separation over the variance of Y:

F* = [ ∑_{k=1}^{g} (µYk − µ̄Y)² ] / σY²
   = [ ∑_{k=1}^{g} (a′µk − a′µ̄)² ] / (a′Σa)
   = a′[ ∑_{k=1}^{g} (µk − µ̄)(µk − µ̄)′ ]a / (a′Σa)
   = (a′Bµa) / (a′Σa)

Note that higher F* values relate to more separation between groups.


Population Discriminants

The k-th population discriminant is the linear combination

Yk = ak′X

where ak is proportional to the k-th eigenvector of Σ⁻¹Bµ, for k = 1, . . . , s with s = min(g − 1, p).

The ak are scaled to make the Yk have unit variance, i.e., ak′Σak = 1, and ak′Σaℓ = 0 for k ≠ ℓ.

Note that this is only useful if we somehow know the true population parameters µ1, . . . , µg and Σ.


Sample Discriminants

The sample estimated “Between” and “Within” SSCP matrices are

B = ∑_{k=1}^{g} (x̄k − x̄)(x̄k − x̄)′   and   W = ∑_{k=1}^{g} ∑_{i=1}^{nk} (xi(k) − x̄k)(xi(k) − x̄k)′

where x̄k = (1/nk) ∑_{i=1}^{nk} xi(k) and x̄ = (1/g) ∑_{k=1}^{g} x̄k.

The k-th sample discriminant is the linear combination

Ŷk = âk′X

where âk is proportional to the k-th eigenvector of W⁻¹B.

The âk are scaled to make the Ŷk have unit variance, i.e., âk′Σ̂âk = 1, where Σ̂ = Sp = W/(n − g) with n = ∑_{k=1}^{g} nk.
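A minimal sketch of these sample discriminants computed by hand for the iris data (illustrative; the comparison to the MASS::lda scaling assumes the balanced iris design, where the unweighted and weighted group means coincide):

X <- as.matrix(iris[, 1:4]); grp <- iris$Species
xbark <- apply(X, 2, tapply, grp, mean)            # g x p matrix of group means
xbar  <- colMeans(xbark)                           # unweighted mean of the group means
B <- crossprod(sweep(xbark, 2, xbar))              # "Between" SSCP matrix
W <- Reduce(`+`, lapply(levels(grp), function(k) {
  Xk <- X[grp == k, , drop = FALSE]
  crossprod(sweep(Xk, 2, colMeans(Xk)))            # within-group SSCP for group k
}))
a1 <- Re(eigen(solve(W) %*% B)$vectors[, 1])       # first eigenvector of W^{-1} B
Sp <- W / (nrow(X) - nlevels(grp))                 # pooled covariance estimate
a1 / sqrt(sum(a1 * (Sp %*% a1)))                   # rescaled so a1' Sp a1 = 1; should match
                                                   # lda(Species ~ ., iris)$scaling[, "LD1"] up to sign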


Properties of Population Discriminants

Let Y = A′X where A = [a1, . . . , as].
  Y = (Y1, . . . , Ys)′ contains the s discriminants
  Columns of A contain the linear combination weights

The mean of Y is given by

E(Y | πk) = A′E(X | πk) = A′µk = µkY

and the covariance matrix for Y is

Cov(Y) = A′Cov(X | πk)A = A′ΣA = Is

because the discriminants have unit variance and are uncorrelated.
  Remember: ak′Σaℓ = δkℓ where δkℓ is Kronecker's δ


Classifying New Objects with Discriminants

Given a realization X = x, define y = A′x and calculate the distance between the observed y = (y1, . . . , ys)′ and the k-th population mean:

Dk = (y − µkY)′(y − µkY) = ∑_{ℓ=1}^{s} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{s} [ aℓ′(x − µk) ]²

where µkY = A′µk, yℓ = aℓ′x, and µkYℓ = aℓ′µk.

To build a distance using r ≤ s discriminants, use

Dk^(r) = ∑_{ℓ=1}^{r} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{r} [ aℓ′(x − µk) ]²

and classify x to the population πk that minimizes the distance Dk^(r).


Classifying New Objects with Sample Discriminants

Given a realization X = x, define y = A′x and calculate the distance between the observed y = (y1, . . . , ys)′ and the k-th sample mean:

Dk = (y − µkY)′(y − µkY) = ∑_{ℓ=1}^{s} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{s} [ aℓ′(x − x̄k) ]²

where µkY = A′x̄k, yℓ = aℓ′x, and µkYℓ = aℓ′x̄k (with A and the aℓ now the sample discriminant weights).

To build a distance using r ≤ s discriminants, use

Dk^(r) = ∑_{ℓ=1}^{r} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{r} [ aℓ′(x − x̄k) ]²

and classify x to the population πk that minimizes the distance Dk^(r).
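A minimal sketch of this distance-based allocation for the iris data, using the scalings returned by MASS::lda as the sample discriminant weights (illustrative):

library(MASS)
ldamod <- lda(Species ~ ., data = iris, prior = rep(1/3, 3))
A <- ldamod$scaling                            # columns hold a_1, ..., a_s
muY <- ldamod$means %*% A                      # projected group means (one row per group)
classify_dist <- function(x, r = ncol(A)) {
  y <- drop(crossprod(A[, 1:r, drop = FALSE], x))          # y_l = a_l' x
  d <- rowSums((muY[, 1:r, drop = FALSE] -
                matrix(y, nrow(muY), r, byrow = TRUE))^2)  # D_k^(r) for each group
  names(which.min(d))                          # group with the smallest distance
}
classify_dist(as.numeric(iris[1, 1:4]))        # allocate the first flower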


Relation to MVN Classification Problem

Let X = (X1, . . . , Xp)′ be a random vector and let fk(x) ∼ N(µk, Σk) denote the pdf for population πk.

Assuming equal misclassification costs, we allocate X = x to the population πk that minimizes ∑_{ℓ≠k} pℓ fℓ(x) ⇐⇒ maximizes pk fk(x).

Equivalent to allocating X = x to the population πk that maximizes

dQk(x) = Quadratic discriminant score
       = −(1/2) ln(|Σk|) − (1/2)(x − µk)′ Σk⁻¹ (x − µk) + ln(pk)

dLk(x) = Linear discriminant score
       = µk′ Σ⁻¹ x − (1/2) µk′ Σ⁻¹ µk + ln(pk)

where dLk is used when Σk = Σ for all k ∈ {1, . . . , g}.
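A minimal sketch of the linear discriminant score dLk(x), with sample plug-ins for µk and Σ computed from the iris data and equal priors (illustrative):

X <- as.matrix(iris[, 1:4]); grp <- iris$Species
means <- apply(X, 2, tapply, grp, mean)                   # group means, one row per group
Sp <- Reduce(`+`, lapply(levels(grp), function(k) {
  Xk <- X[grp == k, , drop = FALSE]
  (nrow(Xk) - 1) * cov(Xk)
})) / (nrow(X) - nlevels(grp))                            # pooled covariance estimate
d_lin <- function(x, prior = rep(1/3, 3)) {
  sapply(1:nrow(means), function(k) {
    mk <- means[k, ]
    sum(mk * solve(Sp, x)) - 0.5 * sum(mk * solve(Sp, mk)) + log(prior[k])
  })
}
which.max(d_lin(X[51, ]))    # allocate flower 51 to the score-maximizing population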


Relation to MVN Classification Problem (continued)

If we assume that pk = 1/g for all k ∈ {1, . . . , g}, then

dLk(x) = µk′ Σ⁻¹ x − (1/2) µk′ Σ⁻¹ µk

Define the linear combination yj = aj′x, where aj = Σ^(−1/2) vj with vj denoting the j-th eigenvector of Σ^(−1/2) Bµ Σ^(−1/2). Then

Dk = ∑_{j=1}^{p} (yj − µkYj)² = ∑_{j=1}^{p} [ aj′(x − µk) ]² = (x − µk)′ Σ⁻¹ (x − µk)
   = −2 dLk(x) + α

where α = x′Σ⁻¹x is constant across populations.

If rank(Bµ) = r, allocating to the population πk that maximizes dLk(x) is equivalent to allocating to the population πk that minimizes Dk^(r).


Fisher’s Iris Data Example

Fisher’s Iris Data Example

Nathaniel E. Helwig (U of Minnesota) Discrimination and Classification Updated 14-Mar-2017 : Slide 57

Page 58: Nathaniel E. Helwig

Fisher’s Iris Data Example Data Overview

Fisher’s (or Anderson’s) Famous Iris Data

R. A. Fisher published the LDA approach in 1936 and used Edgar Anderson's iris flower dataset as an example.

The dataset consists of measurements of p = 4 variables taken from nk = 50 flowers randomly sampled from each of g = 3 species.

Variables: Sepal Length, Sepal Width, Petal Length, Petal Width
Species: setosa, versicolor, virginica

The goal was/is to build a linear discriminant function that best classifies a new flower into one of the three species.


Fisher’s Famous Iris Data in R

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> colMeans(iris[iris$Species=="setosa",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
> colMeans(iris[iris$Species=="versicolor",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
> colMeans(iris[iris$Species=="virginica",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
> p <- 4L
> g <- 3L


Make Pooled Covariance Matrix

# make pooled covariance matrix
> Sp <- matrix(0, p, p)
> nx <- rep(0, g)
> lev <- levels(iris$Species)
> for(k in 1:g){
+   x <- iris[iris$Species==lev[k],1:p]
+   nx[k] <- nrow(x)
+   Sp <- Sp + cov(x) * (nx[k] - 1)
+ }
> Sp <- Sp / (sum(nx) - g)
> round(Sp, 3)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        0.265       0.093        0.168       0.038
Sepal.Width         0.093       0.115        0.055       0.033
Petal.Length        0.168       0.055        0.185       0.043
Petal.Width         0.038       0.033        0.043       0.042


LDA in R via the lda Function (MASS Package)

# fit lda model
> library(MASS)
> ldamod <- lda(Species ~ ., data=iris, prior=rep(1/3, 3))

# check the LDA coefficients/scalings
> ldamod$scaling
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785
> crossprod(ldamod$scaling, Sp) %*% ldamod$scaling
             LD1          LD2
LD1 1.00000e+00 -7.21645e-16
LD2 -7.21645e-16  1.00000e+00

# create the (centered) discriminant scores
> mu.k <- ldamod$means
> mu <- colMeans(mu.k)
> dscores <- scale(iris[,1:p], center=mu, scale=F) %*% ldamod$scaling
> sum((dscores - predict(ldamod)$x)^2)
[1] 1.658958e-28


Plot LDA Results: Score and Coefficients

[Figure: two panels. Left panel "Discriminant Scores" plots LD2 against LD1 for the three species (setosa, versicolor, virginica). Right panel "Discriminant Coefficients" plots the LD1 and LD2 coefficients for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.]

R code for left plot:
plot(dscores, xlab="LD1", ylab="LD2", pch=spid, col=spid,
     main="Discriminant Scores", xlim=c(-10, 10), ylim=c(-3, 3))
legend("top", lev, pch=1:3, col=1:3, bty="n")


Plot LDA Results: Discriminant Partitions

[Figure: klaR partition plot of the LDA rule in the discriminant space (LD2 versus LD1), with the three species plotted as s, c, and v; app. error rate: 0.02]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=dscores[,2:1], grouping=species, method="lda")


Plot LDA Results: All Pairwise Partitions

[Figure: klaR partition plots of the LDA rule for all pairwise combinations of the four variables, with the three species plotted as s, c, and v. App. error rates: Sepal.Width vs Sepal.Length 0.2, Petal.Length vs Sepal.Length 0.033, Petal.Length vs Sepal.Width 0.047, Petal.Width vs Sepal.Length 0.04, Petal.Width vs Sepal.Width 0.033, Petal.Width vs Petal.Length 0.04]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=iris[,1:4], grouping=species, method="lda")


APER and Expected AER

# make confusion matrix (and APER)
> confusion <- table(iris$Species, predict(ldamod)$class)
> confusion
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> n <- sum(confusion)
> aper <- (n - sum(diag(confusion))) / n
> aper
[1] 0.02

# use CV to get expected AER
> ldamodCV <- lda(Species ~ ., data=iris, prior=rep(1/3, 3), CV=TRUE)
> confusionCV <- table(iris$Species, ldamodCV$class)
> confusionCV
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> eaer <- (n - sum(diag(confusionCV))) / n
> eaer
[1] 0.02


Split Data into Training (70%) and Testing (30%) Sets

> # split into separate matrices for each flower
> Xs <- subset(iris, Species=="setosa")
> Xc <- subset(iris, Species=="versicolor")
> Xv <- subset(iris, Species=="virginica")

# split into training and testing
> set.seed(1)
> sid <- sample.int(n=50, size=35)
> cid <- sample.int(n=50, size=35)
> vid <- sample.int(n=50, size=35)
> Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
> Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])

# fit lda to training and evaluate on testing
> ldatrain <- lda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
> confusionTest <- table(Xtest$Species, predict(ldatrain, newdata=Xtest)$class)
> confusionTest
             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          1        14
> n <- sum(confusionTest)
> aer <- (n - sum(diag(confusionTest))) / n
> aer
[1] 0.02222222


Two-Fold CV with 100 Random 70/30 Splits

> nrep <- 100
> aer <- rep(0, nrep)
> set.seed(1)
> for(k in 1:nrep){
+   sid <- sample.int(n=50, size=35)
+   cid <- sample.int(n=50, size=35)
+   vid <- sample.int(n=50, size=35)
+   Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
+   Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])
+   ldatrain <- lda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
+   confusionTest <- table(Xtest$Species, predict(ldatrain, newdata=Xtest)$class)
+   confusionTest
+   n <- sum(confusionTest)
+   aer[k] <- (n - sum(diag(confusionTest))) / n
+ }
> mean(aer)
[1] 0.022


QDA in R via the qda Function (MASS Package)

# fit qda model
> library(MASS)
> qdamod <- qda(Species ~ ., data=iris, prior=rep(1/3, 3))
> names(qdamod)
 [1] "prior"   "counts"  "means"   "scaling" "ldet"    "lev"     "N"
 [8] "call"    "terms"   "xlevels"

# check the QDA coefficients/scalings
> dim(qdamod$scaling)
[1] 4 4 3
> round(crossprod(qdamod$scaling[,,1], cov(Xs[,1:p])) %*% qdamod$scaling[,,1], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> round(crossprod(qdamod$scaling[,,2], cov(Xc[,1:p])) %*% qdamod$scaling[,,2], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> round(crossprod(qdamod$scaling[,,3], cov(Xv[,1:p])) %*% qdamod$scaling[,,3], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1


Plot QDA Results: All Pairwise Partitions

[Figure: klaR partition plots of the QDA rule for all pairwise combinations of the four variables, with the three species plotted as s, c, and v. App. error rates: Sepal.Width vs Sepal.Length 0.2, Petal.Length vs Sepal.Length 0.04, Petal.Length vs Sepal.Width 0.047, Petal.Width vs Sepal.Length 0.033, Petal.Width vs Sepal.Width 0.047, Petal.Width vs Petal.Length 0.02]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=iris[,1:4], grouping=species, method="qda")


APER and Expected AER

# make confusion matrix (and APER)
> confusion <- table(iris$Species, predict(qdamod)$class)
> confusion
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> n <- sum(confusion)
> aper <- (n - sum(diag(confusion))) / n
> aper
[1] 0.02

# use CV to get expected AER
> qdamodCV <- qda(Species ~ ., data=iris, prior=rep(1/3, 3), CV=TRUE)
> confusionCV <- table(iris$Species, qdamodCV$class)
> confusionCV
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          1        49
> eaer <- (n - sum(diag(confusionCV))) / n
> eaer
[1] 0.02666667


Split Data into Training (70%) and Testing (30%) Sets

> # split into separate matrices for each flower
> Xs <- subset(iris, Species=="setosa")
> Xc <- subset(iris, Species=="versicolor")
> Xv <- subset(iris, Species=="virginica")

> # split into training and testing
> set.seed(1)
> sid <- sample.int(n=50, size=35)
> cid <- sample.int(n=50, size=35)
> vid <- sample.int(n=50, size=35)
> Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
> Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])

# fit qda to training and evaluate on testing
> qdatrain <- qda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
> confusionTest <- table(Xtest$Species, predict(qdatrain, newdata=Xtest)$class)
> confusionTest
             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          1        14
> n <- sum(confusionTest)
> aer <- (n - sum(diag(confusionTest))) / n
> aer
[1] 0.02222222


Two-Fold CV with 100 Random 70/30 Splits

> nrep <- 100
> aer <- rep(0, nrep)
> set.seed(1)
> for(k in 1:nrep){
+   sid <- sample.int(n=50, size=35)
+   cid <- sample.int(n=50, size=35)
+   vid <- sample.int(n=50, size=35)
+   Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
+   Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])
+   qdatrain <- qda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
+   confusionTest <- table(Xtest$Species, predict(qdatrain, newdata=Xtest)$class)
+   confusionTest
+   n <- sum(confusionTest)
+   aer[k] <- (n - sum(diag(confusionTest))) / n
+ }
> mean(aer)
[1] 0.02466667


Plot LDA and QDA Results using PCA

[Figure: two scatterplots of the first two principal component scores (PC2 versus PC1). Left panel "LDA Results" and right panel "QDA Results" mark each flower by its predicted class (setosa, versicolor, virginica).]


Plot LDA and QDA Results using PCA (R code)

R code for plot on previous slide:

# visualize LDA and QDA results via PCA
ldaid <- as.integer(predict(ldamod)$class)
qdaid <- as.integer(predict(qdamod)$class)
pcamod <- princomp(iris[,1:4])
dev.new(width=10, height=5, noRStudioGD=TRUE)
par(mfrow=c(1,2))
plot(pcamod$scores[,1:2], xlab="PC1", ylab="PC2", pch=ldaid, col=ldaid,
     main="LDA Results", xlim=c(-4, 4), ylim=c(-2, 2))
legend("topright", lev, pch=1:3, col=1:3, bty="n")
abline(h=0, lty=3)
abline(v=0, lty=3)
plot(pcamod$scores[,1:2], xlab="PC1", ylab="PC2", pch=qdaid, col=qdaid,
     main="QDA Results", xlim=c(-4, 4), ylim=c(-2, 2))
legend("topright", lev, pch=1:3, col=1:3, bty="n")
abline(h=0, lty=3)
abline(v=0, lty=3)
