
Discrimination and Classification

Nathaniel E. Helwig

Assistant Professor of Psychology and Statistics, University of Minnesota (Twin Cities)

Updated 14-Mar-2017


Copyright

Copyright © 2017 by Nathaniel E. Helwig


Outline of Notes

1) Classifying Two Populations
   Overview of Problem
   Cost of Misclassification

2) Two Multivariate Normals
   Equal Covariance
   Unequal Covariance

3) Evaluating Classifications
   Misclassification Measures
   Quality in LDA

4) Classifying g ≥ 2 Populations
   Overview of Problem
   Cost of Misclassification
   Discriminant Analysis

5) Iris Data Example
   Data Overview
   LDA Example
   QDA Example


Purpose of Discrimination and Classification

Discrimination attempts to separate distinct sets of objects, and classification attempts to allocate new objects to predefined groups.

There are two typical goals of discrimination and classification:
  1. Data description: find “discriminants” that best separate groups
  2. Data allocation: put new objects in groups via the “discriminants”

Note that goal 1 is discrimination, and goal 2 is classification/allocation.


Classifying Two Populations


The Two Population Classification Problem

Let X = (X1, . . . , Xp)′ denote a random vector, and let
  f1(x) denote the probability density function (pdf) for population π1
  f2(x) denote the probability density function (pdf) for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.


Visualizing a Classification Rule

Let Ω denote the sample space, i.e., all possible values of x, and
  R1 ⊂ Ω is the subset of Ω for which we classify x as π1
  R2 = Ω − R1 is the subset of Ω for which we classify x as π2

Figure: Figure 11.2 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.

Probability of Misclassification

The conditional probability P(2|1) of classifying an object as π2 when the object really belongs to π1 is given by

P(2|1) = P(X ∈ R2 | π1) = ∫_{R2} f1(x) dx

The conditional probability P(1|2) of classifying an object as π1 when the object really belongs to π2 is given by

P(1|2) = P(X ∈ R1 | π2) = ∫_{R1} f2(x) dx
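These two integrals are just the mass that each density places on the “wrong” region. A minimal sketch in R, assuming two hypothetical univariate normal populations and the region R2 = {x : x ≥ 1} (all values below are illustrative, not from the notes):

f1 <- function(x) dnorm(x, mean = 0, sd = 1)    # density for pi_1 (illustrative)
f2 <- function(x) dnorm(x, mean = 2, sd = 1)    # density for pi_2 (illustrative)
P21 <- integrate(f1, lower = 1, upper = Inf)$value    # P(2|1): mass of f1 over R2
P12 <- integrate(f2, lower = -Inf, upper = 1)$value   # P(1|2): mass of f2 over R1
c(P21, P12)    # both equal pnorm(-1) here, about 0.159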


Visualizing the Probability of Misclassification

Figure: Figure 11.3 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 1 variable.


Incorporating Prior Probabilities

Let p1 and p2 denote the prior probabilities that an object belongs to π1 and π2, respectively, with the constraint that p1 + p2 = 1.

The overall probabilities of the four outcomes have the form

P(correctly classify as π1) = P(X ∈ R1|π1)P(π1) = P(1|1)p1

P(correctly classify as π2) = P(X ∈ R2|π2)P(π2) = P(2|2)p2

P(misclassify π1 as π2) = P(X ∈ R2|π1)P(π1) = P(2|1)p1

P(misclassify π2 as π1) = P(X ∈ R1|π2)P(π2) = P(1|2)p2


Classification Table and Misclassification Costs

In many real world cases, costs of misclassification are not equal:
  π1 and π2 are diseased and healthy
  π1 and π2 are guilty and not guilty
  π1 and π2 are buy and not buy stock

We can make a cost matrix to tabulate our misclassification costs:

                  Classify as:
                  π1         π2
Truth:   π1       0          c(2|1)
         π2       c(1|2)     0

The expected cost of misclassification (ECM) is defined as

ECM = c(2|1)P(2|1)p1 + c(1|2)P(1|2)p2


Classification Rule (Region) Minimizing ECM

The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : f1(x)/f2(x) ≥ ( c(1|2)/c(2|1) ) ( p2/p1 )
R2 : f1(x)/f2(x) < ( c(1|2)/c(2|1) ) ( p2/p1 )

If c(1|2) = c(2|1), then we are classifying via posterior probabilities.

If c(1|2) = c(2|1) and p1 = p2, then the classification rule reduces to

R1 : f1(x)/f2(x) ≥ 1
R2 : f1(x)/f2(x) < 1
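A minimal sketch of this rule in R, using hypothetical univariate normal densities for f1 and f2 (the densities, costs, and priors below are illustrative, not from the notes):

f1 <- function(x) dnorm(x, mean = 0, sd = 1)    # density under pi_1 (illustrative)
f2 <- function(x) dnorm(x, mean = 2, sd = 1)    # density under pi_2 (illustrative)
classify_ecm <- function(x, c12 = 1, c21 = 1, p1 = 0.5, p2 = 0.5) {
  # allocate to pi_1 when the density ratio meets the cost/prior threshold
  ifelse(f1(x) / f2(x) >= (c12 / c21) * (p2 / p1), "pi_1", "pi_2")
}
classify_ecm(c(-1, 0.5, 1.5, 3))    # "pi_1" "pi_1" "pi_2" "pi_2"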


Classification with Two Multivariate Normal Populations


MVN Two Population Classification Problem

Let X = (X1, . . . , Xp)′ denote a random vector, and let
  f1(x) ∼ N(µ1, Σ) denote the pdf for population π1
  f2(x) ∼ N(µ2, Σ) denote the pdf for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.


Classification Rule Minimizing ECM

The multivariate normal densities have the form

fk(x) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −(1/2)(x − µk)′ Σ⁻¹ (x − µk) }

for k ∈ {1, 2}, which implies that

f* = f1(x)/f2(x)
   = exp{ −(1/2)(x − µ1)′ Σ⁻¹ (x − µ1) + (1/2)(x − µ2)′ Σ⁻¹ (x − µ2) }
   = exp{ (µ1 − µ2)′ Σ⁻¹ x − (1/2)(µ1 − µ2)′ Σ⁻¹ (µ1 + µ2) }

The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : log(f*) ≥ log[ ( c(1|2)/c(2|1) ) ( p2/p1 ) ]
R2 : log(f*) < log[ ( c(1|2)/c(2|1) ) ( p2/p1 ) ]


Classification Rule in Practice

The rule on the previous slide depends on the population parameters µ1, µ2, and Σ, which are often unknown in practice.

Given n1 independent observations from π1 and n2 independent observations from π2, we can estimate the needed parameters:

µ̂1 = x̄1 = (1/n1) ∑_{i=1}^{n1} xi(1)   and   µ̂2 = x̄2 = (1/n2) ∑_{i=1}^{n2} xi(2)

Σ̂ = Sp = [ ∑_{i=1}^{n1} (xi(1) − x̄1)(xi(1) − x̄1)′ + ∑_{i=1}^{n2} (xi(2) − x̄2)(xi(2) − x̄2)′ ] / (n1 + n2 − 2)

The estimated classification rule replaces f* with its sample estimate:

f̂* = exp{ (x̄1 − x̄2)′ Sp⁻¹ x − (1/2)(x̄1 − x̄2)′ Sp⁻¹ (x̄1 + x̄2) }


Classification Rule in Practice (continued)

If ν = ( c(1|2)/c(2|1) ) ( p2/p1 ) = 1, then the rule becomes

R1 : y ≥ m
R2 : y < m

where

y = a′x   and   m = (1/2)(ȳ1 + ȳ2)

with a′ = (x̄1 − x̄2)′ Sp⁻¹, ȳ1 = a′x̄1, and ȳ2 = a′x̄2.

Scale of a is not uniquely determined, so normalize a using either:
  1. a* = a/‖a‖ (unit length)
  2. a* = a/a1 (first element 1)
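A minimal sketch of this sample rule computed by hand, using two of the iris species (from the example later in the notes) as π1 and π2; the choice of species is purely illustrative:

X1 <- as.matrix(iris[iris$Species == "versicolor", 1:4])   # sample from pi_1
X2 <- as.matrix(iris[iris$Species == "virginica", 1:4])    # sample from pi_2
n1 <- nrow(X1); n2 <- nrow(X2)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance
a <- solve(Sp, xbar1 - xbar2)                # a = Sp^{-1} (xbar1 - xbar2)
m <- 0.5 * sum(a * (xbar1 + xbar2))          # midpoint of the two projected means
y2 <- as.numeric(X2 %*% a)                   # discriminant scores for the pi_2 sample
mean(y2 >= m)                                # proportion of pi_2 cases classified as pi_1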


Fisher’s Linear Discriminant Function

R. A. Fisher arrived at the decision rule on the previous slide using an entirely different argument.

Fisher considered finding the linear combination Y = a′X that best separates the groups:

separation = |ȳ1 − ȳ2| / sy

where
  ȳ1 is the mean of the Y scores for the observations from π1
  ȳ2 is the mean of the Y scores for the observations from π2
  sy² = [ ∑_{i=1}^{n1} (yi(1) − ȳ1)² + ∑_{i=1}^{n2} (yi(2) − ȳ2)² ] / (n1 + n2 − 2) is the pooled variance


Fisher’s Linear Discriminant Function (continued)

Setting a′ = (x̄1 − x̄2)′ Sp⁻¹ maximizes the separation

separation² = (ȳ1 − ȳ2)² / sy²
            = (a′x̄1 − a′x̄2)² / (a′ Sp a)
            = (a′d)² / (a′ Sp a)
            = d′ Sp⁻¹ d
            = D²

over all possible a vectors, where d = x̄1 − x̄2.


Visualizing Fisher’s Linear Discriminant Function

Figure: Figure 11.5 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.


MVN Two Population Classification Problem (Σ1 ≠ Σ2)

Let X = (X1, . . . , Xp)′ denote a random vector, and let
  f1(x) ∼ N(µ1, Σ1) denote the pdf for population π1
  f2(x) ∼ N(µ2, Σ2) denote the pdf for population π2

Problem: Given a realization X = x, we want to assign x to π1 or π2.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1 or π2.


Classification Rule Minimizing ECM (Σ1 ≠ Σ2)

The multivariate normal densities have the form

fk(x) = (2π)^(−p/2) |Σk|^(−1/2) exp{ −(1/2)(x − µk)′ Σk⁻¹ (x − µk) }

for k ∈ {1, 2}, which implies that

f* = f1(x)/f2(x)
   = ( |Σ1|/|Σ2| )^(−1/2) exp{ −(1/2)(x − µ1)′ Σ1⁻¹ (x − µ1) + (1/2)(x − µ2)′ Σ2⁻¹ (x − µ2) }

The R1 and R2 that minimize the ECM are defined via the inequalities:

R1 : log(f*) ≥ log[ ( c(1|2)/c(2|1) ) ( p2/p1 ) ]
R2 : log(f*) < log[ ( c(1|2)/c(2|1) ) ( p2/p1 ) ]


Classification Rule in Practice (Σ1 ≠ Σ2)

The rule on the previous slide depends on the population parameters µ1, µ2, Σ1, and Σ2, which are often unknown in practice.

Given n1 independent observations from π1 and n2 independent observations from π2, we can estimate the needed parameters:

µ̂1 = x̄1 = (1/n1) ∑_{i=1}^{n1} xi(1)   and   Σ̂1 = S1 = (1/(n1 − 1)) ∑_{i=1}^{n1} (xi(1) − x̄1)(xi(1) − x̄1)′

µ̂2 = x̄2 = (1/n2) ∑_{i=1}^{n2} xi(2)   and   Σ̂2 = S2 = (1/(n2 − 1)) ∑_{i=1}^{n2} (xi(2) − x̄2)(xi(2) − x̄2)′

The estimated classification rule replaces f* with its sample estimate:

f̂* = ( |S1|/|S2| )^(−1/2) exp{ −(1/2)(x − x̄1)′ S1⁻¹ (x − x̄1) + (1/2)(x − x̄2)′ S2⁻¹ (x − x̄2) }


Classification Rule in Practice (Σ1 ≠ Σ2), continued

Note that we can write

log(f̂*) = log[ ( |S1|/|S2| )^(−1/2) e^{ −(1/2)(x − x̄1)′ S1⁻¹ (x − x̄1) + (1/2)(x − x̄2)′ S2⁻¹ (x − x̄2) } ] = y − m

where

y = −(1/2) x′( S1⁻¹ − S2⁻¹ )x + ( x̄1′ S1⁻¹ − x̄2′ S2⁻¹ )x

m = (1/2) log( |S1|/|S2| ) + (1/2)( x̄1′ S1⁻¹ x̄1 − x̄2′ S2⁻¹ x̄2 )

y is a quadratic function of x, so this is a quadratic classification rule.
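A minimal sketch of this quadratic rule by hand, again using two of the iris species as π1 and π2 (illustrative): compute log(f̂*) for a new x and classify to π1 when it is at least log(ν), here with ν = 1.

X1 <- as.matrix(iris[iris$Species == "versicolor", 1:4])   # sample from pi_1
X2 <- as.matrix(iris[iris$Species == "virginica", 1:4])    # sample from pi_2
S1 <- cov(X1); S2 <- cov(X2)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
log_fstar <- function(x) {
  # -(1/2)log(|S1|/|S2|) - (1/2)(x-xbar1)'S1^{-1}(x-xbar1) + (1/2)(x-xbar2)'S2^{-1}(x-xbar2)
  -0.5 * log(det(S1) / det(S2)) -
    0.5 * sum((x - xbar1) * solve(S1, x - xbar1)) +
    0.5 * sum((x - xbar2) * solve(S2, x - xbar2))
}
x_new <- as.numeric(iris[51, 1:4])                 # a versicolor flower
ifelse(log_fstar(x_new) >= 0, "pi_1", "pi_2")      # log(nu) = 0 when nu = 1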


Caution: Quadratic Classification of Non-Normal Data

Figure: Figure 11.6 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 1 variable.

Evaluating Classification Functions


Quantifying the Quality of a Classification Rule

To determine if a classification rule is “good” we can examine the error rates, i.e., misclassification probabilities.

The population parameters are unknown in practice, so we focus on approaches that can estimate the error rates from the observed data.

We want our classification rule to cross-validate to new data, so we consider cross-validation procedures.


Total Probability of Misclassification

The Total Probability of Misclassification (TPM) is defined as

TPM(R1, R2) = p1 ∫_{R2} f1(x) dx + p2 ∫_{R1} f2(x) dx

for any classification rule (region) that partitions Ω = R1 ∪ R2.

The Optimum Error Rate (OER) is the minimum possible value of TPM,

OER = min_{R1,R2} TPM(R1, R2) subject to Ω = R1 ∪ R2,

which is obtained when R1 : f1(x)/f2(x) ≥ p2/p1 and R2 : f1(x)/f2(x) < p2/p1.

If c(1|2) = c(2|1), minimizing TPM is the same as minimizing ECM.


Actual Error Rate

The error rates on the previous slide require knowledge of the (typically unknown) parameters that define the densities f1(·) and f2(·).

Example: For LDA, calculating OER requires µ1, µ2, and Σ.

The Actual Error Rate (AER) is defined using the sample estimates

AER(R̂1, R̂2) = p1 ∫_{R̂2} f1(x) dx + p2 ∫_{R̂1} f2(x) dx

where R̂1 and R̂2 denote regions estimated from samples of sizes n1 and n2.


Apparent Error Rate

The Apparent Error Rate (APER) is an optimistic estimate of the AER: it estimates the AER using the observed (training) sample of data.

The confusion matrix for a sample of data is

                  Classified as:
                  π1        π2
Truth:   π1       nC1       nM1       n1
         π2       nM2       nC2       n2

where
  nCk is the number correctly classified in population k ∈ {1, 2}
  nM1 = n1 − nC1 is the number from π1 that are misclassified
  nM2 = n2 − nC2 is the number from π2 that are misclassified


Apparent Error Rate (continued)

Given a sample of data with confusion matrix

                  Classified as:
                  π1        π2
Truth:   π1       nC1       nM1       n1
         π2       nM2       nC2       n2

the APER is calculated as

APER = (nM1 + nM2) / (n1 + n2)

which is the total proportion of misclassified sample observations.


Leave-One-Out (Ordinary) Cross-Validation

Lachenbruch proposed a better approach to estimate the AER:

1. Population 1 (for i = 1, . . . , n1)
   (a) Hold out the i-th observation from π1 and build classification rule
   (b) Use classification rule from Step 1(a) to classify the i-th observation

2. Population 2 (for i = 1, . . . , n2)
   (a) Hold out the i-th observation from π2 and build classification rule
   (b) Use classification rule from Step 2(a) to classify the i-th observation

An (almost) unbiased estimate of the expected AER is given by

E(AER) = (n*M1 + n*M2) / (n1 + n2)

where n*M1 and n*M2 are the number of misclassified observations using the above “leave-one-out” procedure.
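A minimal sketch of the leave-one-out procedure, using MASS::lda and two of the iris species as the two populations (the species choice and the equal priors are illustrative):

library(MASS)
dat <- droplevels(subset(iris, Species != "setosa"))   # two populations: versicolor, virginica
n <- nrow(dat)
miss <- logical(n)
for (i in 1:n) {
  # hold out observation i, rebuild the rule, then classify the held-out case
  fit <- lda(Species ~ ., data = dat[-i, ], prior = c(0.5, 0.5))
  miss[i] <- predict(fit, newdata = dat[i, ])$class != dat$Species[i]
}
mean(miss)    # (n*_M1 + n*_M2) / (n1 + n2)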


Revisiting Linear Discriminant Analysis

Let X = (X1, . . . , Xp)′ denote a random vector, and let
  f1(x) ∼ N(µ1, Σ) denote the pdf for population π1
  f2(x) ∼ N(µ2, Σ) denote the pdf for population π2

Reminder: assuming that ( c(1|2)/c(2|1) ) ( p2/p1 ) = 1, the classification rule is

R1 : Y ≥ m
R2 : Y < m

where

Y = a′X   and   m = (1/2)(µY1 + µY2)

with a′ = (µ1 − µ2)′ Σ⁻¹, µY1 = a′µ1, and µY2 = a′µ2.


Revisiting Linear Discriminant Analysis (continued)

Y = a′X = (µ1 − µ2)′Σ⁻¹X is a linear function of X, so . . .
  µY1 = a′µ1 = (µ1 − µ2)′Σ⁻¹µ1
  µY2 = a′µ2 = (µ1 − µ2)′Σ⁻¹µ2
  σY² = a′Σa = (µ1 − µ2)′Σ⁻¹(µ1 − µ2) = ∆²

And since X is multivariate normal, we have that

Y ∼ N(µY1, ∆²) if from π1
Y ∼ N(µY2, ∆²) if from π2

i.e., Y is univariate normal with a population-dependent mean.


Visualizing Misclassification in LDA

Figure: Figure 11.7 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern).


Calculating Misclassification in LDA (classify π1 as π2)

Defining m = (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2), we have that

P(misclassify π1 as π2) = P(X ∈ R2 | π1) = P(2|1)
  = P(Y < m)
  = P( (Y − µY1)/σY < (m − (µ1 − µ2)′Σ⁻¹µ1)/∆ )
  = P( Z < −(1/2)∆ )
  = Φ(−∆/2)

where Φ(·) denotes the CDF of the standard normal distribution.


Calculating Misclassification in LDA (classify π2 as π1)

Defining m = (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2), we have that

P(misclassify π2 as π1) = P(X ∈ R1 | π2) = P(1|2)
  = P(Y ≥ m)
  = P( (Y − µY2)/σY ≥ (m − (µ1 − µ2)′Σ⁻¹µ2)/∆ )
  = P( Z ≥ (1/2)∆ )
  = 1 − Φ(∆/2) = Φ(−∆/2)

where Φ(·) denotes the CDF of the standard normal distribution.


Optimum Error Rate for Linear Discriminant Analysis

For the LDA classification rule, we have that

OER = min_{R1,R2} TPM(R1, R2)
    = (1/2) P(misclassify π1 as π2) + (1/2) P(misclassify π2 as π1)
    = (1/2) Φ(−∆/2) + (1/2) [1 − Φ(∆/2)]
    = Φ(−∆/2)

so the OER is a function of the effect size

∆ = √( (µ1 − µ2)′ Σ⁻¹ (µ1 − µ2) )

which is a distance measure between µ1 and µ2.
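A minimal sketch of estimating the OER by plugging sample means and the pooled covariance into ∆, using two iris species as the two populations (illustrative, and it assumes p1 = p2 = 1/2):

X1 <- as.matrix(iris[iris$Species == "versicolor", 1:4])
X2 <- as.matrix(iris[iris$Species == "virginica", 1:4])
n1 <- nrow(X1); n2 <- nrow(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance
d  <- colMeans(X1) - colMeans(X2)
Delta <- sqrt(sum(d * solve(Sp, d)))   # plug-in estimate of sqrt((mu1-mu2)' Sigma^{-1} (mu1-mu2))
pnorm(-Delta / 2)                      # estimated optimum error rate Phi(-Delta/2)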


Classifying g ≥ 2 Populations


The g Population Classification Problem

Let X = (X1, . . . , Xp)′ denote a random vector and let fk(x) denote the probability density function (pdf) for population πk, for k ∈ {1, . . . , g}.

Problem: Given a realization X = x, we want to assign x to a πk.

We want to find some classification rule to determine whether a realization X = x should be assigned to population π1, π2, . . ., or πg.


Classification Rule with g ≥ 2 Populations

Let Ω denote the sample space, i.e., all possible values of x, and
  R1 ⊂ Ω is the subset of Ω for which we classify x as π1
  R2 ⊂ Ω is the subset of Ω for which we classify x as π2
  ...
  Rg ⊂ Ω is the subset of Ω for which we classify x as πg

Ω = R1 ∪ R2 ∪ · · · ∪ Rg and Rk ∩ Rℓ = ∅ for all k ≠ ℓ.
  The classification rule partitions the sample space
  The classification regions are mutually exclusive


Visualizing a Classification Rule: g = 3 Populations

Figure: Figure 11.10 from Applied Multivariate Statistical Analysis, 6th Ed (Johnson & Wichern). Visualization is for p = 2 variables.


Probability and Cost of Misclassification

The conditional probability P(ℓ|k) of classifying an object as πℓ when the object really belongs to πk is given by

P(ℓ|k) = P(X ∈ Rℓ | πk) = ∫_{Rℓ} fk(x) dx

for all k ≠ ℓ with k, ℓ ∈ {1, . . . , g}.

Note that P(k|k) = 1 − ∑_{ℓ≠k} P(ℓ|k) by definition.

Let c(ℓ|k) denote the cost of allocating an object to πℓ when the object really belongs to πk, and let pk denote the prior probability of πk.


Expected Cost of Misclassification (revisited)

The conditional expected cost of misclassifying an object from πk is

ECM(k) = ∑_{ℓ≠k} P(ℓ|k) c(ℓ|k)

Incorporating the prior probabilities, the overall ECM is given by

ECM = ∑_{k=1}^{g} pk ECM(k) = ∑_{k=1}^{g} pk ∑_{ℓ≠k} P(ℓ|k) c(ℓ|k)


Minimum ECM Classification Rule

The classification regions R1, R2, . . . , Rg that minimize the ECM are defined by allocating X = x to the population πk that minimizes

∑_{ℓ≠k} pℓ fℓ(x) c(k|ℓ)

To understand the logic of the classification rule, suppose that we have equal costs, i.e., c(ℓ|k) = c(k|ℓ) = 1 for all k, ℓ ∈ {1, . . . , g}:
  We allocate x to the population πk that minimizes ∑_{ℓ≠k} pℓ fℓ(x)
  Minimizing ∑_{ℓ≠k} pℓ fℓ(x) is the same as maximizing pk fk(x)
  Allocate x to population πk if pk fk(x) > pℓ fℓ(x) for all ℓ ≠ k
  This is equivalent to maximizing the posterior probability P(πk | x)
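A minimal sketch of the equal-cost rule (allocate to the population with the largest pk fk(x)), using hypothetical univariate normal populations and priors (all values below are illustrative):

mus   <- c(0, 2, 5)          # population means (illustrative)
sds   <- c(1, 1, 2)          # population standard deviations (illustrative)
prior <- c(0.5, 0.3, 0.2)    # prior probabilities p_k (illustrative)
classify_min_ecm <- function(x) {
  pf <- sapply(1:3, function(k) prior[k] * dnorm(x, mus[k], sds[k]))   # p_k f_k(x)
  which.max(pf)              # index of the population maximizing p_k f_k(x)
}
sapply(c(-1, 1.5, 6), classify_min_ecm)    # allocate each value of x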


Overview of Fisher’s Approach

Fisher developed his discriminant analysis for g > 2 populations.

Idea: find a small number of linear combinations (e.g., a1′x, a2′x, a3′x) that best separate the groups.

Offers a simple and useful procedure for classification, which also provides nice visualizations.

Plot the linear combinations to visualize the discriminants


Assumptions of Fisher’s Discriminant Analysis

Let X = (X1, . . . , Xp)′ denote a random vector and let fk(x) ∼ (µk, Σ) denote the pdf for population πk.
  Note the homogeneity of covariance matrix assumption
  Do not need the multivariate normality assumption

Let µ̄ = (1/g) ∑_{k=1}^{g} µk denote the mean of the combined populations, and

Bµ = ∑_{k=1}^{g} (µk − µ̄)(µk − µ̄)′

denote the “Between” sum-of-squares and crossproducts (SSCP) matrix.


Properties of a Linear Combination

Define new variable Y = a′X which has properties

E(Y | πk) = a′E(X | πk) = a′µk
V(Y | πk) = a′V(X | πk)a = a′Σa

and note that the overall mean of Y has the form

µ̄Y = (1/g) ∑_{k=1}^{g} µYk = (1/g) ∑_{k=1}^{g} a′µk = a′µ̄


Between versus Within Group Variability

Form the ratio of the between group separation over the variance of Y:

F* = [ ∑_{k=1}^{g} (µYk − µ̄Y)² ] / σY²
   = [ ∑_{k=1}^{g} (a′µk − a′µ̄)² ] / (a′Σa)
   = a′[ ∑_{k=1}^{g} (µk − µ̄)(µk − µ̄)′ ]a / (a′Σa)
   = (a′Bµa) / (a′Σa)

Note that higher F* values relate to more separation between groups.


Population Discriminants

The k-th population discriminant is the linear combination

Yk = ak′X

where ak is proportional to the k-th eigenvector of Σ⁻¹Bµ, for k = 1, . . . , s with s = min(g − 1, p).

The ak are scaled to make the Yk have unit variance, i.e., ak′Σak = 1, and ak′Σaℓ = 0 for k ≠ ℓ.

Note that this is only useful if we somehow know the true population parameters µ1, . . . , µg and Σ.


Sample Discriminants

The sample estimated “Between” and “Within” SSCP matrices are

B = ∑_{k=1}^{g} (x̄k − x̄)(x̄k − x̄)′   and   W = ∑_{k=1}^{g} ∑_{i=1}^{nk} (xi(k) − x̄k)(xi(k) − x̄k)′

where x̄k = (1/nk) ∑_{i=1}^{nk} xi(k) and x̄ = (1/g) ∑_{k=1}^{g} x̄k.

The k-th sample discriminant is the linear combination

Ŷk = âk′X

where âk is proportional to the k-th eigenvector of W⁻¹B.

The âk are scaled to make the Ŷk have unit variance, i.e., âk′Σ̂âk = 1, where Σ̂ = Sp = W/(n − g) with n = ∑_{k=1}^{g} nk.
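A minimal sketch of these sample discriminants computed by hand for the iris data (illustrative; the comparison to the MASS::lda scaling assumes the balanced iris design, where the unweighted and weighted group means coincide):

X <- as.matrix(iris[, 1:4]); grp <- iris$Species
xbark <- apply(X, 2, tapply, grp, mean)            # g x p matrix of group means
xbar  <- colMeans(xbark)                           # unweighted mean of the group means
B <- crossprod(sweep(xbark, 2, xbar))              # "Between" SSCP matrix
W <- Reduce(`+`, lapply(levels(grp), function(k) {
  Xk <- X[grp == k, , drop = FALSE]
  crossprod(sweep(Xk, 2, colMeans(Xk)))            # within-group SSCP for group k
}))
a1 <- Re(eigen(solve(W) %*% B)$vectors[, 1])       # first eigenvector of W^{-1} B
Sp <- W / (nrow(X) - nlevels(grp))                 # pooled covariance estimate
a1 / sqrt(sum(a1 * (Sp %*% a1)))                   # rescaled so a1' Sp a1 = 1; should match
                                                   # lda(Species ~ ., iris)$scaling[, "LD1"] up to sign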


Properties of Population Discriminants

Let Y = A′X where A = [a1, . . . , as].
  Y = (Y1, . . . , Ys)′ contains the s discriminants
  Columns of A contain the linear combination weights

The mean of Y is given by

E(Y | πk) = A′E(X | πk) = A′µk = µkY

and the covariance matrix for Y is

Cov(Y) = A′Cov(X | πk)A = A′ΣA = Is

because the discriminants have unit variance and are uncorrelated.
  Remember: ak′Σaℓ = δkℓ where δkℓ is Kronecker's δ


Classifying New Objects with Discriminants

Given a realization X = x, define y = A′x and calculate the distance between the observed y = (y1, . . . , ys)′ and the k-th population mean:

Dk = (y − µkY)′(y − µkY) = ∑_{ℓ=1}^{s} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{s} [ aℓ′(x − µk) ]²

where µkY = A′µk, yℓ = aℓ′x, and µkYℓ = aℓ′µk.

To build a distance using r ≤ s discriminants, use

Dk^(r) = ∑_{ℓ=1}^{r} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{r} [ aℓ′(x − µk) ]²

and classify x to the population πk that minimizes the distance Dk^(r).


Classifying New Objects with Sample Discriminants

Given a realization X = x, define y = A′x and calculate the distance between the observed y = (y1, . . . , ys)′ and the k-th sample mean:

Dk = (y − µkY)′(y − µkY) = ∑_{ℓ=1}^{s} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{s} [ aℓ′(x − x̄k) ]²

where µkY = A′x̄k, yℓ = aℓ′x, and µkYℓ = aℓ′x̄k (with A and the aℓ now the sample discriminant weights).

To build a distance using r ≤ s discriminants, use

Dk^(r) = ∑_{ℓ=1}^{r} (yℓ − µkYℓ)² = ∑_{ℓ=1}^{r} [ aℓ′(x − x̄k) ]²

and classify x to the population πk that minimizes the distance Dk^(r).
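A minimal sketch of this distance-based allocation for the iris data, using the scalings returned by MASS::lda as the sample discriminant weights (illustrative):

library(MASS)
ldamod <- lda(Species ~ ., data = iris, prior = rep(1/3, 3))
A <- ldamod$scaling                            # columns hold a_1, ..., a_s
muY <- ldamod$means %*% A                      # projected group means (one row per group)
classify_dist <- function(x, r = ncol(A)) {
  y <- drop(crossprod(A[, 1:r, drop = FALSE], x))          # y_l = a_l' x
  d <- rowSums((muY[, 1:r, drop = FALSE] -
                matrix(y, nrow(muY), r, byrow = TRUE))^2)  # D_k^(r) for each group
  names(which.min(d))                          # group with the smallest distance
}
classify_dist(as.numeric(iris[1, 1:4]))        # allocate the first flower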


Relation to MVN Classification Problem

Let X = (X1, . . . , Xp)′ be a random vector and let fk(x) ∼ N(µk, Σk) denote the pdf for population πk.

Assuming equal misclassification costs, we allocate X = x to the population πk that minimizes ∑_{ℓ≠k} pℓ fℓ(x) ⇐⇒ maximizes pk fk(x).

Equivalent to allocating X = x to the population πk that maximizes

dQk(x) = Quadratic discriminant score
       = −(1/2) ln(|Σk|) − (1/2)(x − µk)′ Σk⁻¹ (x − µk) + ln(pk)

dLk(x) = Linear discriminant score
       = µk′ Σ⁻¹ x − (1/2) µk′ Σ⁻¹ µk + ln(pk)

where dLk is used when Σk = Σ for all k ∈ {1, . . . , g}.
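A minimal sketch of the linear discriminant score dLk(x), with sample plug-ins for µk and Σ computed from the iris data and equal priors (illustrative):

X <- as.matrix(iris[, 1:4]); grp <- iris$Species
means <- apply(X, 2, tapply, grp, mean)                   # group means, one row per group
Sp <- Reduce(`+`, lapply(levels(grp), function(k) {
  Xk <- X[grp == k, , drop = FALSE]
  (nrow(Xk) - 1) * cov(Xk)
})) / (nrow(X) - nlevels(grp))                            # pooled covariance estimate
d_lin <- function(x, prior = rep(1/3, 3)) {
  sapply(1:nrow(means), function(k) {
    mk <- means[k, ]
    sum(mk * solve(Sp, x)) - 0.5 * sum(mk * solve(Sp, mk)) + log(prior[k])
  })
}
which.max(d_lin(X[51, ]))    # allocate flower 51 to the score-maximizing population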


Relation to MVN Classification Problem (continued)

If we assume that pk = 1/g for all k ∈ {1, . . . , g}, then

dLk(x) = µk′ Σ⁻¹ x − (1/2) µk′ Σ⁻¹ µk

Define the linear combination yj = aj′x, where aj = Σ^(−1/2) vj with vj denoting the j-th eigenvector of Σ^(−1/2) Bµ Σ^(−1/2). Then

Dk = ∑_{j=1}^{p} (yj − µkYj)² = ∑_{j=1}^{p} [ aj′(x − µk) ]² = (x − µk)′ Σ⁻¹ (x − µk)
   = −2 dLk(x) + α

where α = x′Σ⁻¹x is constant across populations.

If rank(Bµ) = r, allocating to the population πk that maximizes dLk(x) is equivalent to allocating to the population πk that minimizes Dk^(r).


Fisher’s Iris Data Example

Fisher’s Iris Data Example

Nathaniel E. Helwig (U of Minnesota) Discrimination and Classification Updated 14-Mar-2017 : Slide 57

Page 58: Nathaniel E. Helwig

Fisher’s Iris Data Example Data Overview

Fisher’s (or Anderson’s) Famous Iris Data

R. A. Fisher published the LDA approach in 1936 and used Edgar Anderson's iris flower dataset as an example.

The dataset consists of measurements of p = 4 variables taken from nk = 50 flowers randomly sampled from each of g = 3 species.

Variables: Sepal Length, Sepal Width, Petal Length, Petal Width
Species: setosa, versicolor, virginica

The goal was/is to build a linear discriminant function that best classifies a new flower into one of the three species.


Fisher’s Famous Iris Data in R

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> colMeans(iris[iris$Species=="setosa",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
> colMeans(iris[iris$Species=="versicolor",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
> colMeans(iris[iris$Species=="virginica",1:4])
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
> p <- 4L
> g <- 3L


Make Pooled Covariance Matrix

# make pooled covariance matrix
> Sp <- matrix(0, p, p)
> nx <- rep(0, g)
> lev <- levels(iris$Species)
> for(k in 1:g){
+   x <- iris[iris$Species==lev[k],1:p]
+   nx[k] <- nrow(x)
+   Sp <- Sp + cov(x) * (nx[k] - 1)
+ }
> Sp <- Sp / (sum(nx) - g)
> round(Sp, 3)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        0.265       0.093        0.168       0.038
Sepal.Width         0.093       0.115        0.055       0.033
Petal.Length        0.168       0.055        0.185       0.043
Petal.Width         0.038       0.033        0.043       0.042


LDA in R via the lda Function (MASS Package)

# fit lda model
> library(MASS)
> ldamod <- lda(Species ~ ., data=iris, prior=rep(1/3, 3))

# check the LDA coefficients/scalings
> ldamod$scaling
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785
> crossprod(ldamod$scaling, Sp) %*% ldamod$scaling
             LD1          LD2
LD1 1.00000e+00 -7.21645e-16
LD2 -7.21645e-16  1.00000e+00

# create the (centered) discriminant scores
> mu.k <- ldamod$means
> mu <- colMeans(mu.k)
> dscores <- scale(iris[,1:p], center=mu, scale=F) %*% ldamod$scaling
> sum((dscores - predict(ldamod)$x)^2)
[1] 1.658958e-28


Plot LDA Results: Score and Coefficients

[Figure: two panels. Left panel "Discriminant Scores" plots LD2 against LD1 for the three species (setosa, versicolor, virginica). Right panel "Discriminant Coefficients" plots the LD1 and LD2 coefficients for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.]

R code for left plot:
plot(dscores, xlab="LD1", ylab="LD2", pch=spid, col=spid,
     main="Discriminant Scores", xlim=c(-10, 10), ylim=c(-3, 3))
legend("top", lev, pch=1:3, col=1:3, bty="n")


Plot LDA Results: Discriminant Partitions

[Figure: klaR partition plot of the LDA rule in the discriminant space (LD2 versus LD1), with the three species plotted as s, c, and v; app. error rate: 0.02]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=dscores[,2:1], grouping=species, method="lda")


Plot LDA Results: All Pairwise Partitions

[Figure: klaR partition plots of the LDA rule for all pairwise combinations of the four variables, with the three species plotted as s, c, and v. App. error rates: Sepal.Width vs Sepal.Length 0.2, Petal.Length vs Sepal.Length 0.033, Petal.Length vs Sepal.Width 0.047, Petal.Width vs Sepal.Length 0.04, Petal.Width vs Sepal.Width 0.033, Petal.Width vs Petal.Length 0.04]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=iris[,1:4], grouping=species, method="lda")


APER and Expected AER

# make confusion matrix (and APER)
> confusion <- table(iris$Species, predict(ldamod)$class)
> confusion
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> n <- sum(confusion)
> aper <- (n - sum(diag(confusion))) / n
> aper
[1] 0.02

# use CV to get expected AER
> ldamodCV <- lda(Species ~ ., data=iris, prior=rep(1/3, 3), CV=TRUE)
> confusionCV <- table(iris$Species, ldamodCV$class)
> confusionCV
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> eaer <- (n - sum(diag(confusionCV))) / n
> eaer
[1] 0.02


Split Data into Training (70%) and Testing (30%) Sets

> # split into separate matrices for each flower
> Xs <- subset(iris, Species=="setosa")
> Xc <- subset(iris, Species=="versicolor")
> Xv <- subset(iris, Species=="virginica")

# split into training and testing
> set.seed(1)
> sid <- sample.int(n=50, size=35)
> cid <- sample.int(n=50, size=35)
> vid <- sample.int(n=50, size=35)
> Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
> Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])

# fit lda to training and evaluate on testing
> ldatrain <- lda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
> confusionTest <- table(Xtest$Species, predict(ldatrain, newdata=Xtest)$class)
> confusionTest
             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          1        14
> n <- sum(confusionTest)
> aer <- (n - sum(diag(confusionTest))) / n
> aer
[1] 0.02222222


Two-Fold CV with 100 Random 70/30 Splits

> nrep <- 100
> aer <- rep(0, nrep)
> set.seed(1)
> for(k in 1:nrep){
+   sid <- sample.int(n=50, size=35)
+   cid <- sample.int(n=50, size=35)
+   vid <- sample.int(n=50, size=35)
+   Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
+   Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])
+   ldatrain <- lda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
+   confusionTest <- table(Xtest$Species, predict(ldatrain, newdata=Xtest)$class)
+   confusionTest
+   n <- sum(confusionTest)
+   aer[k] <- (n - sum(diag(confusionTest))) / n
+ }
> mean(aer)
[1] 0.022


QDA in R via the qda Function (MASS Package)

# fit qda model
> library(MASS)
> qdamod <- qda(Species ~ ., data=iris, prior=rep(1/3, 3))
> names(qdamod)
 [1] "prior"   "counts"  "means"   "scaling" "ldet"    "lev"     "N"
 [8] "call"    "terms"   "xlevels"

# check the QDA coefficients/scalings
> dim(qdamod$scaling)
[1] 4 4 3
> round(crossprod(qdamod$scaling[,,1], cov(Xs[,1:p])) %*% qdamod$scaling[,,1], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> round(crossprod(qdamod$scaling[,,2], cov(Xc[,1:p])) %*% qdamod$scaling[,,2], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> round(crossprod(qdamod$scaling[,,3], cov(Xv[,1:p])) %*% qdamod$scaling[,,3], 4)
  1 2 3 4
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1


Plot QDA Results: All Pairwise Partitions

[Figure: klaR partition plots of the QDA rule for all pairwise combinations of the four variables, with the three species plotted as s, c, and v. App. error rates: Sepal.Width vs Sepal.Length 0.2, Petal.Length vs Sepal.Length 0.04, Petal.Length vs Sepal.Width 0.047, Petal.Width vs Sepal.Length 0.033, Petal.Width vs Sepal.Width 0.047, Petal.Width vs Petal.Length 0.02]

library(klaR)
species <- factor(rep(c("s","c","v"), each=50))
partimat(x=iris[,1:4], grouping=species, method="qda")


APER and Expected AER

# make confusion matrix (and APER)
> confusion <- table(iris$Species, predict(qdamod)$class)
> confusion
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> n <- sum(confusion)
> aper <- (n - sum(diag(confusion))) / n
> aper
[1] 0.02

# use CV to get expected AER
> qdamodCV <- qda(Species ~ ., data=iris, prior=rep(1/3, 3), CV=TRUE)
> confusionCV <- table(iris$Species, qdamodCV$class)
> confusionCV
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          1        49
> eaer <- (n - sum(diag(confusionCV))) / n
> eaer
[1] 0.02666667


Split Data into Training (70%) and Testing (30%) Sets

> # split into separate matrices for each flower
> Xs <- subset(iris, Species=="setosa")
> Xc <- subset(iris, Species=="versicolor")
> Xv <- subset(iris, Species=="virginica")

> # split into training and testing
> set.seed(1)
> sid <- sample.int(n=50, size=35)
> cid <- sample.int(n=50, size=35)
> vid <- sample.int(n=50, size=35)
> Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
> Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])

# fit qda to training and evaluate on testing
> qdatrain <- qda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
> confusionTest <- table(Xtest$Species, predict(qdatrain, newdata=Xtest)$class)
> confusionTest
             setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          1        14
> n <- sum(confusionTest)
> aer <- (n - sum(diag(confusionTest))) / n
> aer
[1] 0.02222222


Two-Fold CV with 100 Random 70/30 Splits

> nrep <- 100
> aer <- rep(0, nrep)
> set.seed(1)
> for(k in 1:nrep){
+   sid <- sample.int(n=50, size=35)
+   cid <- sample.int(n=50, size=35)
+   vid <- sample.int(n=50, size=35)
+   Xtrain <- rbind(Xs[sid,], Xc[cid,], Xv[vid,])
+   Xtest <- rbind(Xs[-sid,], Xc[-cid,], Xv[-vid,])
+   qdatrain <- qda(Species ~ ., data=Xtrain, prior=rep(1/3, 3))
+   confusionTest <- table(Xtest$Species, predict(qdatrain, newdata=Xtest)$class)
+   confusionTest
+   n <- sum(confusionTest)
+   aer[k] <- (n - sum(diag(confusionTest))) / n
+ }
> mean(aer)
[1] 0.02466667


Plot LDA and QDA Results using PCA

[Figure: two scatterplots of the first two principal component scores (PC2 versus PC1). Left panel "LDA Results" and right panel "QDA Results" mark each flower by its predicted class (setosa, versicolor, virginica).]


Plot LDA and QDA Results using PCA (R code)

R code for plot on previous slide:

# visualize LDA and QDA results via PCA
ldaid <- as.integer(predict(ldamod)$class)
qdaid <- as.integer(predict(qdamod)$class)
pcamod <- princomp(iris[,1:4])
dev.new(width=10, height=5, noRStudioGD=TRUE)
par(mfrow=c(1,2))
plot(pcamod$scores[,1:2], xlab="PC1", ylab="PC2", pch=ldaid, col=ldaid,
     main="LDA Results", xlim=c(-4, 4), ylim=c(-2, 2))
legend("topright", lev, pch=1:3, col=1:3, bty="n")
abline(h=0, lty=3)
abline(v=0, lty=3)
plot(pcamod$scores[,1:2], xlab="PC1", ylab="PC2", pch=qdaid, col=qdaid,
     main="QDA Results", xlim=c(-4, 4), ylim=c(-2, 2))
legend("topright", lev, pch=1:3, col=1:3, bty="n")
abline(h=0, lty=3)
abline(v=0, lty=3)
