
Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis

CMSC 678, UMBC

Outline

Linear Algebra/Math Review

Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)

Covariance

covariance: how (linearly) correlated are variables

$\sigma_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \mu_i)(x_{kj} - \mu_j)$

where $x_{ki}$ is the value of variable i in object k and $\mu_i$ is the mean of variable i

The covariance matrix collects all pairwise covariances and is symmetric ($\sigma_{ij} = \sigma_{ji}$):

$\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1D} \\ \vdots & \ddots & \vdots \\ \sigma_{D1} & \cdots & \sigma_{DD} \end{pmatrix}$
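To make the formula concrete, here is a minimal numpy sketch (data values and variable names are illustrative, not from the slides) that computes the sample covariance matrix both from the definition and with np.cov:

```python
import numpy as np

# toy data: N objects (rows), D variables (columns)
X = np.array([[2.0, 1.0],
              [4.0, 2.0],
              [2.0, -1.0],
              [-2.0, -1.0]])
N, D = X.shape

mu = X.mean(axis=0)               # mean of each variable
Xc = X - mu                       # center the data
Sigma = (Xc.T @ Xc) / (N - 1)     # sigma_ij = 1/(N-1) * sum_k (x_ki - mu_i)(x_kj - mu_j)

# np.cov treats rows as variables by default, so pass rowvar=False
assert np.allclose(Sigma, np.cov(X, rowvar=False))
print(Sigma)                      # symmetric: Sigma[i, j] == Sigma[j, i]
```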

Eigenvalues and Eigenvectors

$Ax = \lambda x$   ($A$: matrix, $x$: vector, $\lambda$: scalar)

for a given matrix operation (multiplication): which non-zero vector(s) change only by a scaling, i.e., by a single scalar multiplication?

Example: $A = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}$

$\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + 5y \\ y \end{pmatrix}$, so we need $\begin{pmatrix} x + 5y \\ y \end{pmatrix} = \lambda \begin{pmatrix} x \\ y \end{pmatrix}$

the only non-zero vector (up to scaling) that is merely scaled is $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$:

$\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix}$
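A quick numerical check of this example with numpy's eigensolver (a sketch, not part of the original slides):

```python
import numpy as np

A = np.array([[1.0, 5.0],
              [0.0, 1.0]])

# A x = lambda x: compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [1. 1.]  -- a repeated eigenvalue of 1
print(eigenvectors)   # columns are eigenvectors; both numerically point along (1, 0)

# verify that (1, 0) is only scaled (here by lambda = 1) under multiplication by A
x = np.array([1.0, 0.0])
assert np.allclose(A @ x, 1.0 * x)
```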

Outline

Linear Algebra/Math Review

Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)

Dimensionality Reduction

Original (lightly preprocessed) data: N instances with D input features → compressed representation: N instances with L reduced features

Dimensionality Reduction

clarity of representation vs. ease of understanding

oversimplification: loss of important or relevant information

Courtesy Antano Žilinsko

Why “maximize” the variance?

How can we efficiently summarize? We maximize the variance captured within our summarization; we do not increase the variance in the dataset itself.

How can we capture the most information with the fewest number of axes?

Summarizing Redundant Information

Consider the points (2,1), (4,2), (2,-1), (-2,-1).

In the standard basis: (2,1) = 2*(1,0) + 1*(0,1)

Choose the basis u1 = (2,1), u2 = (2,-1) instead:
(2,1) = 1*(2,1) + 0*(2,-1)
(4,2) = 2*(2,1) + 0*(2,-1), i.e., 2u1
(-2,-1) = -u1

(Is it the most general? These vectors aren't orthogonal)
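As a sanity check on those coordinates, a small sketch (variable names are illustrative) that re-expresses each point in the u1 = (2,1), u2 = (2,-1) basis by solving a linear system:

```python
import numpy as np

# basis vectors as columns of B
u1, u2 = np.array([2.0, 1.0]), np.array([2.0, -1.0])
B = np.column_stack([u1, u2])

points = np.array([[2.0, 1.0], [4.0, 2.0], [2.0, -1.0], [-2.0, -1.0]])

# find coordinates c such that B @ c = p for each point p
for p in points:
    c = np.linalg.solve(B, p)
    print(p, "->", np.round(c, 3))
# (2, 1)  -> [ 1.  0.]   i.e. 1*u1 + 0*u2
# (4, 2)  -> [ 2.  0.]   i.e. 2*u1
# (2,-1)  -> [ 0.  1.]   i.e. u2
# (-2,-1) -> [-1.  0.]   i.e. -u1
```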

Outline

Linear Algebra/Math Review

Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)

Linear Discriminant Analysis (LDA, LDiscA) and Principal Component Analysis (PCA)

Summarize D-dimensional input data by uncorrelated axes

Uncorrelated axes are also called principal components

Use the first L components to account for as much variance as possible

Geometric Rationale of LDiscA & PCA

Objective: to rigidly rotate the axes of the D-dimensional space to new positions (principal axes):

ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, .... , and axis D has the lowest variance

covariance among each pair of the principal axes is zero (the principal axes are uncorrelated)

Courtesy Antano Žilinsko

Remember: MAP Classifiers are Optimal for Classification

$\min_{\hat{y}_i} \mathbb{E}_{y|x}\left[\ell_{0/1}(y, \hat{y}_i)\right] \;\rightarrow\; \max_{\hat{y}_i} p(\hat{y}_i = y_i \mid x_i)$

$p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid \hat{y}_i)\, p(\hat{y}_i)$

posterior $\propto$ class-conditional likelihood $\times$ class prior, with $x_i \in \mathbb{R}^D$

Linear Discriminant Analysis

MAP Classifier where:

1. class-conditional likelihoods are Gaussian

2. common covariance among class likelihoods

LDiscA: (1) What if likelihoods are Gaussian?

$p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid \hat{y}_i)\, p(\hat{y}_i)$

$p(x_i \mid k) = \mathcal{N}(\mu_k, \Sigma_k) = \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1}(x_i - \mu_k)\right)}{(2\pi)^{D/2}\,|\Sigma_k|^{1/2}}$

https://upload.wikimedia.org/wikipedia/commons/5/57/Multivariate_Gaussian.png

https://en.wikipedia.org/wiki/Normal_distribution
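A minimal sketch (assuming scipy is available; the mean, covariance, and point are illustrative values) that evaluates this class-conditional Gaussian density both from the formula above and with scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 2
mu_k = np.array([1.0, 2.0])              # class-k mean (illustrative)
Sigma_k = np.array([[2.0, 0.3],
                    [0.3, 1.0]])         # class-k covariance (illustrative)
x_i = np.array([1.5, 1.0])

# density from the formula: exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu)) / ((2 pi)^{D/2} |Sigma|^{1/2})
diff = x_i - mu_k
quad = diff @ np.linalg.inv(Sigma_k) @ diff
p_manual = np.exp(-0.5 * quad) / ((2 * np.pi) ** (D / 2) * np.linalg.det(Sigma_k) ** 0.5)

p_scipy = multivariate_normal(mean=mu_k, cov=Sigma_k).pdf(x_i)
assert np.isclose(p_manual, p_scipy)
```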

LDiscA: (2) Shared Covariance

$\log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(x_i \mid k)}{p(x_i \mid l)} + \log \frac{p(k)}{p(l)}$

$= \log \frac{p(k)}{p(l)} + \log \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^\top \Sigma^{-1}(x_i - \mu_k)\right) \big/ \left((2\pi)^{D/2}|\Sigma|^{1/2}\right)}{\exp\left(-\frac{1}{2}(x_i - \mu_l)^\top \Sigma^{-1}(x_i - \mu_l)\right) \big/ \left((2\pi)^{D/2}|\Sigma|^{1/2}\right)}$   (shared covariance: $\Sigma_l = \Sigma_k = \Sigma$)

$= \log \frac{p(k)}{p(l)} - \frac{1}{2}(\mu_k + \mu_l)^\top \Sigma^{-1}(\mu_k - \mu_l) + x_i^\top \Sigma^{-1}(\mu_k - \mu_l)$

linear in $x_i$ (check for yourself: why did the quadratic $x_i$ terms cancel?)

$= \left[x_i^\top \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log p(k)\right] - \left[x_i^\top \Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^\top \Sigma^{-1}\mu_l + \log p(l)\right]$

rewrite only in terms of $x_i$ (data) and single-class terms
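To see the cancellation numerically, here is a hedged sketch (all numbers are illustrative; scipy is assumed available) comparing the exact log posterior ratio of two shared-covariance Gaussians with the linear form derived above:

```python
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[1.5, 0.4], [0.4, 1.0]])       # shared covariance
mu_k, mu_l = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
p_k, p_l = 0.6, 0.4                              # class priors
x = np.array([0.3, -0.7])

# exact log posterior ratio: log p(k)/p(l) + log p(x|k)/p(x|l)
lhs = (np.log(p_k) - np.log(p_l)
       + multivariate_normal(mu_k, Sigma).logpdf(x)
       - multivariate_normal(mu_l, Sigma).logpdf(x))

# linear form: log p(k)/p(l) - 1/2 (mu_k + mu_l)^T Sigma^{-1} (mu_k - mu_l) + x^T Sigma^{-1} (mu_k - mu_l)
Sinv = np.linalg.inv(Sigma)
rhs = (np.log(p_k / p_l)
       - 0.5 * (mu_k + mu_l) @ Sinv @ (mu_k - mu_l)
       + x @ Sinv @ (mu_k - mu_l))
assert np.isclose(lhs, rhs)   # the quadratic terms in x cancel
```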

Classify via Linear Discriminant Functions

$\delta_k(x_i) = x_i^\top \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log p(k)$

$\arg\max_k\, \delta_k(x_i)$ is equivalent to the MAP classifier

LDiscA

Parameters to learn: $\{p(k)\}_k$, $\{\mu_k\}_k$, $\Sigma$

$p(k) \propto N_k$   (number of items labeled with class $k$)

$\mu_k = \frac{1}{N_k} \sum_{i:\, y_i = k} x_i$

$\Sigma = \frac{1}{N - K} \sum_k \mathrm{scatter}_k = \frac{1}{N - K} \sum_k \sum_{i:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\top$

(within-class covariance; one option for $\Sigma$)
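A compact sketch of estimating these parameters from labeled data (the function name fit_ldisca is illustrative, not from the course or any library; X is an N x D array and y an N-vector of labels):

```python
import numpy as np

def fit_ldisca(X, y):
    """Estimate class priors, class means, and the pooled within-class covariance."""
    N, D = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = {k: np.mean(y == k) for k in classes}        # p(k) proportional to N_k
    means = {k: X[y == k].mean(axis=0) for k in classes}  # mu_k
    scatter = np.zeros((D, D))
    for k in classes:
        Xc = X[y == k] - means[k]
        scatter += Xc.T @ Xc                              # scatter_k for class k
    Sigma = scatter / (N - K)                             # pooled within-class covariance
    return priors, means, Sigma
```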

Computational Steps for Full-Dimensional LDiscA

1. Compute means, priors, and covariance
2. Diagonalize the covariance (eigendecomposition): $\Sigma = U D U^\top$, with $D$ the diagonal matrix of eigenvalues and $U$ an orthonormal matrix of eigenvectors (one column per input feature)
3. Sphere the data (get unit covariance): $X^* = D^{-1/2} U^\top X$
4. Classify according to the linear discriminant functions $\delta_k(x_i^*)$
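Continuing the earlier sketch, here are steps 2 through 4 as an illustrative implementation (ldisca_predict is a hypothetical name; the covariance is assumed positive-definite so the sphering transform exists):

```python
import numpy as np

def ldisca_predict(X, priors, means, Sigma):
    """Sphere the data and classify with the linear discriminant functions delta_k."""
    # 2. diagonalize the shared covariance: Sigma = U D U^T
    eigvals, U = np.linalg.eigh(Sigma)
    # 3. sphere: x* = D^{-1/2} U^T x, written as a row-vector transform X @ W
    W = U @ np.diag(eigvals ** -0.5)
    Xs = X @ W
    # 4. linear discriminants in the sphered space (covariance is the identity there)
    classes = sorted(means)
    scores = []
    for k in classes:
        mk = means[k] @ W                                   # sphered class mean
        scores.append(Xs @ mk - 0.5 * mk @ mk + np.log(priors[k]))
    return np.array(classes)[np.argmax(np.column_stack(scores), axis=1)]
```

With the earlier fit_ldisca sketch, usage would look like `priors, means, Sigma = fit_ldisca(X_train, y_train)` followed by `ldisca_predict(X_test, priors, means, Sigma)`.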

Two Extensions to LDiscA

Quadratic Discriminant Analysis (QDA): keep separate covariances per class

$\delta_k(x_i) = -\frac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1}(x_i - \mu_k) + \log p(k) - \frac{\log|\Sigma_k|}{2}$

Regularized LDiscA

Interpolate between shared covariance estimate (LDiscA) and class-specific estimate (QDA)

$\Sigma_k(\alpha) = \alpha\Sigma_k + (1 - \alpha)\Sigma$
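The interpolation itself is a one-liner; a minimal sketch (function name illustrative; Sigma_k and Sigma are whatever class-specific and pooled estimates you computed above):

```python
import numpy as np

def regularized_covariance(Sigma_k, Sigma, alpha):
    """Interpolate between the class-specific (QDA) and shared (LDiscA) covariance estimates."""
    return alpha * Sigma_k + (1.0 - alpha) * Sigma   # alpha = 1 -> QDA, alpha = 0 -> LDiscA
```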

Vowel Classification

[Figures from ESL 4.3: LDiscA (left) vs. QDA (right) decision boundaries, and Regularized LDiscA with $\Sigma_k(\alpha) = \alpha\Sigma_k + (1 - \alpha)\Sigma$]

Supervised → Unsupervised

Supervised learning: learning with a teacher. You had training data which was (feature, label) pairs, and the goal was to learn a mapping from features to labels.

Unsupervised learning: learning without a teacher. Only features, no labels.

Why is unsupervised learning useful?
Visualization: dimensionality reduction
Lower-dimensional features might help learning
Discover hidden structures in the data: clustering

Outline

Linear Algebra/Math Review

Two Methods of Dimensionality Reduction:
Linear Discriminant Analysis (LDA, LDiscA)
Principal Component Analysis (PCA)

Geometric Rationale of LDiscA & PCA

Objective: to rigidly rotate the axes of the D-dimensional space to new positions (principal axes):

ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, .... , and axis D has the lowest variance

covariance among each pair of the principal axes is zero (the principal axes are uncorrelated)

Adapted from Antano Žilinsko

LDiscA: but now without the labels!

L-Dimensional PCA

1. Compute the mean and covariance (no class priors here: there are no labels):
$\mu = \frac{1}{N}\sum_i x_i$,   $\Sigma = \frac{1}{N}\sum_i (x_i - \mu)(x_i - \mu)^\top$
2. Sphere the data (zero mean, unit variance)
3. Compute the (top L) eigenvectors $V$ from the sphered data, via the eigendecomposition $V \tilde{D} V^\top$ of its covariance
4. Project the data onto those eigenvectors
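A minimal numpy sketch of these steps (the function name is illustrative, and as a simplification I only center the data rather than also scaling each feature to unit variance):

```python
import numpy as np

def pca(X, L):
    """Project N x D data onto its top-L principal components."""
    mu = X.mean(axis=0)                   # 1. mean
    Xc = X - mu                           # 2. center the data
    Sigma = (Xc.T @ Xc) / X.shape[0]      #    covariance (1/N convention, as on the slide)
    eigvals, V = np.linalg.eigh(Sigma)    # 3. eigendecomposition Sigma = V D~ V^T
    order = np.argsort(eigvals)[::-1]     #    sort components by decreasing variance
    V_L = V[:, order[:L]]                 #    top-L eigenvectors (D x L)
    Z = Xc @ V_L                          # 4. project the data (N x L)
    return Z, V_L, mu
```

In practice, scikit-learn's sklearn.decomposition.PCA implements the same idea and is usually what you would reach for.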

2D Example of PCA
variables X1 and X2 have positive covariance & each has a similar variance
[scatter plot of Variable X1 vs. Variable X2 omitted]
Courtesy Antano Žilinsko

Configuration is Centered
subtract the component-wise mean
[centered scatter plot of Variable X1 vs. Variable X2 omitted]
Courtesy Antano Žilinsko

Compute Principal Components
PC 1 has the highest possible variance (9.88)
PC 2 has a variance of 3.03
PC 1 and PC 2 have zero covariance
[plot with PC 1 and PC 2 axes overlaid on the centered data omitted]
Courtesy Antano Žilinsko


PC axes are a rigid rotation of the original variables
PC 1 is simultaneously the direction of maximum variance and a least-squares "line of best fit" (squared distances of points away from PC 1 are minimized)
[rotated-axes plot of Variable X1 vs. Variable X2 with PC 1 and PC 2 omitted]
Courtesy Antano Žilinsko

Generalization to p-dimensions

if we take the first k principal components, they define the k-dimensional “hyperplane of best fit” to the point cloud

of the total variance of all p variables:

PCs 1 to k represent the maximum possible proportion of that variance that can be displayed in k dimensions

Courtesy Antano Žilinsko

How many axes are needed?

does the (k+1)th principal axis represent more variance than would be expected by chance?

a common “rule of thumb” when PCA is based on correlations is that axes with eigenvalues > 1 are worth interpreting

Courtesy Antano Žilinsko
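One way to operationalize these heuristics (a sketch; choose_num_axes is an illustrative name, and for the eigenvalue-greater-than-1 rule the eigenvalues are assumed to come from a correlation-based PCA as the slide notes):

```python
import numpy as np

def choose_num_axes(eigvals, var_threshold=0.9):
    """Heuristics for how many principal axes to keep."""
    eigvals = np.sort(eigvals)[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k_variance = int(np.searchsorted(explained, var_threshold) + 1)  # smallest k reaching var_threshold
    k_kaiser = int(np.sum(eigvals > 1.0))   # "eigenvalue > 1" rule of thumb (correlation-based PCA)
    return k_variance, k_kaiser
```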

PCA as Reconstruction Error

$\min_U \|X - Z U^\top\|^2$, with $Z = XU$   ($X$ is $N \times D$, $U$ is $D \times L$)

$= \min_U \|X - X U U^\top\|^2$

$= \min_U \|X\|^2 - 2\,\mathrm{tr}(U^\top X^\top X U) + \mathrm{tr}(U^\top X^\top X U)$   (using $U^\top U = I$)

$= \min_U C - \|XU\|^2$

maximizing variance ↔ minimizing reconstruction error
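A quick numerical check of this identity (a sketch on random toy data; here U is an orthonormal D x L matrix of the top-L eigenvectors of X^T X):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                             # centered data
L = 2

eigvals, V = np.linalg.eigh(X.T @ X)
U = V[:, np.argsort(eigvals)[::-1][:L]]         # D x L, top-L eigenvectors

recon_error = np.linalg.norm(X - X @ U @ U.T) ** 2
projected_var = np.linalg.norm(X @ U) ** 2
# ||X - X U U^T||^2 = ||X||^2 - ||X U||^2: minimizing one maximizes the other
assert np.isclose(recon_error, np.linalg.norm(X) ** 2 - projected_var)
```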

Slides Credit

https://www.mii.lt/zilinskas/uploads/visualization/lectures/lect4/lect4_pca/PCA1.ppt
