
Page 1:

Bayesian Discriminant Analysis

• This supervised learning technique uses Bayes’ rule but is different in philosophy from the well-known work of Aitken, Taroni, et al.

• Bayes’ rule:

  Pr(G_i | x) = Pr(x | G_i) Pr(G_i) / Pr(x)

• Pr is probability.

• The equation means: “How does the probability of an item being a member of group G_i change, given evidence x?”

• Pr(G_i) is the prior probability. This can be a problem!
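A minimal numeric sketch of the rule above in R; the priors and likelihoods are made-up toy values, not from the slides:

prior <- c(G1 = 0.5, G2 = 0.5)              # Pr(G_i): prior probability of each group
lik   <- c(G1 = 0.8, G2 = 0.2)              # Pr(x | G_i): likelihood of evidence x
posterior <- lik * prior / sum(lik * prior) # Pr(G_i | x); the denominator is Pr(x)
posterior                                   # G1 = 0.8, G2 = 0.2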

Page 2:

Bayesian Discriminant Analysis

• Bayes’ rule can be turned into a classification rule:

  Pr(x | G_1) Pr(G_1) > Pr(x | G_2) Pr(G_2)  =>  Choose group 1

• If the priors are both 0.5, the decision boundaries are where the likelihood curves cross.
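A small R sketch of this rule for two 1D Gaussian groups; the means, standard deviations, and priors are illustrative assumptions:

mu <- c(0, 3); sigma <- c(1, 1); prior <- c(0.5, 0.5)  # toy group parameters
classify <- function(x) {
  post <- prior * dnorm(x, mean = mu, sd = sigma)      # Pr(G_i) * Pr(x | G_i)
  which.max(post)                                      # choose the larger product
}
classify(0.5)   # group 1
classify(2.9)   # group 2; with equal priors the boundary sits at x = 1.5,
                # exactly where the two likelihood curves cross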

Page 3:

Bayes-Gaussian Discriminant Analysis

• If the data is multivariate normal and the groups share the same covariance structure, the decision rule becomes:

  delta_1(x) > delta_2(x)  =>  Choose group 1

  with the “distance” defined as:

  delta_i(x) = mu_i^T Sigma^-1 x - (1/2) mu_i^T Sigma^-1 mu_i + log Pr(G_i)

  and Sigma the pooled covariance matrix (like an average of the group covariance matrices):

  Sigma = [ (n_1 - 1) S_1 + (n_2 - 1) S_2 + ... ] / (n - #groups)

• Note that if the data is just 1D this is just an equation for a line, delta_i(x) = a x + b: the slope is a = mu_i / sigma^2 and the intercept is b = -mu_i^2 / (2 sigma^2) + log Pr(G_i).
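A sketch of this linear score in R, computing the pooled covariance and delta_i(x) directly; iris is used as stand-in data (an assumption, not the slides’ data set):

X <- as.matrix(iris[, 1:4]); g <- iris$Species
mus <- lapply(split(as.data.frame(X), g), colMeans)   # group mean vectors mu_i
ns <- table(g); n <- nrow(X); k <- nlevels(g)
# Pooled ("like an average") covariance matrix:
Sp <- Reduce(`+`, lapply(levels(g), function(l)
        (ns[l] - 1) * cov(X[g == l, ]))) / (n - k)
Spi <- solve(Sp)
delta <- function(x, mu, prior)                       # the linear score delta_i(x)
  drop(t(mu) %*% Spi %*% x) - 0.5 * drop(t(mu) %*% Spi %*% mu) + log(prior)
scores <- sapply(levels(g), function(l) delta(X[1, ], mus[[l]], ns[l] / n))
names(which.max(scores))                              # predicted group for observation 1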

Page 4:

Bayes-Gaussian Discriminant Analysis

• If the data is multivariate normal but the groups have different covariance structures, the decision rule is the same but the “decision distance” becomes:

  delta_i(x) = -(1/2) (x - mu_i)^T Sigma_i^-1 (x - mu_i) - (1/2) log |Sigma_i| + log Pr(G_i)

• Note that if the data is just 1D this is an equation for a parabola, delta_i(x) = a x^2 + b x + c: the new quadratic term a x^2 comes from the group-specific variance, with a = -1 / (2 sigma_i^2).
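The same kind of sketch for the quadratic score, now with a per-group covariance matrix (iris again as stand-in data):

X <- as.matrix(iris[, 1:4]); g <- iris$Species
n <- nrow(X); ns <- table(g)
qscore <- function(x, lev) {                 # quadratic score delta_i(x)
  Xi <- X[g == lev, ]
  mu <- colMeans(Xi); Si <- cov(Xi)          # group-specific mean and covariance
  d  <- x - mu
  -0.5 * log(det(Si)) - 0.5 * drop(t(d) %*% solve(Si) %*% d) + log(ns[lev] / n)
}
scores <- sapply(levels(g), function(l) qscore(X[1, ], l))
names(which.max(scores))                     # the (x - mu)^T Sigma_i^-1 (x - mu) term
                                             # is what makes the 1D boundary a parabola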

Page 5:

Bayes-Gaussian Discriminant Analysis

• The “quadratic” version is always called quadratic discriminant analysis, QDA.

• The “linear” version is called by a number of names!

  • Linear discriminant analysis, LDA

  • Some combination of the above with the words Gaussian or classification

• A number of techniques use the name LDA! It is important to specify the equations used to tell the difference!

Page 6:

Bayes-Gaussian Discriminant Analysis

• Groups have similar covariance structure: the linear discriminant rule should work well.

• Groups have different covariance structure: the quadratic discriminant rule may work better.

Page 7:

Canonical Variate Analysis

• This supervised technique is called Linear Discriminant Analysis (LDA) in R.

  • Also called Fisher linear discriminant analysis.

• CVA is closely related to linear Bayes-Gaussian discriminant analysis.

• Works on a principle similar to PCA: look for “interesting” directions in data space.

  • CVA: find directions in space which best separate the groups.

  • Technically: find directions which maximize the ratio of between-group to within-group variation.
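A short sketch of the R function the first bullet refers to, MASS::lda, run on iris as stand-in data:

library(MASS)
fit <- lda(Species ~ ., data = iris)   # finds directions that best separate the groups
fit$scaling                            # the CV loadings (discriminant coefficients)
head(predict(fit)$x)                   # CV scores; note there are only
                                       # min(#groups - 1, p) of them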

Page 8:

Canonical Variate Analysis

• Project on PC1: not necessarily good group separation!

• Project on CV1: good group separation!

• Note: there are #groups - 1 or p CVs, whichever is smaller.

Page 9:

Canonical Variate Analysis

• Use the between-group to within-group covariance matrix, W^-1 B, to find the directions of best group separation (the CVA loadings, Acv):

  W^-1 B Acv = Acv Lambda    (i.e. the CVA loadings are the eigenvectors of W^-1 B)

• CVA can be used for dimension reduction.

• Caution! These “dimensions” are not at right angles (i.e. not orthogonal).

  • CVA plots can thus be distorted from reality.

  • Always check loading angles!

• Caution! CVA will not work well with very correlated data.
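A sketch of that eigen-decomposition in plain R (iris as stand-in data), building W and B from the group scatter matrices:

X <- as.matrix(iris[, 1:4]); g <- iris$Species
p <- ncol(X); xbar <- colMeans(X)
W <- matrix(0, p, p); B <- matrix(0, p, p)
for (l in levels(g)) {
  Xi <- X[g == l, ]; mi <- colMeans(Xi)
  W <- W + crossprod(sweep(Xi, 2, mi))       # within-group scatter
  B <- B + nrow(Xi) * tcrossprod(mi - xbar)  # between-group scatter
}
e <- eigen(solve(W) %*% B)    # W^-1 B is not symmetric, so eigen() may return
Acv <- Re(e$vectors[, 1:2])   # tiny imaginary parts; drop them with Re()
Acv                           # the CVA loadings (first two directions)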

Page 10:

Canonical Variate Analysis

[Figure: 2D CVA of gasoline data set vs. 2D PCA of gasoline data set]

Page 11:

Canonical Variate Analysis

• Distance metric used in CVA to assign the group i.d. of an unknown data point (one common choice is sketched below):

• If the data is Gaussian and the group covariance structures are the same, then CVA classification is the same as Bayes-Gaussian classification.
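A sketch assuming the metric is the squared Euclidean distance to each group’s mean in CV-score space (a common choice; the slide’s exact formula may differ):

library(MASS)
fit <- lda(Species ~ ., data = iris)              # CV loadings live in fit$scaling
Z <- as.matrix(iris[, 1:4]) %*% fit$scaling       # project all observations onto the CVs
zbar <- apply(Z, 2, function(z) tapply(z, iris$Species, mean))  # group means in CV space
d2 <- function(z) rowSums(sweep(zbar, 2, z)^2)    # squared distance to each group mean
rownames(zbar)[which.min(d2(Z[1, ]))]             # assigned group for observation 1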

Page 12:

Partial Least Squares Discriminant Analysis

• PLS-DA is a supervised discrimination technique, very popular in chemometrics.

  • Works well with highly correlated variables (like in spectroscopy).

  • Lots of correlation causes CVA to fail!

• Group labels are coded into a “response matrix” Y (a sketch follows below).

• PLS searches for directions of maximum covariance in X and Y.

• Loadings for X can be used like PCA loadings:

  • Dimension reduction

  • Loading plots
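A minimal sketch of that response-matrix coding in R; the labels here are toy values:

g <- factor(c("A", "A", "B", "C"))   # toy group labels
Y <- model.matrix(~ g - 1)           # one indicator column per group: 1 = member, 0 = not
colnames(Y) <- levels(g)
Y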

Page 13:

Partial Least Squares Discriminant Analysis

• PLS-DA theory: find an (approximate) linear relationship between the experimental (explanatory) variables and the group labels (response variables):

  Y = X B + E        (X: experimental variables; Y: group labels; E: “error” or “residuals” matrix)

  X = T P^T + E_X    (T: PLS X-scores; P: PLS X-loadings)

  Y = U Q^T + E_Y    (U: PLS Y-scores; Q: PLS Y-loadings)

• So, substituting: U Q^T = T P^T B + E

• Use these “Y-scores” with a “soft-max” or “Bayes” rule to pick the “most-likely” group label.

Page 14:

Partial Least Squares Discriminant Analysis

• How do we solve this for T, P, and U?

• Objective: maximize the covariance between the X and Y scores, T and U.

• Various procedures do this:

  • Kernel-PLS

  • SIMPLS

  • NIPALS

  They give close, but slightly different, numerical results.

• In R, the functions are:

  • plsr (pls package)

  • spls (spls package)

  • Easiest: plsda (caret package); a sketch follows below.
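A sketch of that easiest route, caret’s plsda(), on iris as stand-in data; probMethod selects the soft-max or Bayes interpretation of the Y-scores:

library(caret)
fit <- plsda(iris[, 1:4], iris$Species, ncomp = 2,
             probMethod = "softmax")   # or probMethod = "Bayes"
pred <- predict(fit, iris[, 1:4])      # most-likely group label per observation
table(pred, iris$Species)              # confusion matrix on the training data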

Page 15:

Partial Least Squares Discriminant Analysis

[Figure: 2D PLS of gasoline data set vs. 2D PCA of gasoline data set]

Page 16:

Partial Least Squares Discriminant Analysis

• Group assignments of observation vectors are made by interpreting the Y-scores.

  • Typically the “soft-max” function is used (a sketch follows below).

[Figure: Y-scores plotted against the observation vectors]
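A minimal sketch of that soft-max step on one made-up row of Y-scores:

softmax <- function(y) exp(y) / sum(exp(y))   # turn scores into probabilities
yscores <- c(G1 = 2.1, G2 = 0.3, G3 = -1.0)   # toy Y-scores for one observation
p <- softmax(yscores)
names(which.max(p))                           # "G1", the most-likely group label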