


Page 1: Linear Methods for Classification (seem5470/lecture/Linear...)

Linear Methods for Classification

1

Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

Page 2:

Introduction

• Suppose there are $K$ classes, labeled $1, 2, \ldots, K$
• A class of methods model a discriminant function $\delta_k(x)$ for each class. Then, classify $x$ to the class with the largest value for its discriminant function
• The decision boundary between class $k$ and class $\ell$ is the set of points for which $\delta_k(x) = \delta_\ell(x)$

2

Page 3:

Linear Discriminant Analysis

• Suppose $f_k(x)$ is the class-conditional density of $X$ in class $G = k$, i.e., $f_k(x) = \Pr(X = x \mid G = k)$

• Let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$

• A simple application of Bayes' theorem gives:

$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}$$

3
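The Bayes computation above is easy to check numerically: evaluate each class density at $x$, weight by the prior, and normalize. A minimal sketch with two made-up one-dimensional Gaussian classes (the means, scales, and priors below are illustrative, not from the slides):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posteriors(x, mus, sigmas, priors):
    """Pr(G=k | X=x) = f_k(x) * pi_k / sum_l f_l(x) * pi_l."""
    joint = [gaussian_pdf(x, m, s) * p for m, s, p in zip(mus, sigmas, priors)]
    total = sum(joint)
    return [j / total for j in joint]

# Two hypothetical classes: N(0,1) with prior 0.7 and N(2,1) with prior 0.3.
# At x = 1 both densities are equal, so the posterior equals the prior.
post = posteriors(1.0, [0.0, 2.0], [1.0, 1.0], [0.7, 0.3])
```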

Page 4:

Linear Discriminant Analysis

• Recall: $\Pr(G=k \mid X=x) = f_k(x)\pi_k \big/ \sum_{\ell=1}^{K} f_\ell(x)\pi_\ell$
• Many techniques are based on models for the class densities:
• linear discriminant analysis (also quadratic discriminant analysis)
• mixtures of Gaussians
• nonparametric density estimates
• naive Bayes models

4

Page 5:

Linear Discriminant Analysis

• Suppose that we model each class density as multivariate Gaussian:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right)$$

• Linear Discriminant Analysis (LDA) arises in the special case when the classes have a common covariance matrix $\Sigma_k = \Sigma$ for all $k$

5

Page 6:

Linear Discriminant Analysis
• Comparing two classes $k$ and $\ell$, it is sufficient to look at the log-ratio:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=\ell\mid X=x)} = \log\frac{f_k(x)}{f_\ell(x)} + \log\frac{\pi_k}{\pi_\ell} = \log\frac{\pi_k}{\pi_\ell} - \frac{1}{2}(\mu_k+\mu_\ell)^T\Sigma^{-1}(\mu_k-\mu_\ell) + x^T\Sigma^{-1}(\mu_k-\mu_\ell)$$

(Note that the quadratic term $x^T\Sigma^{-1}x$ cancels out, and $\Sigma_k = \Sigma_\ell = \Sigma$)

• One important outcome is that the log-ratio is an equation linear in $x$

6

Page 7:

Linear Discriminant Analysis

• The linear log-odds function implies that the decision boundary between classes $k$ and $\ell$, the set where $\Pr(G=k\mid X=x) = \Pr(G=\ell\mid X=x)$, is linear in $x$; in $p$ dimensions, a hyperplane
• This is true for any pair of classes, so all decision boundaries are linear
• They divide $\mathbb{R}^p$ into regions classified as class 1, class 2, etc., with the regions separated by hyperplanes

7

Page 8:

Linear Discriminant Analysis
• Regions separated by hyperplanes
• An idealized example with three Gaussian classes sharing a common covariance matrix $\Sigma$

8

Page 9:

Linear Discriminant Analysis
• From the previous equation comparing two classes, without loss of generality, we can select one of the classes as the reference class $K$
• Then, the linear discriminant function for class $k$ is:

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$

• An equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$

9

Page 10:

Linear Discriminant Analysis

• In practice we do not know the parameters of the Gaussian distributions
• Estimate them using training data:
• $\hat\pi_k = N_k/N$, where $N_k$ is the number of class-$k$ observations
• $\hat\mu_k = \sum_{g_i=k} x_i / N_k$
• $\hat\Sigma = \sum_{k=1}^{K}\sum_{g_i=k} (x_i-\hat\mu_k)(x_i-\hat\mu_k)^T / (N-K)$

10
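The plug-in estimates above, together with the discriminant function $\delta_k(x)$, can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic two-class data, not reference code from the book:

```python
import numpy as np

def fit_lda(X, y, K):
    """Plug-in LDA estimates: priors pi_hat, class means mu_hat, pooled covariance."""
    N, p = X.shape
    priors = np.array([(y == k).mean() for k in range(K)])        # pi_hat = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in range(K)])  # mu_hat_k
    # Pooled covariance: within-class scatter summed over classes, divided by N - K
    Sigma = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k])
                for k in range(K)) / (N - K)
    return priors, means, Sigma

def lda_scores(x, priors, means, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
                     for m, pi in zip(means, priors)])

# Synthetic data: two well-separated Gaussian classes with unit covariance
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([3.0, 3.0], 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
priors, means, Sigma = fit_lda(X, y, 2)
```

Classification then picks the class with the largest score, `np.argmax(lda_scores(x, ...))`.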

Page 11:

Linear Discriminant Analysis
Example

11

Page 12:

Quadratic Discriminant Analysis

• If the $\Sigma_k$ are not assumed to be equal, the convenient cancellations do not occur
• The pieces quadratic in $x$ remain
• Quadratic discriminant functions (QDA):

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k$$

• The decision boundary between each pair of classes $k$ and $\ell$ is described by a quadratic equation $\{x : \delta_k(x) = \delta_\ell(x)\}$

12

Page 13:

Quadratic Discriminant Analysis
Example
• An example where the three classes are Gaussian mixtures, and the decision boundaries are approximated by quadratic equations in $x$

13

Page 14:

Linear / Quadratic Discriminant Analysis
• The estimates for QDA are similar to those for LDA, except that separate covariance matrices must be estimated for each class
• When $p$ is large, this means a dramatic increase in the number of parameters
• For LDA:
• there are $(K-1)(p+1)$ parameters, since we only need the differences $\delta_k(x) - \delta_K(x)$ between the discriminant functions, where $K$ is some pre-chosen class (here we have chosen the last)
• each difference requires $p+1$ parameters

14

Page 15:

Linear / Quadratic Discriminant Analysis
• For QDA:
• there are $(K-1)\{p(p+3)/2 + 1\}$ parameters
• In the STATLOG project, LDA was among the top three classifiers for 7 of the 22 datasets, QDA was among the top three for four datasets, and one of the pair was in the top three for 10 datasets
• Both techniques are widely used

15

Page 16:

Regularized Discriminant Analysis
• A compromise between LDA and QDA allows one to shrink the separate covariances of QDA toward a common covariance, as in LDA
• The regularized covariance matrices have the form:

$$\hat\Sigma_k(\alpha) = \alpha\,\hat\Sigma_k + (1-\alpha)\,\hat\Sigma, \qquad \alpha\in[0,1]$$

where $\hat\Sigma$ is the pooled covariance matrix as used in LDA
• $\alpha$ allows a continuum of models between LDA and QDA, and needs to be specified
• $\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation

16
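The shrinkage above is a one-line convex combination of covariance estimates. A sketch with hypothetical covariance values (checking the two endpoints: $\alpha=1$ recovers QDA's per-class matrices, $\alpha=0$ recovers LDA's pooled matrix):

```python
import numpy as np

def regularized_covariances(Sigmas, Sigma_pooled, alpha):
    """Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled, alpha in [0, 1]."""
    return [alpha * S + (1.0 - alpha) * Sigma_pooled for S in Sigmas]

# Hypothetical per-class covariance estimates and their pooled version
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[0.5, 0.0], [0.0, 0.8]])
S_pool = 0.5 * (S1 + S2)

qda_like = regularized_covariances([S1, S2], S_pool, alpha=1.0)  # pure QDA
lda_like = regularized_covariances([S1, S2], S_pool, alpha=0.0)  # pure LDA
```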

Page 17:

Regularized Discriminant Analysis
• The results of RDA applied to the vowel data
• Both training and test error improve with increasing $\alpha$
• Test error increases sharply after $\alpha \approx 0.9$
• The large discrepancy between the training and test error is partly due to many repeat measurements on a small number of individuals, different in the training and test sets

17

Page 18:

Logistic Regression
• We wish to model the posterior probabilities of the $K$ classes via linear functions in $x$ (a $p$-dimensional vector)
• while ensuring they sum to one and remain in $[0,1]$
• Model:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \beta_{k0} + \beta_k^T x, \qquad k = 1, \ldots, K-1$$

18

Page 19:

Logistic Regression
• The model is specified in terms of $K-1$ log-odds or logit transformations
• The choice of denominator is arbitrary; the estimates are equivariant under this choice

$$\Pr(G=k\mid X=x) = \frac{\exp(\beta_{k0}+\beta_k^T x)}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_\ell^T x)}, \qquad k=1,\ldots,K-1$$

$$\Pr(G=K\mid X=x) = \frac{1}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_\ell^T x)}$$

(the probabilities sum to 1)

19

Page 20:

Logistic Regression
Two-class Classification

• For two-class classification, we can label the two classes as 0 and 1.
• Treating class 1 as the concept of interest, the posterior probability can be regarded as the class-membership probability:

$$\Pr(Y=1\mid X=x) = \frac{\exp(\beta_0+\beta^T x)}{1+\exp(\beta_0+\beta^T x)}$$

• As a result, this logistic function maps $x$ in $p$-dimensional space to a value in $[0,1]$

20
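The two-class posterior above is just the logistic (sigmoid) function applied to a linear score. A small sketch (the coefficient values below are hypothetical, chosen only to illustrate the mapping):

```python
import math

def sigmoid(t):
    """Logistic function: maps any real t to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def prob_class1(x, beta0, beta):
    """Pr(Y=1 | x) = exp(beta0 + beta^T x) / (1 + exp(beta0 + beta^T x))."""
    return sigmoid(beta0 + sum(b * xi for b, xi in zip(beta, x)))

# Hypothetical coefficients: the linear score here is -1.0 + 0.5*1 + 0.25*2 = 0,
# so the predicted probability is exactly 0.5
p = prob_class1([1.0, 2.0], beta0=-1.0, beta=[0.5, 0.25])
```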

Page 21:

Logistic Regression
Two-Class Case and Shape of the Sigmoid Curve

• Consider 1-dimensional $x$:

$$\Pr(Y=1\mid x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$$

(figure: sigmoid curve of $\Pr(Y=1\mid x)$)

21

Page 22:

Logistic Regression
An Example in One Dimension

• We wish to predict death from the baseline APACHE II score of patients.
• Let $\Pr(x)$ be the probability that a patient with score $x$ will die.
• Note that linear regression would not work well, since it could produce probabilities less than 0 or greater than 1

22

Page 23:

Logistic Regression
An Example in One Dimension

• Data with a sharp cutoff between patients who live and patients who die will lead to a large value of $\beta_1$

23

Page 24:

Logistic Regression
An Example in One Dimension

• On the other hand, if the data has a lengthy transition from survival to death, it will lead to a low value of $\beta_1$

24

Page 25:

Logistic Regression
Model Fitting for the General Case (K classes, p dimensions)

• Logistic regression models are fit by maximum likelihood, using the conditional likelihood of $G$ given $X$
• $\Pr(G\mid X)$ completely specifies the conditional distribution, so the multinomial distribution is appropriate

25

Page 26:

Logistic Regression
Model Fitting for the General Case (K classes, p dimensions)

• Let the entire parameter set be $\theta = \{\beta_{10}, \beta_1^T, \ldots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, and write $\Pr(G=k\mid X=x) = p_k(x;\theta)$
• The log-likelihood for $N$ observations of input data and class labels is:

$$\ell(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i;\theta)$$

where $p_{g_i}(x_i;\theta) = \Pr(G=g_i \mid X=x_i;\theta)$

26

Page 27:

Logistic Regression
Model Fitting for the Two-class Case

• It is convenient to code the two-class $g_i$ via a 0/1 response $y_i$, where $y_i=1$ when $g_i=1$, and $y_i=0$ when $g_i=2$
• Let $p(x;\beta) = p_1(x;\beta)$, so that $p_2(x;\beta) = 1 - p(x;\beta)$
• Log-likelihood:

$$\ell(\beta) = \sum_{i=1}^{N}\left\{y_i\log p(x_i;\beta) + (1-y_i)\log\bigl(1-p(x_i;\beta)\bigr)\right\} = \sum_{i=1}^{N}\left\{y_i\,\beta^T x_i - \log\bigl(1+e^{\beta^T x_i}\bigr)\right\}$$

27

Page 28:

Logistic Regression
Model Fitting

• Here $\beta = \{\beta_{10}, \beta_1\}$
• Assume the vector of inputs $x_i$ includes the constant term 1 to accommodate the intercept
• To maximize the log-likelihood, set its derivatives to zero
• Score equations:

$$\frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\bigl(y_i - p(x_i;\beta)\bigr) = 0$$

This involves $p+1$ equations nonlinear in $\beta$

28

Page 29:

Logistic Regression
Newton's Method for Optimization

• Consider a function $f$ of one scalar variable $x$. The second-order Taylor expansion around $x$ is:

$$f(x+\Delta) \approx f(x) + f'(x)\,\Delta + \frac{1}{2}f''(x)\,\Delta^2$$

• We want to find the global minimum $x^*$
• Near the minimum we could make a Taylor expansion:

$$f(x) \approx f(x^*) + \frac{1}{2}f''(x^*)(x-x^*)^2$$

• Newton's method uses this fact, and minimizes a quadratic approximation to the function.

29

Page 30:

Logistic Regression
Newton's Method for Optimization

• Guess an initial point $x_0$. We can take a second-order Taylor expansion around $x_0$ and it will still be accurate:

$$f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2$$

• Take the derivative with respect to $x$ and set it equal to 0:

$$f'(x_0) + f''(x_0)(x-x_0) = 0$$

$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}$$

30

Page 31:

Logistic Regression
Newton's Method for Optimization

• We take the derivative with respect to $x$, set it equal to zero, and call the solution $x_1$
• We can iterate this procedure, minimizing one approximation and then using that to get a new approximation:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$

31
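The iteration above is easy to express in code. A sketch that minimizes a simple quadratic whose first and second derivatives are supplied by hand; on a quadratic, Newton's method lands on the minimizer in a single step:

```python
def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=100):
    """Newton's method for 1-D minimization: x_{n+1} = x_n - f'(x_n) / f''(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:   # stop when the update is negligible
            break
    return x

# Minimize f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2
x_star = newton_minimize(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
```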

Page 32:

Logistic Regression
Model Fitting

• Returning to the model fitting for the two-class case, the log-likelihood is:

$$\ell(\beta) = \sum_{i=1}^{N}\left\{y_i\,\beta^T x_i - \log\bigl(1+e^{\beta^T x_i}\bigr)\right\}$$

• Starting with $\beta^{old}$, a single Newton update is:

$$\beta^{new} = \beta^{old} - \left(\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\frac{\partial\ell(\beta)}{\partial\beta}$$

where the derivatives are evaluated at $\beta^{old}$

32

Page 33:

Logistic Regression
Model Fitting

• Let
• $\mathbf{y}$ denote the vector of $y_i$ values
• $\mathbf{X}$ the $N\times(p+1)$ matrix of $x_i$ values
• $\mathbf{p}$ the vector of fitted probabilities with $i$th element $p(x_i;\beta^{old})$
• $\mathbf{W}$ an $N\times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i;\beta^{old})\bigl(1-p(x_i;\beta^{old})\bigr)$

33

Page 34:

Logistic Regression
Model Fitting

• In matrix notation:

$$\frac{\partial\ell(\beta)}{\partial\beta} = \mathbf{X}^T(\mathbf{y}-\mathbf{p}), \qquad \frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -\mathbf{X}^T\mathbf{W}\mathbf{X}$$

• Newton step:

$$\beta^{new} = \beta^{old} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p})$$

34
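Putting the matrix form together gives a short Newton fitting loop. A minimal NumPy sketch on synthetic data; the true coefficients used to generate the data (intercept 0.5, slope 2.0) are made up for the demo:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Newton updates beta_new = beta_old + (X^T W X)^{-1} X^T (y - p).
    X is assumed to already include a leading column of 1s for the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = np.diag(p * (1.0 - p))            # diagonal weight matrix
        beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
    return beta

# Synthetic two-class data generated from a logistic model
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = (rng.random(100) < 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))).astype(float)
X = np.column_stack([np.ones(100), x])
beta_hat = fit_logistic_newton(X, y)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse of the Hessian.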

Page 35:

Logistic Regression
Example

• A subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa
• Aim: establish the intensity of ischemic heart disease risk factors in that high-incidence region
• The response variable is the presence or absence of myocardial infarction (MI) at the time of the survey
• There are 160 cases in the data set, and a sample of 302 controls

35

Page 36:

Logistic Regression
Example

36

Page 37:

Logistic Regression
Example

• Fit a logistic regression model by maximum likelihood, giving the results shown on the next slide
• z scores for each coefficient in the model (coefficients divided by their standard errors)

37

Page 38:

Logistic Regression
Example

• Results from a logistic regression fit to the South African heart disease data:

38

Term         Coefficient   Std. Error   Z Score
(Intercept)  -4.130        0.964        -4.285
sbp           0.006        0.006         1.023
tobacco       0.080        0.026         3.034
ldl           0.185        0.057         3.219
famhist       0.939        0.225         4.178
obesity      -0.035        0.029        -1.187
alcohol       0.001        0.004         0.136
age           0.043        0.010         4.184

Page 39:

Logistic Regression
Example

• z scores greater than approximately 2 in absolute value are significant at the 5% level
• There are some surprises in the table of coefficients:
• sbp and obesity appear to be not significant
• on their own, both sbp and obesity are significant, with positive sign
• in the presence of many other correlated variables, they are no longer needed (and can even get a negative sign)

39

Page 40:

L1 Regularized Logistic Regression

• The $L_1$ penalty used in the Lasso can be used for variable selection and shrinkage with any linear regression model.
• For logistic regression, we would maximize a penalized log-likelihood:

$$\max_{\beta_0,\,\beta}\; \sum_{i=1}^{N}\Bigl[y_i(\beta_0+\beta^T x_i) - \log\bigl(1+e^{\beta_0+\beta^T x_i}\bigr)\Bigr] - \lambda\sum_{j=1}^{p}|\beta_j|$$

• As with the Lasso, we typically do not penalize the intercept term, and we standardize the predictors

40
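One simple (if not the fastest) way to handle the non-smooth penalty is proximal gradient ascent: take a gradient step on the log-likelihood, then soft-threshold every coefficient except the intercept. This is a sketch under those assumptions; production solvers typically use coordinate descent instead, and the synthetic data below (only one truly relevant predictor) is made up for the demo:

```python
import numpy as np

def soft_threshold(v, t):
    """Lasso proximal operator: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Proximal-gradient ascent on the L1-penalized log-likelihood.
    Column 0 of X is the intercept and is not penalized."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta = beta + step * X.T @ (y - p) / len(y)      # gradient step (mean log-lik)
        beta[1:] = soft_threshold(beta[1:], step * lam)  # shrink all but the intercept
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
# Only the first predictor actually matters in this synthetic example
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-3.0 * X[:, 1]))).astype(float)
beta_hat = l1_logistic(X, y, lam=0.2)
```

With a sufficiently large penalty, the coefficients of the irrelevant predictors are driven to (essentially) zero while the informative one survives.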

Page 41:

Logistic Regression or LDA?

• For Linear Discriminant Analysis, the log-posterior odds between class $k$ and $K$ are linear functions of $x$:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \log\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K) + x^T\Sigma^{-1}(\mu_k-\mu_K) = \alpha_{k0} + \alpha_k^T x$$

41

Page 42:

Logistic Regression or LDA?

• The linear logistic regression model by construction has linear logits:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \beta_{k0} + \beta_k^T x$$

• It seems that the linear discriminant analysis model and the linear logistic model are the same
• Although they have exactly the same form, different linear coefficients are estimated
• Logistic regression is more general, because it makes fewer assumptions

42

Page 43:

Logistic Regression or LDA?

• Joint density of $X$ and $G$:

$$\Pr(X, G=k) = \Pr(X)\,\Pr(G=k\mid X)$$

where $\Pr(X)$ denotes the marginal density of the inputs $X$
• For both LDA and logistic regression, the second term on the right has the logit-linear form:

$$\Pr(G=k\mid X=x) = \frac{e^{\beta_{k0}+\beta_k^T x}}{1+\sum_{\ell=1}^{K-1} e^{\beta_{\ell 0}+\beta_\ell^T x}}$$

43

Page 44:

Logistic Regression or LDA?
• The logistic regression model leaves the marginal density of $X$ as an arbitrary density function $\Pr(X)$
• It fits the parameters of $\Pr(G\mid X)$ by maximizing the conditional likelihood, the multinomial likelihood with probabilities $p_k(x;\theta)$
• Although $\Pr(X)$ is ignored, we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion
• using the empirical distribution function, which places mass $1/N$ at each observation

44

Page 45:

Logistic Regression or LDA?
• LDA fits the parameters by maximizing the full log-likelihood, based on the joint density:

$$\Pr(X=x_i, G=g_i) = \phi(x_i;\mu_{g_i},\Sigma)\,\pi_{g_i}$$

where $\phi$ is the Gaussian density function
• Standard normal theory leads easily to the estimates $\hat\mu_k$, $\hat\Sigma$, and $\hat\pi_k$
• The marginal density $\Pr(X)$ does play a role. It is a mixture density:

$$\Pr(X=x) = \sum_{k=1}^{K}\pi_k\,\phi(x;\mu_k,\Sigma)$$

which also involves the parameters.

Page 46:

Logistic Regression or LDA?

• By relying on additional model assumptions, LDA has more information about the parameters, and can estimate them more efficiently (lower variance)
• If in fact the true $f_k(x)$ are Gaussian:
• in the worst case, ignoring this marginal part of the likelihood constitutes a loss of efficiency of about 30% asymptotically in the error rate
• with 30% more data, the conditional likelihood will do as well.

Page 47:

Logistic Regression or LDA?

• For LDA, observations far from the decision boundary (which are down-weighted by logistic regression) play a role in estimating the common covariance matrix
• This is not all good news: it means that LDA is not robust to gross outliers
• The marginal likelihood can be thought of as a regularizer
• In practice, it is generally felt that logistic regression is a safer, more robust bet than LDA, relying on fewer assumptions.

47