


Page 1: Linear Methods for Classification (seem5470/lecture/Linear...)

Linear Methods for Classification

1

Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

Page 2:

Introduction

• Suppose there are $K$ classes, labeled $1, 2, \ldots, K$
• A class of methods model a discriminant function $\delta_k(x)$ for each class. Then, classify $x$ to the class with the largest value for its discriminant function
• The decision boundary between class $k$ and class $\ell$ is the set of points for which $\delta_k(x) = \delta_\ell(x)$

2

Page 3:

Linear Discriminant Analysis

• Suppose $f_k(x)$ is the class-conditional density of $X$ in class $G = k$, i.e., $f_k(x) = \Pr(X = x \mid G = k)$

• Let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$

• A simple application of Bayes' theorem gives:

$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}$$

3
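The Bayes computation above is easy to check numerically: evaluate each class density at $x$, weight by the prior, and normalize. A minimal sketch with two made-up one-dimensional Gaussian classes (the means, scales, and priors below are illustrative, not from the slides):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posteriors(x, mus, sigmas, priors):
    """Pr(G=k | X=x) = f_k(x) * pi_k / sum_l f_l(x) * pi_l."""
    joint = [gaussian_pdf(x, m, s) * p for m, s, p in zip(mus, sigmas, priors)]
    total = sum(joint)
    return [j / total for j in joint]

# Two hypothetical classes: N(0,1) with prior 0.7 and N(2,1) with prior 0.3.
# At x = 1 both densities are equal, so the posterior equals the prior.
post = posteriors(1.0, [0.0, 2.0], [1.0, 1.0], [0.7, 0.3])
```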

Page 4:

Linear Discriminant Analysis

• Recall: $\Pr(G=k \mid X=x) = f_k(x)\pi_k \big/ \sum_{\ell=1}^{K} f_\ell(x)\pi_\ell$
• Many techniques are based on models for the class densities:
• linear discriminant analysis (also quadratic discriminant analysis)
• mixtures of Gaussians
• nonparametric density estimates
• naive Bayes models

4

Page 5:

Linear Discriminant Analysis

• Suppose that we model each class density as multivariate Gaussian:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right)$$

• Linear Discriminant Analysis (LDA) arises in the special case when the classes have a common covariance matrix $\Sigma_k = \Sigma$ for all $k$

5

Page 6:

Linear Discriminant Analysis
• Comparing two classes $k$ and $\ell$, it is sufficient to look at the log-ratio:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=\ell\mid X=x)} = \log\frac{f_k(x)}{f_\ell(x)} + \log\frac{\pi_k}{\pi_\ell} = \log\frac{\pi_k}{\pi_\ell} - \frac{1}{2}(\mu_k+\mu_\ell)^T\Sigma^{-1}(\mu_k-\mu_\ell) + x^T\Sigma^{-1}(\mu_k-\mu_\ell)$$

(Note that the quadratic term $x^T\Sigma^{-1}x$ cancels out, and $\Sigma_k = \Sigma_\ell = \Sigma$)

• One important outcome is that the log-ratio is an equation linear in $x$

6

Page 7:

Linear Discriminant Analysis

• The linear log-odds function implies that the decision boundary between classes $k$ and $\ell$, the set where $\Pr(G=k\mid X=x) = \Pr(G=\ell\mid X=x)$, is linear in $x$; in $p$ dimensions, a hyperplane
• This is true for any pair of classes, so all decision boundaries are linear
• They divide $\mathbb{R}^p$ into regions classified as class 1, class 2, etc., with the regions separated by hyperplanes

7

Page 8:

Linear Discriminant Analysis
• Regions separated by hyperplanes
• An idealized example with three Gaussian classes sharing a common covariance matrix $\Sigma$

8

Page 9:

Linear Discriminant Analysis
• From the previous equation comparing two classes, without loss of generality, we can select one of the classes as the reference class $K$
• Then, the linear discriminant function for class $k$ is:

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$

• An equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$

9

Page 10:

Linear Discriminant Analysis

• In practice we do not know the parameters of the Gaussian distributions
• Estimate them using training data:
• $\hat\pi_k = N_k/N$, where $N_k$ is the number of class-$k$ observations
• $\hat\mu_k = \sum_{g_i=k} x_i / N_k$
• $\hat\Sigma = \sum_{k=1}^{K}\sum_{g_i=k} (x_i-\hat\mu_k)(x_i-\hat\mu_k)^T / (N-K)$

10
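The plug-in estimates above, together with the discriminant function $\delta_k(x)$, can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic two-class data, not reference code from the book:

```python
import numpy as np

def fit_lda(X, y, K):
    """Plug-in LDA estimates: priors pi_hat, class means mu_hat, pooled covariance."""
    N, p = X.shape
    priors = np.array([(y == k).mean() for k in range(K)])        # pi_hat = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in range(K)])  # mu_hat_k
    # Pooled covariance: within-class scatter summed over classes, divided by N - K
    Sigma = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k])
                for k in range(K)) / (N - K)
    return priors, means, Sigma

def lda_scores(x, priors, means, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
                     for m, pi in zip(means, priors)])

# Synthetic data: two well-separated Gaussian classes with unit covariance
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([3.0, 3.0], 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
priors, means, Sigma = fit_lda(X, y, 2)
```

Classification then picks the class with the largest score, `np.argmax(lda_scores(x, ...))`.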

Page 11:

Linear Discriminant Analysis
Example

11

Page 12:

Quadratic Discriminant Analysis

• If the $\Sigma_k$ are not assumed to be equal, the convenient cancellations do not occur
• The pieces quadratic in $x$ remain
• Quadratic discriminant functions (QDA):

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k$$

• The decision boundary between each pair of classes $k$ and $\ell$ is described by a quadratic equation $\{x : \delta_k(x) = \delta_\ell(x)\}$

12

Page 13:

Quadratic Discriminant Analysis
Example
• An example where the three classes are Gaussian mixtures, and the decision boundaries are approximated by quadratic equations in $x$

13

Page 14:

Linear / Quadratic Discriminant Analysis
• The estimates for QDA are similar to those for LDA, except that separate covariance matrices must be estimated for each class
• When $p$ is large, this means a dramatic increase in the number of parameters
• For LDA:
• there are $(K-1)(p+1)$ parameters, since we only need the differences $\delta_k(x) - \delta_K(x)$ between the discriminant functions, where $K$ is some pre-chosen class (here we have chosen the last)
• each difference requires $p+1$ parameters

14

Page 15:

Linear / Quadratic Discriminant Analysis
• For QDA:
• there are $(K-1)\{p(p+3)/2 + 1\}$ parameters
• In the STATLOG project, LDA was among the top three classifiers for 7 of the 22 datasets, QDA was among the top three for four datasets, and one of the pair was in the top three for 10 datasets
• Both techniques are widely used

15

Page 16:

Regularized Discriminant Analysis
• A compromise between LDA and QDA allows one to shrink the separate covariances of QDA toward a common covariance, as in LDA
• The regularized covariance matrices have the form:

$$\hat\Sigma_k(\alpha) = \alpha\,\hat\Sigma_k + (1-\alpha)\,\hat\Sigma, \qquad \alpha\in[0,1]$$

where $\hat\Sigma$ is the pooled covariance matrix as used in LDA
• $\alpha$ allows a continuum of models between LDA and QDA, and needs to be specified
• $\alpha$ can be chosen based on the performance of the model on validation data, or by cross-validation

16
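The shrinkage above is a one-line convex combination of covariance estimates. A sketch with hypothetical covariance values (checking the two endpoints: $\alpha=1$ recovers QDA's per-class matrices, $\alpha=0$ recovers LDA's pooled matrix):

```python
import numpy as np

def regularized_covariances(Sigmas, Sigma_pooled, alpha):
    """Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled, alpha in [0, 1]."""
    return [alpha * S + (1.0 - alpha) * Sigma_pooled for S in Sigmas]

# Hypothetical per-class covariance estimates and their pooled version
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[0.5, 0.0], [0.0, 0.8]])
S_pool = 0.5 * (S1 + S2)

qda_like = regularized_covariances([S1, S2], S_pool, alpha=1.0)  # pure QDA
lda_like = regularized_covariances([S1, S2], S_pool, alpha=0.0)  # pure LDA
```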

Page 17:

Regularized Discriminant Analysis
• The results of RDA applied to the vowel data
• Both training and test error improve with increasing $\alpha$
• Test error increases sharply after $\alpha \approx 0.9$
• The large discrepancy between the training and test error is partly due to many repeat measurements on a small number of individuals, different in the training and test sets

17

Page 18:

Logistic Regression
• We wish to model the posterior probabilities of the $K$ classes via linear functions in $x$ (a $p$-dimensional vector)
• while ensuring they sum to one and remain in $[0,1]$
• Model:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \beta_{k0} + \beta_k^T x, \qquad k = 1, \ldots, K-1$$

18

Page 19:

Logistic Regression
• The model is specified in terms of $K-1$ log-odds or logit transformations
• The choice of denominator is arbitrary; the estimates are equivariant under this choice

$$\Pr(G=k\mid X=x) = \frac{\exp(\beta_{k0}+\beta_k^T x)}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_\ell^T x)}, \qquad k=1,\ldots,K-1$$

$$\Pr(G=K\mid X=x) = \frac{1}{1+\sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0}+\beta_\ell^T x)}$$

(the probabilities sum to 1)

19

Page 20:

Logistic Regression
Two-class Classification

• For two-class classification, we can label the two classes as 0 and 1.
• Treating class 1 as the concept of interest, the posterior probability can be regarded as the class-membership probability:

$$\Pr(Y=1\mid X=x) = \frac{\exp(\beta_0+\beta^T x)}{1+\exp(\beta_0+\beta^T x)}$$

• As a result, this logistic function maps $x$ in $p$-dimensional space to a value in $[0,1]$

20
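The two-class posterior above is just the logistic (sigmoid) function applied to a linear score. A small sketch (the coefficient values below are hypothetical, chosen only to illustrate the mapping):

```python
import math

def sigmoid(t):
    """Logistic function: maps any real t to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def prob_class1(x, beta0, beta):
    """Pr(Y=1 | x) = exp(beta0 + beta^T x) / (1 + exp(beta0 + beta^T x))."""
    return sigmoid(beta0 + sum(b * xi for b, xi in zip(beta, x)))

# Hypothetical coefficients: the linear score here is -1.0 + 0.5*1 + 0.25*2 = 0,
# so the predicted probability is exactly 0.5
p = prob_class1([1.0, 2.0], beta0=-1.0, beta=[0.5, 0.25])
```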

Page 21:

Logistic Regression
Two-Class Case and Shape of the Sigmoid Curve

• Consider 1-dimensional $x$:

$$\Pr(Y=1\mid x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$$

(figure: sigmoid curve of $\Pr(Y=1\mid x)$)

21

Page 22:

Logistic Regression
An Example in One Dimension

• We wish to predict death from the baseline APACHE II score of patients.
• Let $\Pr(x)$ be the probability that a patient with score $x$ will die.
• Note that linear regression would not work well, since it could produce probabilities less than 0 or greater than 1

22

Page 23:

Logistic Regression
An Example in One Dimension

• Data with a sharp cutoff between patients who live and patients who die will lead to a large value of $\beta_1$

23

Page 24:

Logistic Regression
An Example in One Dimension

• On the other hand, if the data has a lengthy transition from survival to death, it will lead to a low value of $\beta_1$

24

Page 25:

Logistic Regression
Model Fitting for the General Case (K classes, p dimensions)

• Logistic regression models are fit by maximum likelihood, using the conditional likelihood of $G$ given $X$
• $\Pr(G\mid X)$ completely specifies the conditional distribution, so the multinomial distribution is appropriate

25

Page 26:

Logistic Regression
Model Fitting for the General Case (K classes, p dimensions)

• Let the entire parameter set be $\theta = \{\beta_{10}, \beta_1^T, \ldots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, and write $\Pr(G=k\mid X=x) = p_k(x;\theta)$
• The log-likelihood for $N$ observations of input data and class labels is:

$$\ell(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i;\theta)$$

where $p_{g_i}(x_i;\theta) = \Pr(G=g_i \mid X=x_i;\theta)$

26

Page 27:

Logistic Regression
Model Fitting for the Two-class Case

• It is convenient to code the two-class $g_i$ via a 0/1 response $y_i$, where $y_i=1$ when $g_i=1$, and $y_i=0$ when $g_i=2$
• Let $p(x;\beta) = p_1(x;\beta)$, so that $p_2(x;\beta) = 1 - p(x;\beta)$
• Log-likelihood:

$$\ell(\beta) = \sum_{i=1}^{N}\left\{y_i\log p(x_i;\beta) + (1-y_i)\log\bigl(1-p(x_i;\beta)\bigr)\right\} = \sum_{i=1}^{N}\left\{y_i\,\beta^T x_i - \log\bigl(1+e^{\beta^T x_i}\bigr)\right\}$$

27

Page 28:

Logistic Regression
Model Fitting

• Here $\beta = \{\beta_{10}, \beta_1\}$
• Assume the vector of inputs $x_i$ includes the constant term 1 to accommodate the intercept
• To maximize the log-likelihood, set its derivatives to zero
• Score equations:

$$\frac{\partial\ell(\beta)}{\partial\beta} = \sum_{i=1}^{N} x_i\bigl(y_i - p(x_i;\beta)\bigr) = 0$$

This involves $p+1$ equations nonlinear in $\beta$

28

Page 29:

Logistic Regression
Newton's Method for Optimization

• Consider a function $f$ of one scalar variable $x$. The second-order Taylor expansion around $x$ is:

$$f(x+\Delta) \approx f(x) + f'(x)\,\Delta + \frac{1}{2}f''(x)\,\Delta^2$$

• We want to find the global minimum $x^*$
• Near the minimum we could make a Taylor expansion:

$$f(x) \approx f(x^*) + \frac{1}{2}f''(x^*)(x-x^*)^2$$

• Newton's method uses this fact, and minimizes a quadratic approximation to the function.

29

Page 30:

Logistic Regression
Newton's Method for Optimization

• Guess an initial point $x_0$. We can take a second-order Taylor expansion around $x_0$ and it will still be accurate:

$$f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2$$

• Take the derivative with respect to $x$ and set it equal to 0:

$$f'(x_0) + f''(x_0)(x-x_0) = 0$$

$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}$$

30

Page 31:

Logistic Regression
Newton's Method for Optimization

• We take the derivative with respect to $x$, set it equal to zero, and call the solution $x_1$
• We can iterate this procedure, minimizing one approximation and then using that to get a new approximation:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$

31
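The iteration above is easy to express in code. A sketch that minimizes a simple quadratic whose first and second derivatives are supplied by hand; on a quadratic, Newton's method lands on the minimizer in a single step:

```python
def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=100):
    """Newton's method for 1-D minimization: x_{n+1} = x_n - f'(x_n) / f''(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:   # stop when the update is negligible
            break
    return x

# Minimize f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2
x_star = newton_minimize(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
```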

Page 32:

Logistic Regression
Model Fitting

• Returning to the model fitting for the two-class case, the log-likelihood is:

$$\ell(\beta) = \sum_{i=1}^{N}\left\{y_i\,\beta^T x_i - \log\bigl(1+e^{\beta^T x_i}\bigr)\right\}$$

• Starting with $\beta^{old}$, a single Newton update is:

$$\beta^{new} = \beta^{old} - \left(\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\frac{\partial\ell(\beta)}{\partial\beta}$$

where the derivatives are evaluated at $\beta^{old}$

32

Page 33:

Logistic Regression
Model Fitting

• Let
• $\mathbf{y}$ denote the vector of $y_i$ values
• $\mathbf{X}$ the $N\times(p+1)$ matrix of $x_i$ values
• $\mathbf{p}$ the vector of fitted probabilities with $i$th element $p(x_i;\beta^{old})$
• $\mathbf{W}$ an $N\times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i;\beta^{old})\bigl(1-p(x_i;\beta^{old})\bigr)$

33

Page 34:

Logistic Regression
Model Fitting

• In matrix notation:

$$\frac{\partial\ell(\beta)}{\partial\beta} = \mathbf{X}^T(\mathbf{y}-\mathbf{p}), \qquad \frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T} = -\mathbf{X}^T\mathbf{W}\mathbf{X}$$

• Newton step:

$$\beta^{new} = \beta^{old} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p})$$

34
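Putting the matrix form together gives a short Newton fitting loop. A minimal NumPy sketch on synthetic data; the true coefficients used to generate the data (intercept 0.5, slope 2.0) are made up for the demo:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Newton updates beta_new = beta_old + (X^T W X)^{-1} X^T (y - p).
    X is assumed to already include a leading column of 1s for the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = np.diag(p * (1.0 - p))            # diagonal weight matrix
        beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
    return beta

# Synthetic two-class data generated from a logistic model
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = (rng.random(100) < 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))).astype(float)
X = np.column_stack([np.ones(100), x])
beta_hat = fit_logistic_newton(X, y)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse of the Hessian.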

Page 35:

Logistic Regression
Example

• A subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa
• Aim: establish the intensity of ischemic heart disease risk factors in that high-incidence region
• The response variable is the presence or absence of myocardial infarction (MI) at the time of the survey
• There are 160 cases in the data set, and a sample of 302 controls

35

Page 36:

Logistic Regression
Example

36

Page 37:

Logistic Regression
Example

• Fit a logistic regression model by maximum likelihood, giving the results shown on the next slide
• z scores for each coefficient in the model (coefficients divided by their standard errors)

37

Page 38:

Logistic Regression
Example

• Results from a logistic regression fit to the South African heart disease data:

38

Term         Coefficient   Std. Error   Z Score
(Intercept)  -4.130        0.964        -4.285
sbp           0.006        0.006         1.023
tobacco       0.080        0.026         3.034
ldl           0.185        0.057         3.219
famhist       0.939        0.225         4.178
obesity      -0.035        0.029        -1.187
alcohol       0.001        0.004         0.136
age           0.043        0.010         4.184

Page 39:

Logistic Regression
Example

• z scores greater than approximately 2 in absolute value are significant at the 5% level
• There are some surprises in the table of coefficients:
• sbp and obesity appear to be not significant
• on their own, both sbp and obesity are significant, with positive sign
• in the presence of many other correlated variables, they are no longer needed (and can even get a negative sign)

39

Page 40:

L1 Regularized Logistic Regression

• The $L_1$ penalty used in the Lasso can be used for variable selection and shrinkage with any linear regression model.
• For logistic regression, we would maximize a penalized log-likelihood:

$$\max_{\beta_0,\,\beta}\; \sum_{i=1}^{N}\Bigl[y_i(\beta_0+\beta^T x_i) - \log\bigl(1+e^{\beta_0+\beta^T x_i}\bigr)\Bigr] - \lambda\sum_{j=1}^{p}|\beta_j|$$

• As with the Lasso, we typically do not penalize the intercept term, and we standardize the predictors

40
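One simple (if not the fastest) way to handle the non-smooth penalty is proximal gradient ascent: take a gradient step on the log-likelihood, then soft-threshold every coefficient except the intercept. This is a sketch under those assumptions; production solvers typically use coordinate descent instead, and the synthetic data below (only one truly relevant predictor) is made up for the demo:

```python
import numpy as np

def soft_threshold(v, t):
    """Lasso proximal operator: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Proximal-gradient ascent on the L1-penalized log-likelihood.
    Column 0 of X is the intercept and is not penalized."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta = beta + step * X.T @ (y - p) / len(y)      # gradient step (mean log-lik)
        beta[1:] = soft_threshold(beta[1:], step * lam)  # shrink all but the intercept
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
# Only the first predictor actually matters in this synthetic example
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-3.0 * X[:, 1]))).astype(float)
beta_hat = l1_logistic(X, y, lam=0.2)
```

With a sufficiently large penalty, the coefficients of the irrelevant predictors are driven to (essentially) zero while the informative one survives.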

Page 41:

Logistic Regression or LDA?

• For Linear Discriminant Analysis, the log-posterior odds between class $k$ and $K$ are linear functions of $x$:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \log\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K) + x^T\Sigma^{-1}(\mu_k-\mu_K) = \alpha_{k0} + \alpha_k^T x$$

41

Page 42:

Logistic Regression or LDA?

• The linear logistic regression model by construction has linear logits:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \beta_{k0} + \beta_k^T x$$

• It seems that the linear discriminant analysis model and the linear logistic model are the same
• Although they have exactly the same form, different linear coefficients are estimated
• Logistic regression is more general, because it makes fewer assumptions

42

Page 43:

Logistic Regression or LDA?

• Joint density of $X$ and $G$:

$$\Pr(X, G=k) = \Pr(X)\,\Pr(G=k\mid X)$$

where $\Pr(X)$ denotes the marginal density of the inputs $X$
• For both LDA and logistic regression, the second term on the right has the logit-linear form:

$$\Pr(G=k\mid X=x) = \frac{e^{\beta_{k0}+\beta_k^T x}}{1+\sum_{\ell=1}^{K-1} e^{\beta_{\ell 0}+\beta_\ell^T x}}$$

43

Page 44:

Logistic Regression or LDA?
• The logistic regression model leaves the marginal density of $X$ as an arbitrary density function $\Pr(X)$
• It fits the parameters of $\Pr(G\mid X)$ by maximizing the conditional likelihood, the multinomial likelihood with probabilities $p_k(x;\theta)$
• Although $\Pr(X)$ is ignored, we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion
• using the empirical distribution function, which places mass $1/N$ at each observation

44

Page 45:

Logistic Regression or LDA?
• LDA fits the parameters by maximizing the full log-likelihood, based on the joint density:

$$\Pr(X=x_i, G=g_i) = \phi(x_i;\mu_{g_i},\Sigma)\,\pi_{g_i}$$

where $\phi$ is the Gaussian density function
• Standard normal theory leads easily to the estimates $\hat\mu_k$, $\hat\Sigma$, and $\hat\pi_k$
• The marginal density $\Pr(X)$ does play a role. It is a mixture density:

$$\Pr(X=x) = \sum_{k=1}^{K}\pi_k\,\phi(x;\mu_k,\Sigma)$$

which also involves the parameters.

Page 46:

Logistic Regression or LDA?

• By relying on additional model assumptions, LDA has more information about the parameters, and can estimate them more efficiently (lower variance)
• If in fact the true $f_k(x)$ are Gaussian:
• in the worst case, ignoring this marginal part of the likelihood constitutes a loss of efficiency of about 30% asymptotically in the error rate
• with 30% more data, the conditional likelihood will do as well.

Page 47:

Logistic Regression or LDA?

• For LDA, observations far from the decision boundary (which are down-weighted by logistic regression) play a role in estimating the common covariance matrix
• This is not all good news: it means that LDA is not robust to gross outliers
• The marginal likelihood can be thought of as a regularizer
• In practice, it is generally felt that logistic regression is a safer, more robust bet than LDA, relying on fewer assumptions.

47