Logistic Regression
Rong Jin
Logistic Regression
• Generative models often lead to a linear decision boundary
• Logistic regression is a linear discriminative model: it models the linear decision boundary directly
• w is the parameter vector to be learned
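The model equation on this slide did not survive extraction; a standard form of the binary logistic model, consistent with the y = ±1 labels used later in the deck, is

$$ p(y \mid \mathbf{x}; \mathbf{w}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^\top \mathbf{x})} $$

(a bias term b can be absorbed into w by appending a constant feature to x).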
Learn parameter w by Maximum Likelihood Estimation (MLE)
• Given training data {(x_i, y_i)}, i = 1, …, n, maximize the likelihood of the observed labels
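The objective on the slide was an image; under the model above, the MLE problem is

$$ \max_{\mathbf{w}} \; \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \mathbf{w}) \;=\; \max_{\mathbf{w}} \; -\sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i\,\mathbf{w}^\top \mathbf{x}_i)\bigr) $$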
Logistic Regression
• Convex objective function → global optimum
• Optimized by gradient descent
• Classification error: the logistic loss serves as a smooth surrogate for the 0/1 error
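For reference (a reconstruction, not verbatim from the slide), the gradient of the log-likelihood l(w) above is

$$ \nabla_{\mathbf{w}}\, l(\mathbf{w}) \;=\; \sum_{i=1}^{n} \frac{y_i\,\mathbf{x}_i}{1 + \exp(y_i\,\mathbf{w}^\top \mathbf{x}_i)} $$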
Illustration of Gradient Descent
How to Decide the Step Size?
• Backtracking line search: start with a large step and shrink it until the objective improves sufficiently (see the sketch below)
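A minimal numpy sketch of gradient ascent with backtracking (Armijo) line search for the objective above; the function names and the constants alpha, beta are illustrative choices, not from the slides:

```python
import numpy as np

def log_likelihood(w, X, y):
    # l(w) = -sum_i log(1 + exp(-y_i * w.x_i)); logaddexp avoids overflow
    margins = y * (X @ w)
    return -np.sum(np.logaddexp(0.0, -margins))

def gradient(w, X, y):
    # grad l(w) = sum_i y_i * x_i / (1 + exp(y_i * w.x_i))
    margins = y * (X @ w)
    return X.T @ (y / (1.0 + np.exp(margins)))

def fit_logistic(X, y, n_iters=200, alpha=0.3, beta=0.5):
    """Gradient ascent with backtracking line search (Armijo condition)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = gradient(w, X, y)
        eta = 1.0
        # Shrink the step until the objective improves by at least
        # a fixed fraction of the first-order prediction alpha*eta*||g||^2.
        while (log_likelihood(w + eta * g, X, y)
               < log_likelihood(w, X, y) + alpha * eta * (g @ g)):
            eta *= beta
            if eta < 1e-10:
                break
        w = w + eta * g
    return w
```

Here y is a vector of ±1 labels and the rows of X are feature vectors; backtracking repeatedly scales the step by beta until the Armijo test passes, which guarantees progress without hand-tuning a fixed step size.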
Example: Heart Disease
• Input feature x: age group ID (coded 1-8, listed below)
• Output y: whether the subject has heart disease
• y = +1: has heart disease
• y = -1: no heart disease
1: 25-29
2: 30-34
3: 35-39
4: 40-44
5: 45-49
6: 50-54
7: 55-59
8: 60-64
[Bar chart: number of people with and without heart disease in each age group; axes: Age group (1-8) vs. Number of People; legend: No heart disease / Heart disease]
Example: Text Categorization
Learn to classify text into two categories
• Input d: a document, represented by a word histogram
• Output y: +1 for a political document, -1 for a non-political document
Example: Text Categorization
• Training data: a set of labeled documents {(d_i, y_i)} (a small word-histogram illustration follows)
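A small, hypothetical illustration of the word-histogram representation (the vocabulary and document below are made up):

```python
from collections import Counter

def word_histogram(doc, vocabulary):
    """Represent a document as a vector of word counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["election", "vote", "policy", "recipe", "flour"]
doc = "the election vote was close and the vote mattered"
print(word_histogram(doc, vocab))  # [1, 2, 0, 0, 0]
```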
Example 2: Text Classification
• Dataset: Reuters-21578
• Classification accuracy:
• Naïve Bayes: 77%
• Logistic regression: 88%
Logistic Regression vs. Naïve Bayes
• Both produce linear decision boundaries
• Naïve Bayes: weights fixed by class-conditional probabilities estimated from word counts
• Logistic regression: learns the weights by MLE
• Both can be viewed as modeling p(d|y)
• Naïve Bayes: independence assumption among words
• Logistic regression: assumes an exponential-family distribution for p(d|y) (a much broader assumption)
Discriminative vs. Generative
Discriminative models: model P(y|x)
• Pros: usually good performance
• Cons: slow convergence; expensive computation; sensitive to noisy data
Generative models: model P(x|y)
• Pros: usually fast convergence; cheap computation; robust to noisy data
• Cons: usually worse performance
Overfitting Problem
Consider text categorization:
• What is the weight for a word j that appears in only one training document d_k? MLE pushes that weight toward infinity so that d_k is fit perfectly, a classic symptom of overfitting.
Overfitting Problem
[Plot vs. iteration: curves with and without regularization]
Overfitting Problem
• Overfitting shows up as a decrease in classification accuracy on the test data
Solution: Regularization
Regularized log-likelihood
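The formula on the slide was an image; the standard ℓ2-regularized log-likelihood consistent with the effects listed below is

$$ l_{\text{reg}}(\mathbf{w}) \;=\; \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \mathbf{w}) \;-\; \frac{\lambda}{2}\,\|\mathbf{w}\|_2^2 $$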
The effects of the regularizer:
• Favors small weights
• Guarantees a bounded norm of w
• Guarantees a unique solution
Regularized Logistic Regression
[Plot vs. iteration: classification performance with and without regularization]
Regularization as Robust Optimization
• Assume each data point is not known exactly, but lies within a sphere of radius r centered at x_i (the worst-case consequence is sketched below)
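A reconstruction of the argument (the slide's derivation was an image): for a perturbation δ_i with ‖δ_i‖₂ ≤ r, the worst case of the margin term is

$$ \max_{\|\boldsymbol{\delta}_i\|_2 \le r} \; -y_i\,\mathbf{w}^\top(\mathbf{x}_i + \boldsymbol{\delta}_i) \;=\; -y_i\,\mathbf{w}^\top\mathbf{x}_i + r\,\|\mathbf{w}\|_2 $$

so minimizing the worst-case logistic loss adds an r‖w‖₂ term inside each loss, which acts as a norm penalty: robust optimization recovers regularization.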
Sparse Solution by Lasso Regularization
RCV1 collection:
• 800K documents
• 47K unique words
Sparse Solution by Lasso Regularization
How to solve the optimization problem? (the ℓ1 penalty is not differentiable at zero)
• Subgradient descent (a sketch follows)
• Minimax
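A minimal sketch of subgradient ascent on the ℓ1-regularized log-likelihood l(w) − λ‖w‖₁; the step-size schedule and the value of λ are illustrative assumptions:

```python
import numpy as np

def lasso_logistic(X, y, lam=0.1, n_iters=500):
    """Subgradient ascent on l(w) - lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        margins = y * (X @ w)
        grad_ll = X.T @ (y / (1.0 + np.exp(margins)))  # gradient of the log-likelihood
        subgrad = grad_ll - lam * np.sign(w)           # sign(0) = 0 is a valid subgradient
        w += (1.0 / np.sqrt(t)) * subgrad              # diminishing step size
    return w
```

Taking sign(0) = 0 selects a valid element of the subdifferential at the non-differentiable points, and the 1/√t diminishing step size is the standard choice for subgradient methods.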
Bayesian Treatment
• Compute the posterior distribution of w given the training data
• Laplace approximation: approximate the posterior by a Gaussian centered at the mode
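The equations on this slide were images; the standard construction, stated as a reconstruction, is

$$ p(\mathbf{w} \mid D) \;\propto\; p(D \mid \mathbf{w})\, p(\mathbf{w}), \qquad p(\mathbf{w} \mid D) \;\approx\; \mathcal{N}\bigl(\mathbf{w}_{\text{MAP}},\, \mathbf{H}^{-1}\bigr) $$

where w_MAP maximizes the log-posterior and H = −∇² log p(w|D) is the Hessian evaluated at w_MAP.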
Multi-class Logistic Regression
• How to extend the logistic regression model to multi-class classification?
Conditional Exponential Model
• Let the classes be {1, 2, …, K}
• Need to learn a weight vector w_k for each class
• Normalization factor (partition function) Z(x)
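The model equation itself did not survive extraction; the standard conditional exponential (softmax) form matching these bullets is

$$ p(y = k \mid \mathbf{x}) \;=\; \frac{\exp(\mathbf{w}_k^\top \mathbf{x})}{Z(\mathbf{x})}, \qquad Z(\mathbf{x}) \;=\; \sum_{j=1}^{K} \exp(\mathbf{w}_j^\top \mathbf{x}) $$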
Conditional Exponential Model
• Learn the weights w_1, …, w_K by maximum likelihood estimation
• Any problem?
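The slides leave the question open, but a standard issue with this parameterization (presumably what the modified model below addresses) is redundancy: adding the same vector v to every w_k leaves p(y|x) unchanged, so the weights are not identifiable and the MLE is not unique; fixing one class's weights, e.g. w_K = 0, removes the redundancy.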
Modified Conditional Exponential Model