Supervised Classifiers: Naïve Bayes, Perceptron, Logistic Regression — CMSC 723 / LING 723 / INST 725, Marine Carpuat ([email protected]). Some slides by Graham Neubig, Jacob Eisenstein.


Page 1

Supervised Classifiers:

Naïve Bayes, Perceptron,

Logistic Regression

CMSC 723 / LING 723 / INST 725

MARINE CARPUAT

[email protected]

Some slides by Graham Neubig, Jacob Eisenstein

Page 2

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– Logistic Regression

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff

Page 3

An online learning algorithm
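The algorithm itself does not survive in this transcript. As a minimal sketch (all names illustrative, not from the slides): online perceptron training over sparse feature dictionaries, with labels in {-1, +1}, updating only when the current weights misclassify an example.

```python
def dot(w, x):
    # Sparse dot product between a weight dict and a feature dict
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def train_perceptron(data, epochs=10):
    # data: list of (features, label) pairs with label in {-1, +1}
    w = {}
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if dot(w, x) >= 0 else -1
            if y_hat != y:  # error-driven: update only on mistakes
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + y * v
    return w

w = train_perceptron([({"good": 1}, 1), ({"bad": 1}, -1)])
```

Because updates happen one example at a time, the learner never needs to hold the whole dataset in memory, which is what makes the algorithm "online".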

Page 4

Perceptron weight update

• If y = 1, increase the weights for the features that fire in x

• If y = -1, decrease the weights for the features that fire in x

Page 5

The perceptron

• A linear model for classification

• An algorithm to learn feature weights given labeled data

– online algorithm

– error-driven

– Does it converge?

• See “A Course In Machine Learning” Ch.3

Page 6

When to stop?

• One technique

– When the accuracy on held-out data starts to decrease

– Early stopping
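That loop can be sketched as follows; the `evaluate` and `train_one_epoch` callables are illustrative placeholders, not from the slides, and a `patience` counter decides how many non-improving epochs to tolerate before stopping.

```python
def train_with_early_stopping(train, dev, evaluate, train_one_epoch, patience=2):
    # Stop when held-out (dev) accuracy has not improved for `patience` epochs,
    # and return the weights from the best epoch seen so far.
    w, best_w, best_acc, bad = {}, {}, -1.0, 0
    while bad < patience:
        train_one_epoch(w, train)
        acc = evaluate(w, dev)
        if acc > best_acc:
            best_acc, best_w, bad = acc, dict(w), 0
        else:
            bad += 1
    return best_w
```

Returning a snapshot of the best weights (rather than the final ones) is what makes early stopping act as a form of regularization.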

Page 7

Multiclass perceptron
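The slide's pseudocode is not reproduced in this transcript. One standard formulation (a sketch, names illustrative) keeps one weight vector per class, predicts the argmax-scoring class, and on a mistake boosts the correct class's weights while demoting the predicted class's:

```python
def predict_multiclass(w, x, labels):
    # Pick the label whose weight vector scores x highest
    return max(labels, key=lambda y: sum(w[y].get(f, 0.0) * v for f, v in x.items()))

def train_multiclass_perceptron(data, labels, epochs=10):
    w = {y: {} for y in labels}  # one weight dict per class
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict_multiclass(w, x, labels)
            if y_hat != y:  # error-driven, as in the binary case
                for f, v in x.items():
                    w[y][f] = w[y].get(f, 0.0) + v        # boost correct label
                    w[y_hat][f] = w[y_hat].get(f, 0.0) - v  # demote predicted label
    return w

w = train_multiclass_perceptron([({"a": 1}, "A"), ({"b": 1}, "B")], ["A", "B"])
```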

Page 8

Averaged Perceptron
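The averaged perceptron returns the average of the weight vectors after every step rather than the final vector, which often generalizes better. A sketch using the usual lazy-summation trick so the average need not be recomputed from scratch (names illustrative):

```python
def train_averaged_perceptron(data, epochs=10):
    # w: current weights; total: sum of (step index * update), so that the
    # average over all c steps equals w - total / c without storing every w.
    w, total, c = {}, {}, 1
    for _ in range(epochs):
        for x, y in data:
            score = sum(w.get(f, 0.0) * v for f, v in x.items())
            y_hat = 1 if score >= 0 else -1
            if y_hat != y:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + y * v
                    total[f] = total.get(f, 0.0) + c * y * v
            c += 1
    return {f: w[f] - total.get(f, 0.0) / c for f in w}

w_avg = train_averaged_perceptron([({"good": 1}, 1), ({"bad": 1}, -1)], epochs=3)
```

Averaging damps the influence of late, possibly noisy updates; it is the same idea presented in "A Course In Machine Learning" Ch. 3.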

Page 9

What objective/loss does the perceptron optimize?

• Zero-one loss function

• What are the pros and cons compared to the Naïve Bayes loss?

Page 10

Perceptron & Probabilities

• What if we want a probability p(y|x)?

• The perceptron only gives us a prediction ŷ, not a probability

Page 11

The logistic function

• “Softer” function than in perceptron

• Can account for uncertainty

• Differentiable
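The function itself, for reference (a minimal sketch; the name `logistic` is illustrative):

```python
import math

def logistic(z):
    # The logistic (sigmoid) function: maps any real score z to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))
```

Note that logistic(0) = 0.5 and logistic(-z) = 1 - logistic(z): scores near zero express uncertainty, unlike the perceptron's hard threshold, and the function is differentiable everywhere.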

Page 12

Logistic regression: how to train?

• Train based on conditional likelihood

• Find parameters w that maximize the conditional likelihood of all answers 𝑦𝑖 given examples 𝑥𝑖
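In symbols (a standard formulation, assuming labels y ∈ {-1, +1} and writing σ for the logistic function):

```latex
\hat{w} = \arg\max_{w} \sum_{i} \log P(y_i \mid x_i; w),
\qquad
P(y \mid x; w) = \sigma(y \, w \cdot x) = \frac{1}{1 + e^{-y \, w \cdot x}}
```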

Page 13

Stochastic gradient ascent (or descent)

• Online training algorithm for logistic regression

– and other probabilistic models

• Update weights for every training example

• Move in the direction given by the gradient

• Size of update step scaled by the learning rate
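The bullets above can be sketched as follows (labels in {-1, +1}, `alpha` as the learning rate; all names illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(data, alpha=0.1, epochs=100):
    # data: (features, label) pairs with label in {-1, +1}
    w = {}
    for _ in range(epochs):
        for x, y in data:  # update weights for every training example
            score = sum(w.get(f, 0.0) * v for f, v in x.items())
            # gradient of log P(y|x) w.r.t. w is y * (1 - sigmoid(y * score)) * x
            g = y * (1.0 - sigmoid(y * score))
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + alpha * g * v  # step scaled by learning rate
    return w

w = sgd_logistic([({"good": 1}, 1), ({"bad": 1}, -1)])
```

With ascent on the log-likelihood, confidently correct examples (large y * score) contribute almost nothing, while misclassified ones drive the update.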

Page 14

Gradient of the logistic function
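The slide's derivation is not reproduced in this transcript; the standard identities it relies on are:

```latex
\frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr),
\qquad
\frac{\partial}{\partial w} \log \sigma(y \, w \cdot x)
  = y \,\bigl(1 - \sigma(y \, w \cdot x)\bigr)\, x
```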

Page 15

Example: initial update

Page 16

Example: second update

Page 17

How to set the learning rate?

• Various strategies

– decay over time: 𝛼 = 1 / (𝐶 + 𝑡), where 𝐶 is a parameter and 𝑡 is the number of samples seen so far

– use a held-out test set, and increase the learning rate when likelihood increases

Page 18

Multiclass version

Page 19

Some models are better than others…

• Consider these 2 examples

• Which of the 2 models below is better?

Classifier 2 will probably generalize better: it does not include irrelevant information.

=> The smaller model is better

Page 20

Regularization

• A penalty on adding extra weights

• L2 regularization:

– big penalty on large weights

– small penalty on small weights

• L1 regularization:

– uniform penalty, whether weights are large or small

– will cause many weights to become exactly zero

[Figure: contours of the L2 penalty ‖𝑤‖₂ vs. the L1 penalty ‖𝑤‖₁]

Page 21

L1 regularization in online learning
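The slide's algorithm is not recoverable from this transcript. One common approach (truncated gradient; a sketch, names illustrative) applies the L1 penalty after each gradient step by moving every weight a fixed amount toward zero and clipping at zero, which is what drives weights to become exactly zero:

```python
def l1_clip(w, f, penalty):
    # Move w[f] toward zero by `penalty`, clipping at zero so the
    # regularizer can zero a weight out but never flip its sign.
    if w.get(f, 0.0) > 0:
        w[f] = max(0.0, w[f] - penalty)
    elif w.get(f, 0.0) < 0:
        w[f] = min(0.0, w[f] + penalty)

w = {"a": 0.3, "b": -0.05}
l1_clip(w, "a", penalty=0.1)  # large weight: shrinks but survives
l1_clip(w, "b", penalty=0.1)  # small weight: clipped to exactly zero
```

In an online learner this clipping runs right after each SGD update, so rarely useful features never accumulate weight and the model stays sparse.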

Page 22

What you should know

• Standard supervised learning set-up for text classification

– Difference between train vs. test data

– How to evaluate

• 3 examples of supervised linear classifiers

– Naïve Bayes, Perceptron, Logistic Regression

– Learning as optimization: what is the objective function optimized?

– Difference between generative vs. discriminative classifiers

– Bias-Variance trade-off, smoothing, regularization