
Page 1

Supervised Classification

CMSC 723 / LING 723 / INST 725

MARINE CARPUAT

marine@cs.umd.edu

Some slides by Graham Neubig, Jacob Eisenstein

Page 2

Last time

• Text classification problems

– and their evaluation

• Linear classifiers

– Features & Weights

– Bag of words

– Naïve Bayes

[Figure: "Machine Learning, Probability" / "Linguistics"]

Page 3

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– (Logistic Regression)

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff

Page 4

Naïve Bayes Recap

Page 5

The Naivety of Naïve Bayes

Conditional independence assumption
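
For a document with words x1, …, xn and label y, the assumption can be written as (standard form; the slide's own formula did not survive extraction):

  P(x1, …, xn | y) = P(x1 | y) · P(x2 | y) · … · P(xn | y)

i.e., words are treated as independent of each other once the class is known.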

Page 6

Naïve Bayes: Example

Page 7

Smoothing

• Goal: assign some probability mass to events that were not seen during training

• One method: "add alpha" smoothing
  – Often, alpha = 1
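
For word probabilities this gives the familiar estimate (standard add-alpha form; notation mine, as the slide's formula is not shown):

  P(w | y) = (count(w, y) + alpha) / (Σw′ count(w′, y) + alpha · |Vocabulary|)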

Page 8

Multinomial Naïve Bayes: Learning in Practice

• From training corpus, extract Vocabulary

• Calculate P(yj) terms
  – For each yj in Y do
    docsj ← all docs with class = yj

    P(yj) = |docsj| / |total # documents|

• Calculate P(wk | yj) terms
  – Textj ← single doc containing all docsj
  – For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj

    P(wk | yj) = (nk + 1) / (n + |Vocabulary|)
    (n = total number of word positions in Textj)
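
As a concrete illustration, here is a minimal Python sketch of this "count and normalize (and smooth)" procedure (function and variable names are mine, not from the slides):

    from collections import Counter, defaultdict
    import math

    def train_multinomial_nb(docs, labels, alpha=1.0):
        """docs: list of token lists; labels: parallel list of class labels."""
        vocab = {w for doc in docs for w in doc}
        class_docs = Counter(labels)                 # |docs_j| per class
        word_counts = defaultdict(Counter)           # n_k per class
        for doc, y in zip(docs, labels):
            word_counts[y].update(doc)

        log_prior = {y: math.log(n / len(docs)) for y, n in class_docs.items()}
        log_likelihood = {}
        for y in class_docs:
            n = sum(word_counts[y].values())         # word positions in Text_j
            denom = n + alpha * len(vocab)
            log_likelihood[y] = {w: math.log((word_counts[y][w] + alpha) / denom)
                                 for w in vocab}
        return log_prior, log_likelihood

Prediction then scores each class as log P(yj) + Σk log P(wk | yj) over the document's words and returns the highest-scoring class.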

Page 9

Bias-variance trade-off

• Variance of a classifier
  – How much its decisions are affected by small changes in training sets
  – Lower variance = smaller changes

• Bias of a classifier
  – How accurate it is at modeling different training sets
  – Lower bias = more accurate

• High-variance classifiers tend to overfit
• High-bias classifiers tend to underfit

Page 10

Bias-variance trade-off

• Impact of smoothing

– Lowers variance

– Increases bias (toward uniform probabilities)

Page 11

Naïve Bayes

• A linear classifier whose weights can be interpreted as parameters of a probabilistic model

• Pros
  – parameters are easy to estimate from data: "count and normalize" (and smooth)

• Cons
  – requires making a conditional independence assumption, which does not hold in practice

Page 12

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– Logistic Regression

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff

– Smoothing, regularization

Page 13

Beyond Bag of Words for classification tasks

Given an introductory sentence in Wikipedia, predict whether the article is about a person

Page 14

Designing features

Page 15

Predicting requires combining information

• Given features and weights
• Predicting for a new example:
  – If (sum of weights > 0), "yes"; otherwise "no"

Page 16

Formalizing binary classification with linear models
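
The slide's formulas were lost in extraction, but the standard formalization is: compute a score s = w · φ(x) = Σi wi · φi(x) over a feature vector φ(x), and predict ŷ = sign(s). A minimal sketch in Python:

    def predict(weights, features):
        """weights, features: dicts mapping feature name -> value.
        Predict +1 or -1 from the sign of the weighted feature sum."""
        score = sum(weights.get(name, 0.0) * value
                    for name, value in features.items())
        return 1 if score > 0 else -1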

Page 17

Example feature functions: Unigram features

• Number of times a particular word appears
  – i.e. bag of words
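
A one-function sketch of unigram feature extraction (names are mine):

    from collections import Counter

    def unigram_features(tokens):
        """Bag of words: map each word to its count in the example."""
        return Counter(tokens)

For example, unigram_features(["the", "cat", "sat", "on", "the", "mat"]) counts "the" twice and every other word once.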

Page 18

An online learning algorithm
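
The algorithm itself was an image on this slide; the standard online perceptron loop it refers to looks roughly like this (a sketch, with labels y ∈ {+1, −1}):

    def train_perceptron(examples, epochs=5):
        """examples: list of (features, y) pairs; features is a dict, y is +1 or -1."""
        weights = {}
        for _ in range(epochs):
            for features, y in examples:
                score = sum(weights.get(f, 0.0) * v for f, v in features.items())
                y_hat = 1 if score > 0 else -1
                if y_hat != y:                  # error-driven: update only on mistakes
                    for f, v in features.items():
                        weights[f] = weights.get(f, 0.0) + y * v
        return weights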

Page 19

Perceptron weight update

• If y = +1, increase the weights for the features active in the example
• If y = −1, decrease the weights for the features active in the example

(equivalently: on a mistake, update w ← w + y · φ(x))

Page 20

Example: initial update

Page 21

Example: second update

Page 22

Perceptron

• A linear model for classification

• An algorithm to learn feature weights given labeled data
  – online algorithm
  – error-driven
  – Does it converge?
    • See "A Course In Machine Learning" Ch. 3

Page 23

Multiclass perceptron
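
The pseudocode on this slide did not survive extraction; the usual multiclass variant keeps one weight vector per class, predicts the highest-scoring class, and on an error adds the example's features to the correct class's weights and subtracts them from the predicted class's. A sketch:

    from collections import defaultdict

    def train_multiclass_perceptron(examples, classes, epochs=5):
        """examples: list of (features, y); one weight dict per class."""
        weights = {c: defaultdict(float) for c in classes}

        def score(c, features):
            return sum(weights[c][f] * v for f, v in features.items())

        for _ in range(epochs):
            for features, y in examples:
                y_hat = max(classes, key=lambda c: score(c, features))
                if y_hat != y:
                    for f, v in features.items():
                        weights[y][f] += v      # boost the correct class
                        weights[y_hat][f] -= v  # penalize the predicted class
        return weights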

Page 24

Bias-variance trade-off

• How do we decide when to stop?
  – Accuracy on held-out data
  – Early stopping

• Averaged perceptron
  – Improves generalization

Page 25

Averaged perceptron
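
The averaged perceptron returns the average of the weight vector over all update steps rather than its final value, which damps the effect of late, noisy updates. A common efficient formulation (a sketch in the style of CIML Ch. 3, not the slide's exact code):

    def train_averaged_perceptron(examples, epochs=5):
        """Averaged perceptron via the cached-sum trick: the averaged
        weights equal w - u / c at the end of training."""
        w, u = {}, {}
        c = 1                                   # update counter
        for _ in range(epochs):
            for features, y in examples:
                score = sum(w.get(f, 0.0) * v for f, v in features.items())
                if y * score <= 0:              # mistake
                    for f, v in features.items():
                        w[f] = w.get(f, 0.0) + y * v
                        u[f] = u.get(f, 0.0) + c * y * v
                c += 1
        return {f: w[f] - u.get(f, 0.0) / c for f in w}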

Page 26

Learning as optimization: Loss functions

• Naïve Bayes chooses weights to maximize the joint likelihood of the training data (or log likelihood)
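
In symbols (standard form; notation mine):

  ŵ = argmax_w Σi log P(𝑥𝑖, 𝑦𝑖 ; w)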

Page 27

Perceptron Loss function

• “0-1” loss

• Treats all errors equally

• Does not care about confidence of classification decision

Page 28

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– (Logistic Regression)

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff

Page 29

Perceptron & Probabilities

• What if we want a probability p(y|x)?

• The perceptron gives us a prediction y

Page 30

The logistic function

• A "softer" function than the perceptron's hard step

• Can account for uncertainty

• Differentiable
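
The function itself, applied to a linear score (standard definition; the slide's plot is not reproduced):

  P(y = 1 | x) = σ(w · φ(x)) = 1 / (1 + e^(−w · φ(x)))

Large positive scores push the probability toward 1, large negative scores toward 0, and a score of 0 gives exactly 0.5.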

Page 31

Logistic regression: how to train?

• Train based on conditional likelihood

• Find parameters w that maximize the conditional likelihood of all answers 𝑦𝑖 given examples 𝑥𝑖
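
That is (standard form; notation mine):

  ŵ = argmax_w Πi P(𝑦𝑖 | 𝑥𝑖 ; w)

Contrast with Naïve Bayes, which maximizes the joint likelihood P(𝑥𝑖, 𝑦𝑖).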

Page 32

Stochastic gradient ascent (or descent)

• Online training algorithm for logistic regression
  – and other probabilistic models

• Update weights for every training example
• Move in direction given by gradient
• Size of update step scaled by learning rate
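
A sketch of one such pass for logistic regression, with labels y ∈ {0, 1} (names are mine):

    import math

    def sgd_epoch(examples, weights, learning_rate=0.1):
        """One pass of stochastic gradient ascent on the conditional
        log-likelihood of logistic regression; y must be 0 or 1."""
        for features, y in examples:
            score = sum(weights.get(f, 0.0) * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-score))  # P(y = 1 | x)
            for f, v in features.items():       # gradient: (y - p) * phi(x)
                weights[f] = weights.get(f, 0.0) + learning_rate * (y - p) * v
        return weights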

Page 33

Gradient of the logistic function
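
The key result (standard, for y ∈ {0, 1}): the gradient of the conditional log-likelihood for one example is

  ∇w log P(y | x; w) = (y − σ(w · φ(x))) · φ(x)

so each update moves the weights of the active features in proportion to how far the predicted probability is from the true label; this is exactly the (y − p) · v term in the sketch above.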

Page 34

Example: initial update

Page 35

Example: second update

Page 36

How to set the learning rate?

• Various strategies

• Decay over time:

  𝛼 = 1 / (𝐶 + 𝑡)

  where 𝐶 is a parameter and 𝑡 is the number of samples seen so far

• Use held-out test set: increase learning rate when likelihood increases

Page 37

Some models are better than others…

• Consider these 2 examples
• Which of the 2 models below is better?

Classifier 2 will probably generalize better! It does not include irrelevant information.
⇒ The smaller model is better.

Page 38

Regularization

• A penalty on adding extra weights

• L2 regularization: penalizes ‖𝑤‖₂
  – big penalty on large weights
  – small penalty on small weights

• L1 regularization: penalizes ‖𝑤‖₁
  – penalty grows uniformly whether weights are large or small
  – will cause many weights to become zero
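
The regularized training objective then becomes (standard form; λ is a strength parameter, not from the slide):

  ŵ = argmax_w Σi log P(𝑦𝑖 | 𝑥𝑖 ; w) − λ‖w‖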

Page 39

L1 regularization in online learning
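
The slide's content was an image; a common scheme it likely illustrates applies the L1 penalty at each online update by shrinking every weight toward zero by a fixed amount and clipping at zero, which makes small weights exactly zero. A sketch (my own, not necessarily the slide's exact method):

    def l1_penalize(weights, penalty):
        """Shrink each weight toward zero by `penalty`, clipping at zero,
        so that small weights become exactly zero (a sparse model)."""
        for f in list(weights):
            w = weights[f]
            w = max(0.0, w - penalty) if w > 0 else min(0.0, w + penalty)
            if w == 0.0:
                del weights[f]              # drop zeroed features entirely
            else:
                weights[f] = w
        return weights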

Page 40

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– (Logistic Regression)

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff