
Supervised Classification

CMSC 723 / LING 723 / INST 725

MARINE CARPUAT

marine@cs.umd.edu

Some slides by Graham Neubig, Jacob Eisenstein

Last time

• Text classification problems

– and their evaluation

• Linear classifiers

– Features & Weights

– Bag of words

– Naïve Bayes

[Figure: Machine Learning, Probability, Linguistics]

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– (Logistic Regression)

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff

Naïve Bayes Recap

The Naivety of Naïve Bayes

Conditional independence assumption

Naïve Bayes: Example

Smoothing

• Goal: assign some probability mass to events that were not seen during training

• One method: “add alpha” smoothing

– Often, alpha = 1
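Below is a minimal Python sketch of add-alpha smoothing for per-class word probabilities; the toy vocabulary, counts, and alpha value are illustrative, not from the lecture.

from collections import Counter

def smoothed_word_probs(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed estimate of P(w | class) from the tokens of one class."""
    counts = Counter(tokens)
    total = len(tokens)
    # Every vocabulary word gets alpha pseudo-counts, so unseen words
    # still receive some probability mass.
    return {w: (counts[w] + alpha) / (total + alpha * len(vocabulary)) for w in vocabulary}

# Illustrative toy example
vocab = {"good", "bad", "movie", "plot"}
probs = smoothed_word_probs(["good", "good", "movie"], vocab, alpha=1.0)
print(probs["plot"])        # unseen word, still > 0
print(sum(probs.values()))  # sums to 1 over the vocabulary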

Multinomial Naïve Bayes: Learning in Practice

• From training corpus, extract Vocabulary

• Calculate P(yj) terms

– For each yj in Y do

docsj ← all docs with class = yj

P(yj) = |docsj| / |total # documents|

• Calculate P(wk | yj) terms

– Textj ← single doc containing all docsj

– n ← total # of word occurrences in Textj

– For each word wk in Vocabulary

nk ← # of occurrences of wk in Textj

P(wk | yj) = (nk + 1) / (n + |Vocabulary|)
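As a runnable illustration of the recipe above (count and normalize, and smooth), here is a sketch of multinomial Naïve Bayes training in Python; the toy documents and the function names are mine.

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: list of class labels."""
    vocabulary = {w for doc in docs for w in doc}
    class_counts = Counter(labels)                    # |docs_j|
    word_counts = defaultdict(Counter)                # n_k, per class
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)                    # Text_j = concatenation of docs_j

    log_prior, log_likelihood = {}, {}
    for y in class_counts:
        log_prior[y] = math.log(class_counts[y] / len(docs))
        n = sum(word_counts[y].values())              # total word occurrences in Text_j
        log_likelihood[y] = {
            w: math.log((word_counts[y][w] + alpha) / (n + alpha * len(vocabulary)))
            for w in vocabulary
        }
    return log_prior, log_likelihood, vocabulary

# Toy, made-up training data
docs = [["great", "movie"], ["boring", "plot"], ["great", "plot"]]
labels = ["pos", "neg", "pos"]
log_prior, log_likelihood, vocab = train_multinomial_nb(docs, labels)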

Bias-Variance trade-off

• Variance of a classifier

– How much its decisions are affected by small changes in training sets

– Lower variance = smaller changes

• Bias of a classifier

– How accurate it is at modeling different training sets

– Lower bias = more accurate

• High-variance classifiers tend to overfit

• High-bias classifiers tend to underfit

Bias-Variance trade-off

• Impact of smoothing

– Lowers variance

– Increases bias (toward uniform probabilities)

Naïve Bayes

• A linear classifier whose weights can be interpreted as parameters of a probabilistic model

• Pros

– parameters are easy to estimate from data: “count and normalize” (and smooth)

• Cons

– requires making a conditional independence assumption, which does not hold in practice
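To make the "linear classifier" reading concrete, here is a hedged prediction sketch: each class is scored by its log prior plus count-weighted log likelihoods, which is a linear function of the bag-of-words counts. It assumes log_prior and log_likelihood dictionaries like those produced by the training sketch earlier.

from collections import Counter

def predict_nb(doc, log_prior, log_likelihood, vocabulary):
    """Score each class linearly in the word counts and return the argmax."""
    counts = Counter(w for w in doc if w in vocabulary)   # bag-of-words features
    scores = {}
    for y in log_prior:
        # bias term (log prior) + weighted sum of features (log likelihoods)
        scores[y] = log_prior[y] + sum(c * log_likelihood[y][w] for w, c in counts.items())
    return max(scores, key=scores.get)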

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– Logistic Regression

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff

– Smoothing, regularization

Beyond Bag of Words for classification tasks

Given an introductory sentence in Wikipedia, predict whether the article is about a person

Designing features

Predicting requires combining information

• Given features and weights

• Predicting for a new example:

– If (sum of weights > 0), “yes”; otherwise “no”
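A minimal sketch of that decision rule; the feature names and weights are hypothetical.

def predict(features, weights):
    """Binary decision: "yes" if the weighted sum of active features is positive."""
    score = sum(weights.get(f, 0.0) * value for f, value in features.items())
    return "yes" if score > 0 else "no"

# Hypothetical features/weights for the "is this article about a person?" task
weights = {"contains_born": 2.0, "contains_company": -1.5}
print(predict({"contains_born": 1, "contains_company": 1}, weights))   # "yes"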

Formalizing binary classification with linear models

Example feature functions: Unigram features

• Number of times a particular word appears

– i.e. bag of words
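A sketch of a unigram (bag-of-words) feature function; whitespace tokenization is a simplifying assumption.

from collections import Counter

def unigram_features(sentence):
    """Map a sentence to {word: number of times it appears}."""
    return Counter(sentence.lower().split())

print(unigram_features("the plot was good , the acting was good"))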

An online learning algorithm

Perceptron weight update

• If y = 1, increase the weights for the features in the example

• If y = -1, decrease the weights for the features in the example

Example: initial update

Example: second update
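A minimal sketch of the perceptron update described above, assuming labels y in {+1, -1} and dictionary-valued feature vectors; the variable names are mine.

from collections import defaultdict

def perceptron_update(weights, features, y):
    """If the current weights misclassify (x, y), add y * phi(x) to the weights."""
    score = sum(weights[f] * v for f, v in features.items())
    prediction = 1 if score > 0 else -1
    if prediction != y:                       # error-driven: update only on mistakes
        for f, v in features.items():
            weights[f] += y * v               # increase if y = 1, decrease if y = -1
    return weights

weights = defaultdict(float)
weights = perceptron_update(weights, {"w=unanimous": 1, "w=won": 1}, y=1)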

Perceptron

• A linear model for classification

• An algorithm to learn feature weights given labeled data

– online algorithm

– error-driven

– Does it converge?

• See “A Course In Machine Learning” Ch.3

Multiclass perceptron
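A hedged sketch of one common multiclass formulation: keep one weight vector per class, predict the highest-scoring class, and on an error move weight toward the true class and away from the predicted one. The label set and features here are hypothetical.

from collections import defaultdict

def multiclass_perceptron_update(weights, features, y_true):
    """weights: dict class -> defaultdict(float); features: dict feature -> value."""
    scores = {y: sum(w[f] * v for f, v in features.items()) for y, w in weights.items()}
    y_pred = max(scores, key=scores.get)
    if y_pred != y_true:
        for f, v in features.items():
            weights[y_true][f] += v    # pull the true class up
            weights[y_pred][f] -= v    # push the predicted class down
    return weights

weights = {y: defaultdict(float) for y in ["person", "location", "organization"]}  # hypothetical labels
weights = multiclass_perceptron_update(weights, {"w=born": 1}, y_true="person")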

Bias-Variance trade-off

• How do we decide when to stop?

– Accuracy on held out data

– Early stopping

• Averaged perceptron

– Improves generalization

Averaged perceptron
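A minimal sketch of an averaged perceptron, assuming labels in {+1, -1}: train as usual, but return the average of the weight vector over all examples seen, the variant that tends to generalize better. This naive version keeps a running sum; more efficient lazy-update implementations exist.

from collections import defaultdict

def train_averaged_perceptron(data, epochs=5):
    """data: list of (feature_dict, y) pairs with y in {+1, -1}."""
    weights = defaultdict(float)
    totals = defaultdict(float)        # running sum of each weight after every example
    seen = 0
    for _ in range(epochs):
        for features, y in data:
            score = sum(weights[f] * v for f, v in features.items())
            if y * score <= 0:         # misclassified (or zero score): perceptron update
                for f, v in features.items():
                    weights[f] += y * v
            for f, w in weights.items():
                totals[f] += w
            seen += 1
    return {f: total / seen for f, total in totals.items()}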

Learning as optimization: Loss functions

• Naïve Bayes chooses weights to maximize the joint likelihood of the training data (or log likelihood)
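In symbols (my notation, with training pairs (x_i, y_i) and parameters θ):

\hat{\theta} = \arg\max_{\theta} \sum_{i} \log P(x_i, y_i \,;\, \theta)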

Perceptron Loss function

• “0-1” loss

• Treats all errors equally

• Does not care about confidence of classification decision
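Spelled out, the 0-1 loss of a prediction ŷ against the true label y is:

\ell_{0\text{-}1}(y, \hat{y}) = \begin{cases} 0 & \text{if } \hat{y} = y \\ 1 & \text{if } \hat{y} \neq y \end{cases}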

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– (Logistic Regression)

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff

Perceptron & Probabilities

• What if we want a probability p(y|x)?

• The perceptron gives us a prediction y

The logistic function

• A “softer” function than the perceptron’s hard decision

• Can account for uncertainty

• Differentiable
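A small sketch of the logistic (sigmoid) function applied to a linear score w·x; the dictionary-valued features mirror the earlier sketches and are illustrative.

import math

def logistic(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_positive(weights, features):
    """P(y = +1 | x) as the logistic of the weighted feature sum."""
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return logistic(score)

print(logistic(0.0))   # 0.5: maximal uncertainty at the decision boundary
print(logistic(5.0))   # close to 1: confident positive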

Logistic regression: how to train?

• Train based on conditional likelihood

• Find parameters w that maximize conditional likelihood of all answers 𝑦𝑖 given examples 𝑥𝑖
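In symbols (my notation):

\hat{w} = \arg\max_{w} \sum_{i} \log P(y_i \mid x_i \,;\, w)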

Stochastic gradient ascent (or descent)

• Online training algorithm for logistic regression

– and other probabilistic models

• Update weights for every training example

• Move in direction given by gradient

• Size of update step scaled by learning rate

Gradient of the logistic function
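A hedged sketch of one stochastic gradient ascent step on the conditional log likelihood, assuming y in {+1, -1}; it uses the fact that the gradient of log σ(y · w·x) with respect to w is y · x · σ(-y · w·x).

import math
from collections import defaultdict

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, features, y, learning_rate=0.1):
    """One gradient ascent step on log P(y | x) for a single example, y in {+1, -1}."""
    score = sum(weights[f] * v for f, v in features.items())
    # d/dw log sigma(y * w.x) = y * x * sigma(-y * w.x)
    scale = y * logistic(-y * score)
    for f, v in features.items():
        weights[f] += learning_rate * scale * v
    return weights

weights = defaultdict(float)
weights = sgd_step(weights, {"w=maryland": 1, "w=professor": 1}, y=1)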

Example: initial update

Example: second update

How to set the learning rate?

• Various strategies

• Decay over time: 𝛼 = 1 / (𝐶 + 𝑡), where 𝐶 is a parameter and 𝑡 is the number of samples seen so far

• Use a held-out set: increase the learning rate when likelihood increases
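A tiny sketch of that decay schedule; the value of C and the loop are placeholders.

def learning_rate(t, C=1.0):
    """Decaying learning rate: alpha = 1 / (C + t), with t = samples seen so far."""
    return 1.0 / (C + t)

for t in range(5):
    print(t, learning_rate(t))   # 1.0, 0.5, 0.33..., 0.25, 0.2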

Some models are better than others…

• Consider these 2 examples

• Which of the 2 models below is better?

• Classifier 2 will probably generalize better! It does not include irrelevant information => The smaller model is better

Regularization

• A penalty on adding extra weights

• L2 regularization: penalty proportional to ‖𝑤‖₂²

– big penalty on large weights

– small penalty on small weights

• L1 regularization: penalty proportional to ‖𝑤‖₁

– Uniform increase when large or small

– Will cause many weights to become zero
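Written as an objective (λ is a regularization strength, my notation), the penalty is added to the loss being minimized, for example the negative conditional log likelihood:

\hat{w} = \arg\min_{w}\;\Big(-\sum_i \log P(y_i \mid x_i \,;\, w) + \lambda\, R(w)\Big),
\qquad R(w) = \lVert w \rVert_2^2 \ \text{(L2)} \quad\text{or}\quad \lVert w \rVert_1 \ \text{(L1)}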

L1 regularization in online learning
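One common way to apply the L1 penalty during online training is to move every weight a fixed amount toward zero after each update and clip at zero; the clipping is what produces exact zeros. The sketch below illustrates that idea and is not necessarily the exact variant covered in lecture.

def l1_clip(weights, penalty):
    """Shrink each weight toward zero by `penalty`, clipping at zero (gives sparsity)."""
    for f, w in weights.items():
        if w > 0:
            weights[f] = max(0.0, w - penalty)
        elif w < 0:
            weights[f] = min(0.0, w + penalty)
    return weights

# Illustrative: a small weight is zeroed out, a large one only shrinks a little
print(l1_clip({"rare_feature": 0.05, "strong_feature": 2.0}, penalty=0.1))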

Today

• 3 linear classifiers

– Naïve Bayes

– Perceptron

– (Logistic Regression)

• Bag of words vs. rich feature sets

• Generative vs. discriminative models

• Bias-variance tradeoff

Recommended