Maxent Models (II)


CMSC 473/673

UMBC

September 20th, 2017

Announcements: Assignment 1

Due 11:59 PM, Saturday 9/23

~3.5 days

Use the submit utility with:
  class id: cs473_ferraro
  assignment id: a1

We must be able to run it on GL!
Common pitfall #1: forgetting files
Common pitfall #2: incorrect paths to files
Common pitfall #3: 3rd-party libraries

Announcements: Question 6

Recursive form:

$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_n f^{(n)}(x_{i-n+1:i}) + (1 - \lambda_n)\, p^{(n-1)}(x_i \mid x_{i-n+2:i-1})$

Flattened form:

$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_{n,n} f^{(n)}(x_{i-n+1:i}) + \lambda_{n,n-1} f^{(n-1)}(x_{i-n+2:i}) + \cdots + \lambda_{n,0} f^{(0)}(\cdot)$

where $\lambda_{n,0} = 1 - \sum_{m=0}^{n-1} \lambda_{n,n-m}$
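A minimal sketch of the recursive form above (not the assignment solution); f_ngram(n, context, word) is a hypothetical function returning the order-n relative frequency f^(n), and lambdas holds the interpolation weights:

```python
# Sketch of recursive interpolation, assuming hypothetical helpers f_ngram and lambdas.
def interp_prob(n, context, word, f_ngram, lambdas):
    """p^(n)(word | context) = lambda_n * f^(n) + (1 - lambda_n) * p^(n-1)(word | shorter context)."""
    if n == 0:
        return f_ngram(0, (), word)      # base case: treat p^(0) as f^(0)
    lam = lambdas[n]
    higher = f_ngram(n, context, word)   # f^(n)(x_{i-n+1:i})
    lower = interp_prob(n - 1, context[1:], word, f_ngram, lambdas)
    return lam * higher + (1 - lam) * lower
```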

Announcements: Course Project

Official handout will be out Friday 9/22

Until then, focus on assignment 1

Teams of 1-3

Mixed undergrad/grad is encouraged but not required

Some novel aspect is needed

Ex 1: reimplement existing technique and apply to new domain

Ex 2: reimplement existing technique and apply to new (human) language

Ex 3: explore novel technique on existing problem

Recap from last time…

Classify or Decode with Bayes Rule

how well does text Y represent label X?

how likely is label X overall?

For “simple” or “flat” labels (see the sketch below):
* iterate through labels
* evaluate the score for each label, keeping only the best (or n best)
* return the best (or n best) label and score
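A minimal sketch of this decode loop; log_p_text_given_label and log_p_label are hypothetical scoring functions standing in for p(Y | X) and p(X):

```python
import math

def classify(text, labels, log_p_text_given_label, log_p_label):
    """Return the best label under Bayes' rule: argmax_X p(Y | X) * p(X)."""
    best_label, best_score = None, -math.inf
    for label in labels:
        # work in log space: log p(Y | X) + log p(X)
        score = log_p_text_given_label(text, label) + log_p_label(label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```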

Classification Evaluation: the 2-by-2 contingency table

                           Actually Correct      Actually Incorrect
Selected/Guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)

[Diagram: the space of classes/choices, with overlapping regions for what is correct and what is guessed]

Classification Evaluation: Accuracy, Precision, and Recall

Accuracy: % of items correct

Precision: % of selected items that are correct

Recall: % of correct items that are selected

                           Actually Correct      Actually Incorrect
Selected/Guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)
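A minimal sketch of the three metrics in terms of the table's counts (tp, fp, fn, tn):

```python
def accuracy(tp, fp, fn, tn):
    # % of items correct: the diagonal of the contingency table
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # % of selected (guessed) items that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # % of correct items that are selected
    return tp / (tp + fn)
```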

Language Modeling as Naïve Bayes Classifier

Adopt naïve bag of words representation Yi

Assume position doesn’t matter

Assume the feature probabilities are independent given the class X
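Under these assumptions the class score factors into the prior times per-word probabilities. A minimal sketch in log space; log_prior(x) and log_p_word(w, x) are hypothetical lookups for log p(X) and log p(Y_i | X):

```python
from collections import Counter

def naive_bayes_score(words, label, log_prior, log_p_word):
    """log p(X) + sum_i log p(Y_i | X): positions ignored, words independent given X."""
    score = log_prior(label)
    for word, count in Counter(words).items():
        score += count * log_p_word(word, label)
    return score
```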

Naïve Bayes Summary

Potential Advantages

Very Fast, low storage requirements

Robust to Irrelevant Features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

Potential Issues

Model the posterior in one go?

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data?

(automated, more principled)


Relevant for classification… and language modeling

Maximum Entropy Models

a more general language model

classify in one go: instead of argmax_X p(Y | X) · p(X), directly model argmax_X p(X | Y)

ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

We need to score the different combinations.

Document Classification

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)

COMBINE → posterior probability of ATTACK

are all of these uncorrelated?

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)

score2(seriously wounded, ATTACK)

score3(Shining Path, ATTACK)

COMBINEposterior

probability of ATTACK

Q: What are the score and combine functions for Naïve Bayes?

Scoring Our Possibilities

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

score(doc, ATTACK) =
    score1(fatally shot, ATTACK)
    score2(seriously wounded, ATTACK)
    score3(Shining Path, ATTACK)
    …

Learn these scores… but how?

What do we optimize?

p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))

Maxent Modeling

p(ATTACK | doc) ∝ exp(score(doc, ATTACK))

f(x) = exp(x)

exp gives a positive, unnormalized probability

p(ATTACK | doc) ∝ exp( score1(fatally shot, ATTACK) +
                        score2(seriously wounded, ATTACK) +
                        score3(Shining Path, ATTACK) + … )

Maxent Modeling

Learn the scores (but we’ll declare what combinations should be looked at)

p(ATTACK | doc) ∝ exp( weight1 * applies1(fatally shot, ATTACK) +
                        weight2 * applies2(seriously wounded, ATTACK) +
                        weight3 * applies3(Shining Path, ATTACK) + … )


Q: What if none of our features apply?

Guiding Principle for Log-Linear Models

“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.” (Edwin T. Jaynes, 1957)


If none of our features apply, then f = 0, so exp(θ · f) = exp(θ · 0) = 1

exp( θ1 * f1(fatally shot, ATTACK) +
     θ2 * f2(seriously wounded, ATTACK) +
     θ3 * f3(Shining Path, ATTACK) + … )

Easier-to-write form

With K weights and K features, the sum above is just a dot product:

exp( θ · f(doc, ATTACK) )

θ: K-dimensional weight vector
f: K-dimensional feature vector
θ · f: dot product
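A minimal sketch of this unnormalized score with sparse features; extract_features (returning a dict of active feature values) and the weight dict theta are hypothetical names, not from the slides:

```python
import math

def unnormalized_score(doc, label, extract_features, theta):
    """exp(theta . f(doc, label)) with sparse features stored in dicts."""
    dot = 0.0
    for name, value in extract_features(doc, label).items():
        dot += theta.get(name, 0.0) * value   # features that don't apply contribute 0
    return math.exp(dot)
```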

Log-Linear Models

p_θ(x | y) ∝ exp(θ · f(x, y))

f(x, y): feature function(s), a.k.a. sufficient statistics or “strength” function(s)
θ: feature weights, a.k.a. natural parameters or distribution parameters

Log-Linear Models

How do we normalize?

Z = Σ_{label x} exp( weight1 * applies1(fatally shot, x) +
                     weight2 * applies2(seriously wounded, x) +
                     weight3 * applies3(Shining Path, x) + … )

Normalization for Classification

p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go

p(x | y) = exp(θ · f(x, y)) / Σ_{x′} exp(θ · f(x′, y))
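A minimal sketch of normalizing over the label set; extract_features and theta are hypothetical stand-ins as before:

```python
import math

def classify_posterior(doc, labels, extract_features, theta):
    """p(x | y) = exp(theta . f(x, y)) / sum over labels x' of exp(theta . f(x', y))."""
    logits = {}
    for x in labels:
        feats = extract_features(doc, x)               # sparse dict: name -> value
        logits[x] = sum(theta.get(k, 0.0) * v for k, v in feats.items())
    m = max(logits.values())                           # subtract max for numerical stability
    exps = {x: math.exp(l - m) for x, l in logits.items()}
    Z = sum(exps.values())                             # the normalizer
    return {x: e / Z for x, e in exps.items()}
```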

Normalization for Language Model

general class-based (X) language model of doc y

Can be significantly harder in the general case

Simplifying assumption: maxent n-grams!
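A minimal sketch of the maxent n-gram simplification: normalize over a finite vocabulary given the history rather than over whole documents; extract_features and theta are the same hypothetical stand-ins:

```python
import math

def ngram_prob(word, history, vocab, extract_features, theta):
    """p(word | history) = exp(theta . f(history, word)) / sum over w in vocab."""
    def logit(w):
        feats = extract_features(history, w)           # sparse dict: name -> value
        return sum(theta.get(k, 0.0) * v for k, v in feats.items())
    Z = sum(math.exp(logit(w)) for w in vocab)         # per-context normalizer
    return math.exp(logit(word)) / Z
```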

https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/

https://goo.gl/B23Rxo

Lesson 1


Connections to Other Techniques

Log-Linear Models:
* as statistical regression: (multinomial) logistic regression / softmax regression
* based in information theory: Maximum Entropy models (MaxEnt)
* a form of: Generalized Linear Models
* viewed as: Discriminative Naïve Bayes
* to be cool today :): very shallow (sigmoidal) neural nets

Learning θ

p_θ(y | x): probabilistic model

objective (given observations)

How will we optimize F(θ)?

Calculus

[Figure: F(θ) plotted against θ, with its maximizer θ* marked; F′(θ) is the derivative of F with respect to θ]

Example

F(x) = −(x − 2)²

differentiate: F′(x) = −2x + 4

Solve F′(x) = 0 → x = 2

Common Derivative Rules

What if you can't find the roots? Follow the derivative.

Set t = 0
Pick a starting value θ_t
Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F′(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t * g_t
  5. Set t += 1
(see the code sketch below)

[Figure: successive iterates θ0, θ1, θ2, θ3 climb F(θ) toward θ*, following the derivatives g0, g1, g2]
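A minimal sketch of these five steps on the earlier example F(x) = −(x − 2)², with a fixed (hypothetical) scaling factor ρ:

```python
def gradient_ascent(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iters=1000):
    """Follow the derivative uphill: theta_{t+1} = theta_t + rho * F'(theta_t)."""
    theta = theta0
    for t in range(max_iters):
        y = F(theta)                 # 1. value
        g = F_prime(theta)           # 2. derivative
        if abs(g) < tol:             # converged: derivative ~ 0
            break
        theta = theta + rho * g      # 3-4. scale the step and move
    return theta

# Maximum of F(x) = -(x - 2)^2 is at x = 2 (where F'(x) = -2x + 4 = 0).
best = gradient_ascent(lambda x: -(x - 2) ** 2, lambda x: -2 * x + 4, theta0=0.0)
```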

Gradient = multi-variable derivative: K-dimensional input → K-dimensional output

Gradient Ascent


Recommended