Maxent Models (II)


CMSC 473/673

UMBC

September 20th, 2017

Announcements: Assignment 1

Due 11:59 PM, Saturday 9/23

~3.5 days

Use the submit utility with:
  class id: cs473_ferraro
  assignment id: a1

We must be able to run it on GL!
Common pitfall #1: forgetting files
Common pitfall #2: incorrect paths to files
Common pitfall #3: 3rd-party libraries

Announcements: Question 6

Recursive form:

$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_n f^{(n)}(x_{i-n+1:i}) + (1 - \lambda_n)\, p^{(n-1)}(x_i \mid x_{i-n+2:i-1})$

Flattened form:

$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_{n,n} f^{(n)}(x_{i-n+1:i}) + \lambda_{n,n-1} f^{(n-1)}(x_{i-n+2:i}) + \cdots + \lambda_{n,0} f^{(0)}(\cdot)$

where $\lambda_{n,0} = 1 - \sum_{m=0}^{n-1} \lambda_{n,n-m}$
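A minimal sketch of the recursive form above (not the assignment solution); f_ngram(n, context, word) is a hypothetical function returning the order-n relative frequency f^(n), and lambdas holds the interpolation weights:

```python
# Sketch of recursive interpolation, assuming hypothetical helpers f_ngram and lambdas.
def interp_prob(n, context, word, f_ngram, lambdas):
    """p^(n)(word | context) = lambda_n * f^(n) + (1 - lambda_n) * p^(n-1)(word | shorter context)."""
    if n == 0:
        return f_ngram(0, (), word)      # base case: treat p^(0) as f^(0)
    lam = lambdas[n]
    higher = f_ngram(n, context, word)   # f^(n)(x_{i-n+1:i})
    lower = interp_prob(n - 1, context[1:], word, f_ngram, lambdas)
    return lam * higher + (1 - lam) * lower
```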

Announcements: Course Project

Official handout will be out Friday 9/22

Until then, focus on assignment 1

Teams of 1-3

Mixed undergrad/grad is encouraged but not required

Some novel aspect is needed

Ex 1: reimplement existing technique and apply to new domain

Ex 2: reimplement existing technique and apply to new (human) language

Ex 3: explore novel technique on existing problem

Recap from last time…

Classify or Decode with Bayes Rule

how well does text Y represent label X?

how likely is label X overall?

For “simple” or “flat” labels (see the sketch below):
* iterate through labels
* evaluate the score for each label, keeping only the best (or n best)
* return the best (or n best) label and score
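A minimal sketch of this decode loop; log_p_text_given_label and log_p_label are hypothetical scoring functions standing in for p(Y | X) and p(X):

```python
import math

def classify(text, labels, log_p_text_given_label, log_p_label):
    """Return the best label under Bayes' rule: argmax_X p(Y | X) * p(X)."""
    best_label, best_score = None, -math.inf
    for label in labels:
        # work in log space: log p(Y | X) + log p(X)
        score = log_p_text_given_label(text, label) + log_p_label(label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```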

Classification Evaluation: the 2-by-2 contingency table

                           Actually Correct      Actually Incorrect
Selected/Guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)

[Diagram: the space of classes/choices, with overlapping regions for what is correct and what is guessed]

Classification Evaluation: Accuracy, Precision, and Recall

Accuracy: % of items correct

Precision: % of selected items that are correct

Recall: % of correct items that are selected

                           Actually Correct      Actually Incorrect
Selected/Guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)
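A minimal sketch of the three metrics in terms of the table's counts (tp, fp, fn, tn):

```python
def accuracy(tp, fp, fn, tn):
    # % of items correct: the diagonal of the contingency table
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # % of selected (guessed) items that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # % of correct items that are selected
    return tp / (tp + fn)
```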

Language Modeling as Naïve Bayes Classifier

Adopt naïve bag of words representation Yi

Assume position doesn’t matter

Assume the feature probabilities are independent given the class X
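Under these assumptions the class score factors into the prior times per-word probabilities. A minimal sketch in log space; log_prior(x) and log_p_word(w, x) are hypothetical lookups for log p(X) and log p(Y_i | X):

```python
from collections import Counter

def naive_bayes_score(words, label, log_prior, log_p_word):
    """log p(X) + sum_i log p(Y_i | X): positions ignored, words independent given X."""
    score = log_prior(label)
    for word, count in Counter(words).items():
        score += count * log_p_word(word, label)
    return score
```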

Naïve Bayes Summary

Potential Advantages

Very Fast, low storage requirements

Robust to Irrelevant Features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

Potential Issues

Model the posterior in one go?

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data?

(automated, more principled)


Relevant for classification… and language modeling

Maximum Entropy Models

a more general language model

classify in one go: instead of argmax_X p(Y | X) · p(X), directly model argmax_X p(X | Y)

ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

We need to score the different combinations.

Document Classification

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)

COMBINE → posterior probability of ATTACK

are all of these uncorrelated?

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)

score2(seriously wounded, ATTACK)

score3(Shining Path, ATTACK)

COMBINEposterior

probability of ATTACK

Q: What are the score and combine functions for Naïve Bayes?

Scoring Our Possibilities

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

score(doc, ATTACK) =
    score1(fatally shot, ATTACK)
    score2(seriously wounded, ATTACK)
    score3(Shining Path, ATTACK)
    …

Learn these scores… but how?

What do we optimize?

p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))

Maxent Modeling

p(ATTACK | doc) ∝ exp(score(doc, ATTACK))

f(x) = exp(x)

exp gives a positive, unnormalized probability

p(ATTACK | doc) ∝ exp( score1(fatally shot, ATTACK) +
                        score2(seriously wounded, ATTACK) +
                        score3(Shining Path, ATTACK) + … )

Maxent Modeling

Learn the scores (but we’ll declare what combinations should be looked at)

p(ATTACK | doc) ∝ exp( weight1 * applies1(fatally shot, ATTACK) +
                        weight2 * applies2(seriously wounded, ATTACK) +
                        weight3 * applies3(Shining Path, ATTACK) + … )


Q: What if none of our features apply?

Guiding Principle for Log-Linear Models

“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.” (Edwin T. Jaynes, 1957)


If none of our features apply, then f = 0, so exp(θ · f) = exp(θ · 0) = 1

exp( θ1 * f1(fatally shot, ATTACK) +
     θ2 * f2(seriously wounded, ATTACK) +
     θ3 * f3(Shining Path, ATTACK) + … )

Easier-to-write form

With K weights and K features, the sum above is just a dot product:

exp( θ · f(doc, ATTACK) )

θ: K-dimensional weight vector
f: K-dimensional feature vector
θ · f: dot product
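A minimal sketch of this unnormalized score with sparse features; extract_features (returning a dict of active feature values) and the weight dict theta are hypothetical names, not from the slides:

```python
import math

def unnormalized_score(doc, label, extract_features, theta):
    """exp(theta . f(doc, label)) with sparse features stored in dicts."""
    dot = 0.0
    for name, value in extract_features(doc, label).items():
        dot += theta.get(name, 0.0) * value   # features that don't apply contribute 0
    return math.exp(dot)
```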

Log-Linear Models

p_θ(x | y) ∝ exp(θ · f(x, y))

f(x, y): feature function(s), a.k.a. sufficient statistics or “strength” function(s)
θ: feature weights, a.k.a. natural parameters or distribution parameters

Log-Linear Models

How do we normalize?

Z = Σ_{label x} exp( weight1 * applies1(fatally shot, x) +
                     weight2 * applies2(seriously wounded, x) +
                     weight3 * applies3(Shining Path, x) + … )

Normalization for Classification

p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go

p(x | y) = exp(θ · f(x, y)) / Σ_{x′} exp(θ · f(x′, y))
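A minimal sketch of normalizing over the label set; extract_features and theta are hypothetical stand-ins as before:

```python
import math

def classify_posterior(doc, labels, extract_features, theta):
    """p(x | y) = exp(theta . f(x, y)) / sum over labels x' of exp(theta . f(x', y))."""
    logits = {}
    for x in labels:
        feats = extract_features(doc, x)               # sparse dict: name -> value
        logits[x] = sum(theta.get(k, 0.0) * v for k, v in feats.items())
    m = max(logits.values())                           # subtract max for numerical stability
    exps = {x: math.exp(l - m) for x, l in logits.items()}
    Z = sum(exps.values())                             # the normalizer
    return {x: e / Z for x, e in exps.items()}
```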

Normalization for Language Model

general class-based (X) language model of doc y

Can be significantly harder in the general case

Simplifying assumption: maxent n-grams!
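A minimal sketch of the maxent n-gram simplification: normalize over a finite vocabulary given the history rather than over whole documents; extract_features and theta are the same hypothetical stand-ins:

```python
import math

def ngram_prob(word, history, vocab, extract_features, theta):
    """p(word | history) = exp(theta . f(history, word)) / sum over w in vocab."""
    def logit(w):
        feats = extract_features(history, w)           # sparse dict: name -> value
        return sum(theta.get(k, 0.0) * v for k, v in feats.items())
    Z = sum(math.exp(logit(w)) for w in vocab)         # per-context normalizer
    return math.exp(logit(word)) / Z
```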

https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/

https://goo.gl/B23Rxo

Lesson 1


Connections to Other Techniques

Log-Linear Models:
* as statistical regression: (multinomial) logistic regression / softmax regression
* based in information theory: Maximum Entropy models (MaxEnt)
* a form of: Generalized Linear Models
* viewed as: Discriminative Naïve Bayes
* to be cool today :): very shallow (sigmoidal) neural nets

Learning θ

p_θ(y | x): probabilistic model

objective (given observations)

How will we optimize F(θ)?

Calculus

[Figure: F(θ) plotted against θ, with its maximizer θ* marked; F′(θ) is the derivative of F with respect to θ]

Example

F(x) = −(x − 2)²

differentiate: F′(x) = −2x + 4

Solve F′(x) = 0 → x = 2

Common Derivative Rules

What if you can't find the roots? Follow the derivative.

Set t = 0
Pick a starting value θ_t
Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F′(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t * g_t
  5. Set t += 1
(see the code sketch below)

[Figure: successive iterates θ0, θ1, θ2, θ3 climb F(θ) toward θ*, following the derivatives g0, g1, g2]
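A minimal sketch of these five steps on the earlier example F(x) = −(x − 2)², with a fixed (hypothetical) scaling factor ρ:

```python
def gradient_ascent(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iters=1000):
    """Follow the derivative uphill: theta_{t+1} = theta_t + rho * F'(theta_t)."""
    theta = theta0
    for t in range(max_iters):
        y = F(theta)                 # 1. value
        g = F_prime(theta)           # 2. derivative
        if abs(g) < tol:             # converged: derivative ~ 0
            break
        theta = theta + rho * g      # 3-4. scale the step and move
    return theta

# Maximum of F(x) = -(x - 2)^2 is at x = 2 (where F'(x) = -2x + 4 = 0).
best = gradient_ascent(lambda x: -(x - 2) ** 2, lambda x: -2 * x + 4, theta0=0.0)
```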

Gradient = multi-variable derivative: K-dimensional input → K-dimensional output

Gradient Ascent


Recommended