
Page 1: MEMMs/CMMs and CRFs

MEMMs/CMMs and CRFs

William W. Cohen

Sep 22, 2010

Page 2: MEMMs/CMMs and CRFs

ANNOUNCEMENTS…

Page 3: MEMMs/CMMs and CRFs

Wiki Pages - HowTo

• http://malt.ml.cmu.edu/mw/index.php/Social_Media_Analysis_10-802_in_Spring_2010#Other_Resources

• Example: http://malt.ml.cmu.edu/mw/index.php/Turney,_ACL_2002

– Key points

• Naming the pages – examples:

– [[Cohen ICML 1995]]

– [[Lin and Cohen ICML 2010]]

– [[Minkov et al IJCAI 2005]]

• Structured links:

– [[AddressesProblem::named entity recognition]]

– [[UsesMethod::absolute discounting]]

– [[RelatedPaper::Pang et al ACL 2002]]

– [[UsesDataset::Citeseer]]

– [[Category::Paper]]

– [[Category::Problem]]

– [[Category::Method]]

– [[Category::Dataset]]

– Rule of 2: Don’t create a page unless you expect 2 inlinks

• A method from a paper that’s not used anywhere else should be described in-line

– No inverse links – but you can emulate these with queries


Page 4: MEMMs/CMMs and CRFs

Wiki Pages – HowTo, cont’d

• To turn in:

– Add them to the wiki

– Add links to them on your user page

– Send me an email with links to each page you want to get graded on

– [I may send back bug reports until people get the hang of this…]

• WhenTo: Three pages by 9/30 at midnight

– Actually 10/1 at dawn is fine.

• Suggestion:

– Think of your project and build pages for the dataset, the problem, and the (baseline) method you plan to use.


Page 5: MEMMs/CMMs and CRFs

Projects

• Some sample projects

– Apply existing method to a new problem

• http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf

– Apply new method to an existing dataset

– Build something that might help you in your research

• E.g., Extract names of people (pundits, politicians, …) from political blogs

• Classify folksonomy tags as person names, place names, …

• On Wed 9/29 - “turn in”:

– One page, covering some subset of:

• What you plan to do with what data
• Why you think it’s interesting
• Any relevant superpowers you might have
• How you plan to evaluate
• What techniques you plan to use
• What question you want to answer
• Who you might work with

– These will be posted on the class web site

• On Friday 10/8:

– Similar abstract from each team

• Team is (preferably) 2-3 people, but I’m flexible

• Main new information: who’s on what team

Page 6: MEMMs/CMMs and CRFs


Conditional Markov Models

Page 7: MEMMs/CMMs and CRFs


What is a symbol?

Ideally we would like to use many, arbitrary, overlapping features of words.

[Figure: HMM chain S_{t-1} → S_t → S_{t+1} with observations O_{t-1}, O_t, O_{t+1}; the observation “Wisniewski” is annotated with overlapping features: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names]

Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

Page 8: MEMMs/CMMs and CRFs

Stupid HMM tricks

[Figure: two-state HMM with states “red” and “green”: start → Pr(red), Pr(green); self-loops with Pr(green|green) = 1 and Pr(red|red) = 1]

Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)

argmax{y} Pr(y|x) = argmax{y} Pr(x|y) * Pr(y)

= argmax{y} Pr(y) * Pr(x1|y)*Pr(x2|y)*...*Pr(xm|y)

Pr(“I voted for Ralph Nader”|ggggg) =

Pr(g)*Pr(I|g)*Pr(voted|g)*Pr(for|g)*Pr(Ralph|g)*Pr(Nader|g)
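This “trick” is just Naive Bayes scoring drawn as a degenerate HMM. A minimal sketch of the computation, with made-up toy parameters (the priors and word probabilities below are illustrative, not from the slides):

```python
import math

# Hypothetical toy parameters: class priors Pr(y) and word emissions Pr(w|y).
priors = {"g": 0.5, "r": 0.5}
word_probs = {
    "g": {"i": 0.10, "voted": 0.20, "for": 0.20, "ralph": 0.25, "nader": 0.25},
    "r": {"i": 0.25, "voted": 0.25, "for": 0.30, "ralph": 0.10, "nader": 0.10},
}

def nb_log_score(words, y):
    """log Pr(y) + sum_j log Pr(w_j|y): the one-state-per-class HMM path score."""
    return math.log(priors[y]) + sum(math.log(word_probs[y][w]) for w in words)

words = "i voted for ralph nader".split()
print(max(priors, key=lambda y: nb_log_score(words, y)))  # argmax_y Pr(y)*Pr(x|y)
```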

Page 9: MEMMs/CMMs and CRFs


From NB to Maxent

Naive Bayes:

Pr(y|doc) = (1/Z) Pr(y) Π_j Pr(w_j|y), where w_j is the word in position j of doc

The same classifier in maxent form:

Pr(y|x) = (1/Z(x)) exp( λ_{0,y} + Σ_i λ_{i,y} f_i(x) )

where f_{k,j}(x) = [word k appears at position j of doc? 1 : 0], λ_{i,y} is the weight for the i-th (j,k) combination (for NB, λ_{i,y} = log Pr(w_k|y)), and λ_{0,y} = log Pr(y).

Page 10: MEMMs/CMMs and CRFs

From NB to Maxent

Pr(y|x) = (1/Z(x)) exp( λ_{0,y} + Σ_i λ_{i,y} f_i(x) ), where f_{k,j}(x) = [word k appears at position j of doc? 1 : 0] and w_j is the word in position j of x
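To make the correspondence concrete, here is a minimal maxent-style scorer over word-indicator features; the (word, label) weight map and the brute-force normalizer are illustrative assumptions, not the slides’ notation:

```python
import math
from collections import Counter

def maxent_prob(words, weights, labels):
    """Pr(y|x) = exp(sum_i lambda_i f_i(x,y)) / Z(x), with indicator features
    f_{w,y}(x) = [word w appears in x] tied to label y. `weights` maps
    (word, label) -> lambda; unseen pairs get weight 0."""
    counts = Counter(words)
    scores = {y: sum(n * weights.get((w, y), 0.0) for w, n in counts.items())
              for y in labels}
    z = sum(math.exp(s) for s in scores.values())  # normalizer Z(x)
    return {y: math.exp(s) / z for y, s in scores.items()}

# Setting weights[(w, y)] = log Pr(w|y) (plus a per-label bias log Pr(y))
# recovers the Naive Bayes posterior; training the same weights to maximize
# conditional likelihood instead gives the maxent / logistic-regression model.
```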

Page 11: MEMMs/CMMs and CRFs


What is a symbol?

[Figure: same HMM chain S_{t-1}, S_t, S_{t+1} / O_{t-1}, O_t, O_{t+1}; “Wisniewski” now annotated with even more overlapping features: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor]

Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history

Pr(s_t | x_t, s_{t-1}, s_{t-2}, …)
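What those overlapping features buy us, concretely: the local model sees one feature vector per position, mixing arbitrary observation tests with the previous state. A sketch (feature names are illustrative, not the actual feature set of the systems discussed below):

```python
def cmm_features(x, t, prev_state):
    """Features for the local model Pr(s_t | x, t, s_{t-1}): overlapping,
    non-independent observation tests plus the previous state."""
    w = x[t]
    return {
        "word=" + w.lower(): 1.0,
        "ends-in-ski": 1.0 if w.endswith("ski") else 0.0,
        "is-capitalized": 1.0 if w[:1].isupper() else 0.0,
        "prev-state=" + prev_state: 1.0,
    }
```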

Page 12: MEMMs/CMMs and CRFs


Ratnaparkhi’s MXPOST

• Sequential learning problem: predict POS tags of words.

• Uses MaxEnt model described above.

• Rich feature set.

• To smooth, discard features occurring < 10 times.

Page 13: MEMMs/CMMs and CRFs


MXPOST

Page 14: MEMMs/CMMs and CRFs


Inference for MENE

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

Page 15: MEMMs/CMMs and CRFs


Inference for MXPOST

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

Pr(y|x) = Π_i Pr(y_i | x, y_1, …, y_{i-1})
        = Π_i Pr(y_i | x, y_{i-k}, …, y_{i-1})   (order-k Markov assumption)
        = Π_i Pr(y_i | x, y_{i-1})               (k = 1)

(Approx view): find best path, weights are now on arcs from state to state.
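A minimal sketch of that best-path view, assuming a trained local model exposed as a function local_logscore(x, i, prev, y) = log Pr(y_i = y | x, i, y_{i-1} = prev); the function name and the START convention are assumptions, not the slides’ notation:

```python
def viterbi(x, states, local_logscore, start="START"):
    """Best-path CMM inference: the arc (y', y) at position i carries
    weight log Pr(y_i = y | x, i, y_{i-1} = y')."""
    scores = {start: 0.0}   # best log-score of any label prefix ending in a state
    backptrs = []           # per-position back-pointers
    for i in range(len(x)):
        new_scores, ptrs = {}, {}
        for y in states:
            best_prev = max(scores,
                            key=lambda p: scores[p] + local_logscore(x, i, p, y))
            new_scores[y] = scores[best_prev] + local_logscore(x, i, best_prev, y)
            ptrs[y] = best_prev
        scores = new_scores
        backptrs.append(ptrs)
    y = max(scores, key=scores.get)       # best final state
    path = [y]
    for ptrs in reversed(backptrs[1:]):   # walk the back-pointers
        y = ptrs[y]
        path.append(y)
    return path[::-1]
```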

Page 16: MEMMs/CMMs and CRFs


Inference for MXPOST

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

More accurately: find total flow to each node, weights are now on arcs from state to state.

α_t(y) = Σ_{y′} α_{t-1}(y′) · Pr(Y_t = y | x, Y_{t-1} = y′)
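Replacing max with sum in the same trellis sweep gives this “total flow” view; a short sketch under the same assumed local-model interface, with probabilities instead of log-scores:

```python
def forward_flow(x, states, local_prob, start="START"):
    """alpha_t(y) = sum_{y'} alpha_{t-1}(y') * Pr(Y_t = y | x, Y_{t-1} = y').
    Because every local distribution is normalized, the alphas at each
    position sum to 1: all the probability mass flows forward."""
    alpha = {start: 1.0}
    for t in range(len(x)):
        alpha = {y: sum(a * local_prob(x, t, yp, y) for yp, a in alpha.items())
                 for y in states}
    return alpha
```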

Page 17: MEMMs/CMMs and CRFs


Inference for MXPOST

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

Pr(y|x) = Π_i Pr(y_i | x, y_1, …, y_{i-1})
        = Π_i Pr(y_i | x, y_{i-k}, …, y_{i-1})   (order-k Markov assumption)
        = Π_i Pr(y_i | x, y_{i-2}, y_{i-1})      (k = 2)

Find best path? tree? Weights are on hyperedges

Page 18: MEMMs/CMMs and CRFs


Inference for MxPOST

[Beam over “When will prof Cohen post the notes …”: histories I, O at the first position expand to iI, iO, oI, oO]

Beam search is an alternative to Viterbi:

at each stage, find all children, score them, and discard all but the top n states
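A compact sketch of the procedure, reusing the assumed local_logscore interface from the Viterbi sketch above; histories are whole label prefixes, like the iI/oO-style labels on these slides:

```python
def beam_search(x, states, local_logscore, n=3, start="START"):
    """Keep only the top-n scored label histories at each position,
    instead of the full dynamic program."""
    beam = [([start], 0.0)]                 # (history, log-score) pairs
    for i in range(len(x)):
        children = [(hist + [y], s + local_logscore(x, i, hist[-1], y))
                    for hist, s in beam for y in states]
        children.sort(key=lambda c: c[1], reverse=True)
        beam = children[:n]                 # discard all but the top n
    return beam[0][0][1:]                   # best history, minus START
```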

Page 19: MEMMs/CMMs and CRFs


Inference for MxPOST

[Beam (continued) over “When will prof Cohen post the notes …”: the histories iI, iO, oI, oO expand to children oiI, oiO, ioI, ioO, ooI, ooO]

Beam search is an alternative to Viterbi:

at each stage, find all children, score them, and discard all but the top n states

Page 20: MEMMs/CMMs and CRFs


Inference for MxPOST

[Beam (continued) over “When will prof Cohen post the notes …”: the pruned beam oiI, oiO, ioI, ioO, ooI, ooO expands again to histories such as oiiI, oiiO, iooI, iooO, oooI, oooO]

Beam search is an alternative to Viterbi:

at each stage, find all children, score them, and discard all but the top n states

Page 21: MEMMs/CMMs and CRFs


MXPost results

• State-of-the-art accuracy (for 1996)

• Same approach used successfully for several other sequential classification steps of a stochastic parser (also state-of-the-art).

• Same (or similar) approaches used for NER by Borthwick, Malouf, Manning, and others.

Page 22: MEMMs/CMMs and CRFs

Freitag, McCallum, Pereira

Page 23: MEMMs/CMMs and CRFs


MEMMs

• Basic difference from ME tagging:

– ME tagging: previous state is a feature of the MaxEnt classifier

– MEMM: build a separate MaxEnt classifier for each state.

• Can build any HMM architecture you want; e.g., parallel nested HMMs, etc.

• Data is fragmented: examples where previous tag is “proper noun” give no information about learning tags when previous tag is “noun”

– Mostly a difference in viewpoint

– MEMM does allow the possibility of “hidden” states and Baum-Welch-like training

– Viterbi is the most natural inference scheme

Page 24: MEMMs/CMMs and CRFs


MEMM task: FAQ parsing

Page 25: MEMMs/CMMs and CRFs


MEMM features

Page 26: MEMMs/CMMs and CRFs


MEMMs

Page 27: MEMMs/CMMs and CRFs


Conditional Random Fields

Page 28: MEMMs/CMMs and CRFs

Implications of the MEMM model

• Does this do what we want?

• Q: does Y[i-1] depend on X[i+1]?

– “a node is conditionally independent of its non-descendants given its parents”

• Q: what is Y[0] for the sentence “Qbbzzt of America Inc announced layoffs today in …”

Page 29: MEMMs/CMMs and CRFs

Inference for MXPOST

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

Pr(y|x) = Π_i Pr(y_i | x, y_1, …, y_{i-1})
        = Π_i Pr(y_i | x, y_{i-k}, …, y_{i-1})   (order-k Markov assumption)
        = Π_i Pr(y_i | x, y_{i-1})               (k = 1)

(Approx view): find best path, weights are now on arcs from state to state.

Page 30: MEMMs/CMMs and CRFs

Inference for MXPOST

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

More accurately: find total flow to each node, weights are now on arcs from state to state.

α_t(y) = Σ_{y′} α_{t-1}(y′) · Pr(Y_t = y | x, Y_{t-1} = y′)

Flow out of a node is always fixed:

for all y′: Σ_y Pr(Y_t = y | x, Y_{t-1} = y′) = 1

Page 31: MEMMs/CMMs and CRFs

Label Bias Problem

• Consider this MEMM, and enough training data to perfectly model it:

Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1

Pr(0453|rib) = Pr(4|0,r)/Z1′ * Pr(5|4,i)/Z2′ * Pr(3|5,b)/Z3′ = 0.5 * 1 * 1

Pr(0123|rib)=1

Pr(0453|rob)=1
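A quick numeric check of the example: every state locally normalizes its outgoing arcs, so a state with a single successor passes on probability 1 whatever letter it sees. The toy automaton below follows the slide (states 0..5, “rob” via 0-1-2-3, “rib” via 0-4-5-3); the code layout is mine:

```python
# Locally normalized MEMM transitions: state -> letter -> {next: prob}.
# States 1, 2, 4, 5 each have one successor, so after local normalization
# they assign it probability 1 for *any* observed letter.
def pr_next(state, letter):
    if state == 0:                     # 'r' splits between the two branches
        return {1: 0.5, 4: 0.5}
    return {{1: 2, 2: 3, 4: 5, 5: 3}[state]: 1.0}  # forced, letter ignored

def path_prob(path, word):
    p = 1.0
    for state, nxt, letter in zip(path, path[1:], word):
        p *= pr_next(state, letter).get(nxt, 0.0)
    return p

for word in ("rob", "rib"):
    print(word, path_prob([0, 1, 2, 3], word))  # 0.5 for both: the middle
                                                # letter cannot matter
```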

Page 32: MEMMs/CMMs and CRFs

How important is label bias?

• Could be avoided in this case by changing structure:

• Our models are always wrong – is this “wrongness” a problem?

• See Klein & Manning’s paper for more on this….

Page 33: MEMMs/CMMs and CRFs

Another view of label bias [Sha & Pereira]

So what’s the alternative?

Page 34: MEMMs/CMMs and CRFs

Inference for MXPOST

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

More accurately: find total flow to each node, weights are now on arcs from state to state.

α_t(y) = Σ_{y′} α_{t-1}(y′) · Pr(Y_t = y | x, Y_{t-1} = y′)

Flow out of a node is always fixed:

for all y′: Σ_y Pr(Y_t = y | x, Y_{t-1} = y′) = 1

Page 35: MEMMs/CMMs and CRFs

Another max-flow scheme

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

More accurately: find total flow to each node, weights are now on arcs from state to state.

α_t(y) = Σ_{y′} α_{t-1}(y′) · Pr(Y_t = y | x, Y_{t-1} = y′)

Flow out of a node is always fixed:

for all y′: Σ_y Pr(Y_t = y | x, Y_{t-1} = y′) = 1

Page 36: MEMMs/CMMs and CRFs

Another max-flow scheme: MRFs

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

Goal is to learn how to weight edges in the graph:

• weight(y_i, y_{i+1}) = 2*[(y_i = B or I) and isCap(x_i)] + 1*[y_i = B and isFirstName(x_i)] - 5*[y_{i+1} ≠ B and isLower(x_i) and isUpper(x_{i+1})]
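A sketch of that hand-coded edge-weight function (isCap, isFirstName, isLower are stand-ins; the first-name list here is hypothetical, and in a CRF the coefficients 2, 1, -5 are what training would estimate):

```python
def edge_weight(y_i, y_next, x_i, x_next):
    """Score of the edge (y_i, y_{i+1}) given the words it spans."""
    is_cap = lambda w: w[:1].isupper()
    is_lower = lambda w: w.islower()
    is_first_name = lambda w: w in {"William", "Ralph"}   # hypothetical list
    w = 0.0
    if y_i in ("B", "I") and is_cap(x_i):
        w += 2.0
    if y_i == "B" and is_first_name(x_i):
        w += 1.0
    if y_next != "B" and is_lower(x_i) and is_cap(x_next):
        w -= 5.0
    return w
```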

Page 37: MEMMs/CMMs and CRFs

Another max-flow scheme: MRFs

[Trellis: states B, I, O at each token of “When will prof Cohen post the notes …”]

Find total flow to each node, weights are now on edges from state to state. Goal is to learn how to weight edges in the graph, given features from the examples.

Page 38: MEMMs/CMMs and CRFs

CRFs vs MEMMs

• MEMMs:

– Sequence classification f: x → y is reduced to many cases of ordinary classification, f: x_i → y_i

– …combined with Viterbi or beam search

• CRFs:

– Sequence classification f: x → y is done by:

• Converting x, Y to an MRF

• Using “flow” computations on the MRF to compute the best y|x

[Figure: MEMM chain over x_1…x_6, y_1…y_6 with locally normalized factors such as Pr(Y|x_2,y_1), Pr(Y|x_2,y_1′), Pr(Y|x_4,y_3), Pr(Y|x_5,y_5), vs. an MRF over the same chain with potentials φ(Y_1,Y_2), φ(Y_2,Y_3), …]

Page 39: MEMMs/CMMs and CRFs

The math: Review of maxent

Pr(x) ∝ exp(Σ_i λ_i f_i(x))

Pr(x,y) ∝ exp(Σ_i λ_i f_i(x,y))

Pr(y|x) = exp(Σ_i λ_i f_i(x,y)) / Σ_{y′} exp(Σ_i λ_i f_i(x,y′))

Page 40: MEMMs/CMMs and CRFs

Review of maxent/MEMM/CMMs

Pr(y|x) = exp(Σ_i λ_i f_i(x,y)) / Σ_{y′} exp(Σ_i λ_i f_i(x,y′)) = exp(Σ_i λ_i f_i(x,y)) / Z(x)

For MEMM:

Pr(y_1…y_n | x_1…x_n) = Π_j Pr(y_j | x, y_{j-1}) = Π_j exp(Σ_i λ_i f_i(x, j, y_j, y_{j-1})) / Z_j(x)

We know how to compute this.

Page 41: MEMMs/CMMs and CRFs

Details on CMMs

Pr(y_1…y_n | x_1…x_n) = Π_j Pr(y_j | x, y_{j-1}) = Π_j exp(Σ_i λ_i f_i(x, j, y_j, y_{j-1})) / Z_j(x)

= exp(Σ_j Σ_i λ_i f_i(x, j, y_j, y_{j-1})) / Π_j Z_j(x)

= exp(Σ_i λ_i F_i(x,y)) / Π_j Z_j(x), where F_i(x,y) = Σ_j f_i(x, j, y_j, y_{j-1})

Page 42: MEMMs/CMMs and CRFs

From CMMs to CRFs

CMM:

Pr(y|x) = Π_j exp(Σ_i λ_i f_i(x, j, y_j, y_{j-1})) / Z_j(x) = exp(Σ_i λ_i F_i(x,y)) / Π_j Z_j(x), where F_i(x,y) = Σ_j f_i(x, j, y_j, y_{j-1})

Recall why we’re unhappy: we don’t want local normalization.

New model:

Pr(y|x) = exp(Σ_i λ_i F_i(x,y)) / Z(x)

How to compute this?
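To make the global normalizer concrete: a brute-force sketch that computes Z(x) by enumerating whole label sequences (exponential, for toy examples only; the forward-style recursion on the following slides does this efficiently). The weight-function signature is an assumption:

```python
import math
from itertools import product

def crf_prob(x, y_seq, states, weight, start="START"):
    """Pr(y|x) = exp(sum_j weight(x, j, y_{j-1}, y_j)) / Z(x), with Z(x)
    summed over *whole label sequences*, not position by position."""
    def score(ys):
        s, prev = 0.0, start
        for j, y in enumerate(ys):
            s += weight(x, j, prev, y)
            prev = y
        return s
    z = sum(math.exp(score(ys)) for ys in product(states, repeat=len(x)))
    return math.exp(score(tuple(y_seq))) / z
```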

Page 43: MEMMs/CMMs and CRFs

What’s the new model look like?

Pr(y|x) = exp(Σ_i λ_i F_i(x,y)) / Z(x) = exp(Σ_i Σ_j λ_i f_i(x_j, y_j, y_{j-1})) / Z(x)

[Figure: chain with observations x_1, x_2, x_3 and labels y_1, y_2, y_3]

What’s independent? If f_i is HMM-like and depends only on x_j, y_j or y_j, y_{j-1}

Page 44: MEMMs/CMMs and CRFs

What’s the new model look like?

Pr(y|x) = exp(Σ_i λ_i F_i(x,y)) / Z(x) = exp(Σ_i Σ_j λ_i f_i(x, y_j, y_{j-1})) / Z(x)

[Figure: single observation node x connected to labels y_1, y_2, y_3]

What’s independent now??

Page 45: MEMMs/CMMs and CRFs

CRF learning – from Sha & Pereira

Page 46: MEMMs/CMMs and CRFs

CRF learning – from Sha & Pereira

Page 47: MEMMs/CMMs and CRFs

CRF learning – from Sha & Pereira

Something like forward-backward

Idea:

• Define a matrix of y,y′ “affinities” at stage i

• M_i[y,y′] = “unnormalized probability” of a transition from y to y′ at stage i

• M_i * M_{i+1} = “unnormalized probability” of any path through stages i and i+1
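A minimal numpy sketch of the matrix idea, with each stage’s affinity matrix assumed given, e.g. M_i[y,y′] = exp(Σ_k λ_k f_k(x, i, y, y′)); summing the final product over all start/end states is a simplification of Sha & Pereira’s START/STOP bookkeeping:

```python
import numpy as np

def partition_function(Ms):
    """Z(x) from per-stage affinity matrices: multiplying the M_i sums the
    unnormalized scores of all state paths, exactly like forward-backward."""
    total = Ms[0]
    for M in Ms[1:]:
        total = total @ M       # (M_i @ M_{i+1})[y, y''] sums over midpoint y'
    return float(total.sum())   # sum over all start and end states

# Two stages with affinities a..h, as in the example a couple of slides below:
a, b, c, d, e, f, g, h = range(1, 9)
Ms = [np.array([[a, b], [c, d]]), np.array([[e, f], [g, h]])]
print(partition_function(Ms))   # (ae+bg) + (af+bh) + (ce+dg) + (cf+dh)
```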

Page 48: MEMMs/CMMs and CRFs

[Figure: observation x with two copies of the label chain y_1, y_2, y_3]

Page 49: MEMMs/CMMs and CRFs

Forward backward ideas

[Trellis: states name / nonName at three positions, with edge weights a, b, c, d on the first step and e, f, g, h on the second]

( a b ) ( e f )   ( ae+bg  af+bh )
( c d ) ( g h ) = ( ce+dg  cf+dh )

Page 50: MEMMs/CMMs and CRFs

CRF learning – from Sha & Pereira

Page 51: MEMMs/CMMs and CRFs

Sha & Pereira results

CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

Page 52: MEMMs/CMMs and CRFs

Sha & Pereira results

in minutes, 375k examples

Page 53: MEMMs/CMMs and CRFs

Klein & Manning: Conditional Structure vs Estimation

Page 54: MEMMs/CMMs and CRFs

Task 1: WSD (Word Sense Disambiguation)

Bush’s election-year ad campaign will begin this summer, with... (sense1)

Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2)

Class is sense1/sense2, features are context words.

Page 55: MEMMs/CMMs and CRFs

Task 1: WSD (Word Sense Disambiguation)

Model 1: Naive Bayes multinomial model:

Use conditional rule to predict sense s from context-word observations o. Standard NB training maximizes “joint likelihood” under independence assumption

Page 56: MEMMs/CMMs and CRFs

Task 1: WSD (Word Sense Disambiguation)

Model 2: Keep same functional form, but maximize conditional likelihood (sound familiar?)

or maybe SenseEval score:

or maybe even:

Page 57: MEMMs/CMMs and CRFs

In other words…

Naïve Bayes:

Pr(Y|X) depends on features {f_{x,j,y}}, where f_{x,j,y}(X,Y) = 1 if X_j = x and Y = y, else 0

λ_{x,j,y} = log Pr(X_j = x | Y = y)

MaxEnt:

Pr(Y|X) ∝ exp(Σ_i λ_i f_i(X,Y))

λ_i chosen to maximize Pr(Y|X)

Different “optimization goals”…

… or, dropping a constraint about f’s and λ’s

Page 58: MEMMs/CMMs and CRFs

Task 1: WSD (Word Sense Disambiguation)

• Optimize JL with std NB learning

• Optimize SCL, CL with conjugate gradient

– Also over “non-deficient models” (?) using Lagrange penalties to enforce “soft” version of deficiency constraint

– I think this makes sure non-conditional version is a valid probability

• “Punt” on optimizing accuracy

• Penalty for extreme predictions in SCL

Page 59: MEMMs/CMMs and CRFs
Page 60: MEMMs/CMMs and CRFs

Conclusion: maxent beats NB? All generalizations are wrong?

Page 61: MEMMs/CMMs and CRFs

Task 2: POS Tagging

• Sequential problem

• Replace NB with HMM model.

• Standard algorithms maximize joint likelihood

• Claim: keeping the same model but maximizing conditional likelihood leads to a CRF

– Is this true?

• Alternative is conditional structure (CMM)

Page 62: MEMMs/CMMs and CRFs

Pr(Y|X) depends on features {f^s_{x,y,j}} ∪ {f^t_{y,y′,j}}, where

f^s_{x,y,j}(X,Y) = 1 if X_j = x and Y_j = y, else 0

f^t_{y,y′,j}(X,Y) = 1 if Y_j = y′ and Y_{j-1} = y, else 0

Pr(Y|X) ∝ exp(Σ_{j=1..|X|} Σ_i λ_i f_i(X,Y))

HMM: λ_{x,y,j} = log Pr(X_j = x | Y_j = y) and λ_{y,y′,j} = log Pr(Y_j = y′ | Y_{j-1} = y)

CRF: λ_i chosen to maximize Pr(Y|X); that is, with weights tied across positions j, Pr(Y|X) ∝ exp(Σ_{j=1..|X|} Σ_i λ_i f_{i,j}(X,Y))
Page 63: MEMMs/CMMs and CRFs

Using conditional structure vs maximizing conditional likelihood

CMM factors Pr(s,o) into Pr(s|o)Pr(o).

For the CMM model, adding dependencies between observations does not change Pr(s|o); i.e., the JL estimate = the CL estimate for Pr(s|o)

Page 64: MEMMs/CMMs and CRFs

Task 2: POS Tagging

Experiments with a simple feature set:

For fixed model, CL is preferred to JL (CRF beats HMM)

For fixed objective, HMM is preferred to MEMM/CMM

Page 65: MEMMs/CMMs and CRFs

Error analysis for POS tagging

• Label bias is not the issue:

– state-state dependencies are weak compared to observation-state dependencies

– too much emphasis on observations, not enough on previous states (“observation bias”)

– put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy…

Page 66: MEMMs/CMMs and CRFs

Error analysis for POS tagging