Conditional Random Fields
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• For example, POS tagging:
The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Another example, partial parsing (aka chunking):
The/B-NP cat/I-NP sat/B-VP on/B-PP the/B-NP mat/I-NP
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Another example, relation extraction:
The/B-Arg cat/I-Arg sat/B-Rel on/I-Rel the/B-Arg mat/I-Arg
The CRF Equation
• A CRF model consists of:
  – F = <f1, …, fk>, a vector of “feature functions”
  – θ = <θ1, …, θk>, a vector of weights for each feature function.
• Let O = <o1, …, oT> be an observed sentence.
• Let A = <a1, …, aT> be the latent variables.
$$P(A = y \mid O) = \frac{\exp\left(\theta \cdot F(y, O)\right)}{\sum_{y'} \exp\left(\theta \cdot F(y', O)\right)}$$

• This is the same as the Maximum Entropy equation!
• Note that the denominator depends on O, but not on y (it marginalizes over all possible label sequences y′).
CRF Equation, standard format

• Typically, we write

$$P(A = y \mid O) = \frac{1}{Z(O)} \exp\left(\theta \cdot F(y, O)\right)$$

where

$$Z(O) = \sum_{y'} \exp\left(\theta \cdot F(y', O)\right)$$
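To make Z(O) concrete, here is a brute-force sketch (my addition, not from the slides) that computes P(A = y | O) for a toy problem by enumerating every label sequence; the two feature functions and their weights are invented for illustration.

```python
import itertools
import math

# Toy setup: two labels, a three-word sentence, and two hypothetical
# feature functions with hand-picked weights (all illustrative only).
LABELS = ["N", "V"]
O = ["the", "cat", "sat"]

def F(y, O):
    """Global feature vector: counts of two illustrative features."""
    f1 = sum(1 for i in range(len(y)) if y[i] == "N")        # count of N labels
    f2 = sum(1 for i in range(len(y) - 1)                    # count of N -> V transitions
             if y[i] == "N" and y[i + 1] == "V")
    return [f1, f2]

theta = [0.5, 1.2]  # hypothetical weights

def score(y, O):
    return sum(t * f for t, f in zip(theta, F(y, O)))

def prob(y, O):
    # Z(O): sum of exp(score) over ALL |LABELS|^T label sequences (brute force).
    Z = sum(math.exp(score(yp, O))
            for yp in itertools.product(LABELS, repeat=len(O)))
    return math.exp(score(y, O)) / Z

print(prob(("N", "N", "V"), O))  # probability of one particular labeling
```

The brute-force normalizer is exactly what makes naive computation infeasible for long sentences, which is the motivation for the dynamic programs below.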
Making Structured Predictions
Aside: Structured prediction vs. Text Classification
Recall: max. ent. for text classification:

$$\hat{c} = \arg\max_c P(c \mid doc) = \arg\max_c \frac{1}{Z(doc)} \exp\left(\theta \cdot F(c, doc)\right) = \arg\max_c \theta \cdot F(c, doc)$$

CRFs for sequence labeling:

$$\hat{y} = \arg\max_y P(A = y \mid O) = \arg\max_y \frac{1}{Z(O)} \exp\left(\theta \cdot F(y, O)\right) = \arg\max_y \theta \cdot F(y, O)$$

(In both cases, Z does not depend on the variable being maximized over, and exp is monotonic, so the argmax reduces to maximizing the linear score.)

What’s the difference?
Aside: Structured prediction vs. Text Classification
Two (related) differences, both for the sake of efficiency:
1) Feature functions in CRFs are restricted to graph parts (described later).
2) We can’t do brute force to compute the argmax. Instead, we do Viterbi.
Finding the Best Sequence
Best sequence is

$$\hat{y} = \arg\max_y P(A = y \mid O) = \arg\max_y \frac{1}{Z(O)} \exp\left(\theta \cdot F(y, O)\right) = \arg\max_y \theta \cdot F(y, O)$$

Recall from the HMM discussion: if there are K possible states for each y_i variable, and N total y_i variables, then there are K^N possible settings for y (e.g., with K = 45 tags and N = 20 words, that is 45^20 ≈ 10^33 sequences). So brute force can’t find the best sequence. Instead, we resort to a Viterbi-like dynamic program.
Viterbi Algorithm
$$\delta_t(j) = \max_{y_1, \ldots, y_{t-1}} \theta \cdot F\left(\langle y_1, \ldots, y_{t-1}, y_t = j \rangle, \langle o_1, \ldots, o_t \rangle\right)$$

δ_t(j) is the score of the state sequence which maximizes the score of seeing the observations to time t−1, landing in state j at time t, and seeing the observation at time t.

[Trellis figure: states A1, …, At−1, At = j over observations o1, …, ot−1, ot, ot+1, …, oT]
Viterbi Algorithm
Compute the most likely state sequence by working backwards:

$$\hat{X}_T = \arg\max_i \delta_T(i)$$

$$\hat{X}_t = \psi_{t+1}(\hat{X}_{t+1})$$

$$P(\hat{X}) = \max_i \delta_T(i)$$

[Trellis figure: x1, …, xt−1, xt, xt+1, …, xT]
Viterbi Algorithm
Recursive Computation (the HMM version):

$$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

$$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

[Trellis figure: states A1, …, At−1, At = j, At+1 over observations o1, …, oT]

But the CRF quantity we want is

$$\delta_t(j) = \max_{y_1, \ldots, y_{t-1}} \theta \cdot F\left(\langle y_1, \ldots, y_{t-1}, y_t = j \rangle, \langle o_1, \ldots, o_t \rangle\right)$$

??! A CRF has no transition probabilities a_ij or emission probabilities b_jo to plug in, so the HMM recursion does not directly apply. The fix, below, is to restrict feature functions to graph parts.
Feature functions and Graph parts
To make efficient computation (dynamic programs) possible, we restrict the feature functions to:
Graph parts (or just parts): A feature function that counts how often a particular configuration occurs for a clique in the CRF graph.
Clique: a set of completely connected nodes in a graph. That is, each node in the clique has an edge connecting it to every other node in the clique.
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: linear-chain CRF with label nodes 1–6 over observations x1–x6]
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: the same chain with the individual-node cliques highlighted]
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: the same chain with the pair-of-consecutive-node cliques highlighted]
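A tiny sketch (added for illustration, not from the slides) that enumerates these cliques for a chain of length 6:

```python
# Cliques of a linear-chain CRF over labels y1..y6:
# every individual node, plus every pair of consecutive nodes.
n = 6
node_cliques = [(i,) for i in range(1, n + 1)]
pair_cliques = [(i, i + 1) for i in range(1, n)]
print(node_cliques)  # [(1,), (2,), ..., (6,)]
print(pair_cliques)  # [(1, 2), (2, 3), ..., (5, 6)]
```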
Clique Example
For non-linear-chain CRFs (something we won’t normally consider in this class), you can get larger cliques:
[Figure: a non-linear CRF over x1–x6 with an extra node 5′, forming larger cliques]
Graph part as Feature Function Example
Graph parts are feature functions p(y,x) that count how many cliques have a particular configuration.
For example, p(y,x) = count of [yi = Noun].
Here, y2 and y6 are both Nouns, so p(y,x) = 2.
[Figure: linear-chain CRF labeled y1=D, y2=N, y3=V, y4=D, y5=A, y6=N over x1–x6]
Graph part as Feature Function Example
For a pair-of-nodes example, p(y,x) = count of [yi = Noun, yi+1 = Verb].
Here, y2 is a Noun and y3 is a Verb, so p(y,x) = 1.
[Figure: linear-chain CRF labeled y1=D, y2=N, y3=V, y4=D, y5=A, y6=N over x1–x6]
Features can depend on the whole observation
In a CRF, each feature function can depend on x, in addition to a clique in y
Normally, we draw a CRF like this:
[Figure: an HMM and a CRF, each drawn over label nodes 1–6 and observations x1–x6]
Features can depend on the whole observation
In a CRF, each feature function can depend on x, in addition to a clique in y
But really, it’s more like this:
This would cause problems for a generative model, but in a conditional model, x is always a fixed constant. So we can still run the relevant algorithms, like Viterbi, efficiently.
[Figure: the HMM and CRF again, now drawn with every label node connected to the whole observation sequence x]
Graph part as Feature Function Example
An example part including x: p(y,x) = count of [yi = A or D, yi+1 = N, x2 = cat].
Here, y1 is a D and y2 is an N, plus y5 is an A and y6 is an N, and x2 = cat, so p(y,x) = 2.
Notice that the clique y5–y6 is allowed to depend on x2.
[Figure: y1=D The, y2=N cat, y3=V chased, y4=D the, y5=A tiny, y6=N fly]
Graph part as Feature Function Example
A more usual example including x: p(y,x) = count of [yi = A or D, yi+1 = N, xi+1 = cat].
Here, y1 is a D, y2 is an N, and x2 = cat, so p(y,x) = 1.
[Figure: y1=D The, y2=N cat, y3=V chased, y4=D the, y5=A tiny, y6=N fly]
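As a concrete sketch (added here; the part definitions mirror the slides’ examples, and the function names are mine), these counting feature functions might look like:

```python
# Sentence and labeling from the slides' running example.
x = ["The", "cat", "chased", "the", "tiny", "fly"]
y = ["D", "N", "V", "D", "A", "N"]

def p_node_noun(y, x):
    """Individual-node part: count of [y_i = N]."""
    return sum(1 for yi in y if yi == "N")

def p_pair_noun_verb(y, x):
    """Pair-of-nodes part: count of [y_i = N, y_{i+1} = V]."""
    return sum(1 for i in range(len(y) - 1)
               if y[i] == "N" and y[i + 1] == "V")

def p_pair_with_x(y, x):
    """Pair part that also looks at x:
    count of [y_i in {A, D}, y_{i+1} = N, x_{i+1} = 'cat']."""
    return sum(1 for i in range(len(y) - 1)
               if y[i] in ("A", "D") and y[i + 1] == "N" and x[i + 1] == "cat")

print(p_node_noun(y, x))       # 2  (y2 and y6)
print(p_pair_noun_verb(y, x))  # 1  (y2-y3)
print(p_pair_with_x(y, x))     # 1  (y1-y2, since x2 = cat)
```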
The CRF Equation, with Parts
• A CRF model consists of:
  – P = <p1, …, pk>, a vector of parts
  – θ = <θ1, …, θk>, a vector of weights for each part.
• Let O = <o1, …, oT> be an observed sentence.
• Let A = <a1, …, aT> be the latent variables.

$$P(A = y \mid O) = \frac{1}{Z(O)} \exp\left(\theta \cdot P(y, O)\right)$$

(Here the P inside the exp denotes the vector of parts, not a probability.)
Viterbi Algorithm – 2nd Try
$$\delta_t(j) = \max_{y_1, \ldots, y_{t-1}} \theta \cdot P\left(\langle y_1, \ldots, y_{t-1}, y_t = j \rangle, o\right)$$

Recursive Computation:

$$\delta_j(t+1) = \max_i \left[ \delta_i(t) + \theta_{one} \cdot p_{one}(y_{t+1} = j, o) + \theta_{pair} \cdot p_{pair}(y_t = i, y_{t+1} = j, o) \right]$$

$$\psi_j(t+1) = \arg\max_i \left[ \delta_i(t) + \theta_{one} \cdot p_{one}(y_{t+1} = j, o) + \theta_{pair} \cdot p_{pair}(y_t = i, y_{t+1} = j, o) \right]$$

Here θ_one · p_one sums the weighted individual-node parts that fire for y_{t+1} = j, and θ_pair · p_pair sums the weighted pair-of-node parts for the transition from y_t = i to y_{t+1} = j.

[Trellis figure: states A1, …, At−1, At = j, At+1 over observations o1, …, oT]
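A minimal runnable sketch of this recursion (my own illustration; node_score and pair_score are hypothetical stand-ins for θ_one · p_one and θ_pair · p_pair):

```python
import numpy as np

def viterbi(n_states, T, node_score, pair_score):
    """Viterbi for a linear-chain CRF.
    node_score(t, j): score of label j at position t (theta_one . p_one).
    pair_score(t, i, j): score of label i at t followed by j at t+1
                         (theta_pair . p_pair).
    Returns the highest-scoring label sequence."""
    delta = np.full((T, n_states), -np.inf)
    psi = np.zeros((T, n_states), dtype=int)   # backpointers
    delta[0] = [node_score(0, j) for j in range(n_states)]
    for t in range(T - 1):
        for j in range(n_states):
            scores = [delta[t, i] + pair_score(t, i, j) + node_score(t + 1, j)
                      for i in range(n_states)]
            psi[t + 1, j] = int(np.argmax(scores))
            delta[t + 1, j] = max(scores)
    # Backtrace from the best final state.
    y = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        y.append(int(psi[t, y[-1]]))
    return y[::-1]

# Toy usage with arbitrary illustrative scores:
print(viterbi(3, 4,
              node_score=lambda t, j: float((t + j) % 3),
              pair_score=lambda t, i, j: 0.5 if i != j else 0.0))
```

Because x (here folded into the score functions) is a fixed constant, both score tables can be precomputed once per sentence, keeping the whole procedure linear in sentence length.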
Supervised Parameter Estimation
Conditional Training
• Given a set of observations o and the correct labels y for each, determine the best θ:

$$\theta^* = \arg\max_\theta P(y \mid o, \theta)$$

• Because the CRF equation is just a special form of the maximum entropy equation, we can train it exactly the same way:
  – Determine the gradient.
  – Step in the direction of the gradient.
  – Repeat until convergence.
Recall: Training a ME model
Training is an optimization problem: find the value for λ that maximizes the conditional log-likelihood of the training data:

$$CLL(Train) = \sum_{(c,d) \in Train} \log P(c \mid d) = \sum_{(c,d) \in Train} \left[ \sum_i \lambda_i f_i(c, d) - \log Z(d) \right]$$
Recall: Training a ME model
Optimization is normally performed using some form of gradient ascent (equivalently, gradient descent on −CLL):
0) Initialize λ^0 to 0.
1) Compute the gradient: ∇CLL.
2) Take a step in the direction of the gradient: λ^{i+1} = λ^i + α ∇CLL.
3) Repeat until CLL doesn’t improve: stop when |CLL(λ^{i+1}) − CLL(λ^i)| < ε.
Recall: Training a ME model
Computing the gradient:

$$\frac{\partial}{\partial \lambda_i} CLL(Train) = \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in Train} \left[ \sum_j \lambda_j f_j(c, d) - \log Z(d) \right]$$

$$= \sum_{(c,d) \in Train} \left[ f_i(c, d) - \frac{\partial}{\partial \lambda_i} \log \sum_{c'} \exp\left( \sum_j \lambda_j f_j(c', d) \right) \right]$$

$$= \sum_{(c,d) \in Train} \left[ f_i(c, d) - \frac{\sum_{c'} f_i(c', d) \exp\left( \sum_j \lambda_j f_j(c', d) \right)}{\sum_{c'} \exp\left( \sum_j \lambda_j f_j(c', d) \right)} \right]$$

$$= \sum_{(c,d) \in Train} \left[ f_i(c, d) - \mathrm{E}_P\left[ f_i(c, d) \right] \right]$$

That is, the gradient is the empirical count of each feature minus its expected count under the model.
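To ground the derivation, here is a minimal NumPy sketch (added; the featurized data is synthetic and all dimensions are made up) of this gradient-ascent loop for a toy maximum entropy classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_feats, n_docs = 3, 5, 20

# Hypothetical featurized training data: f[c, d] is the feature vector f(c, d).
f = rng.random((n_classes, n_docs, n_feats))
labels = rng.integers(0, n_classes, n_docs)  # correct class for each doc

lam = np.zeros(n_feats)
alpha, eps = 0.1, 1e-6

def cll_and_grad(lam):
    scores = f @ lam                            # shape (n_classes, n_docs)
    logZ = np.logaddexp.reduce(scores, axis=0)  # log Z(d) per doc
    p = np.exp(scores - logZ)                   # P(c | d)
    cll = np.sum(scores[labels, np.arange(n_docs)] - logZ)
    # Gradient: empirical feature counts minus expected feature counts.
    empirical = f[labels, np.arange(n_docs)].sum(axis=0)
    expected = np.einsum("cd,cdk->k", p, f)
    return cll, empirical - expected

prev = -np.inf
for _ in range(10_000):               # cap iterations for safety
    cll, grad = cll_and_grad(lam)
    if abs(cll - prev) < eps:         # stop when CLL no longer improves
        break
    prev = cll
    lam += alpha * grad               # step in the direction of the gradient
print(lam)
```

For a CRF, the same empirical-minus-expected structure holds; the hard part, as the next slide notes, is computing the expected counts.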
The hard part for CRFs
Training a CRF: Expected feature counts
• … (sorry, ran out of time)
• (The missing piece: expected part counts require marginal probabilities of cliques under the model, which for linear chains can be computed with a forward–backward style dynamic program.)
CRFs vs. HMMs
Generative (Joint Probability) Models
• HMMs are generative models: that is, they can compute the joint probability P(sentence, hidden-states).
• From a generative model, one can compute:
  – Conditional models P(sentence | hidden-states) and P(hidden-states | sentence)
  – Marginal models P(sentence) and P(hidden-states)
• For sequence labeling, we want P(hidden-states | sentence).
Discriminative (Conditional) Models
• Most often, people are interested in the conditional probability P(hidden-states | sentence). For example, this is the distribution needed for sequence labeling.
• Discriminative (also called conditional) models directly represent the conditional distribution P(hidden-states | sentence).
  – These models cannot tell you the joint distribution, marginals, or other conditionals.
  – But they’re quite good at this particular conditional distribution.
Discriminative vs. Generative

• Marginal, or language model, P(sentence):
  – HMM (generative): forward algorithm or backward algorithm, linear in length of sentence.
  – CRF (discriminative): can’t do it.
• Find optimal label sequence:
  – HMM: Viterbi, linear in length of sentence.
  – CRF: Viterbi, linear in length of sentence.
• Supervised parameter estimation:
  – HMM: Bayesian learning, easy and fast.
  – CRF: convex optimization, can be quite slow.
• Unsupervised parameter estimation:
  – HMM: Baum-Welch (non-convex optimization), slow but doable.
  – CRF: very difficult, and requires making extra assumptions.
• Feature functions:
  – HMM: parents and children in the graph. Restrictive!
  – CRF: arbitrary functions of a latent state and any portion of the observed nodes.
CRFs vs. HMMs, a closer look
It’s possible to convert an HMM into a CRF:
• Set p_prior,state(y,x) = count[y1 = state]; set θ_prior,state = log P_HMM(y1 = state) = log π_state
• Set p_trans,state1,state2(y,x) = count[yi = state1, yi+1 = state2]; set θ_trans,state1,state2 = log P_HMM(yi+1 = state2 | yi = state1) = log A_state1,state2
• Set p_obs,state,word(y,x) = count[yi = state, xi = word]; set θ_obs,state,word = log P_HMM(xi = word | yi = state) = log B_state,word
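A small sketch of this conversion (added for illustration; π, A, and B follow the slides’ HMM notation):

```python
import numpy as np

def hmm_to_crf_weights(pi, A, B):
    """Convert HMM parameters to CRF weights for the corresponding parts.
    pi[s]: prior probability of starting in state s     -> theta_prior[s]
    A[s1, s2]: transition probability s1 -> s2          -> theta_trans[s1, s2]
    B[s, w]: emission probability of word w in state s  -> theta_obs[s, w]
    Each CRF weight is just the log of the HMM probability."""
    return np.log(pi), np.log(A), np.log(B)

# Toy 2-state, 3-word HMM (rows sum to 1):
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
theta_prior, theta_trans, theta_obs = hmm_to_crf_weights(pi, A, B)
print(theta_prior)  # all weights are logs of probabilities, hence <= 0
```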
CRF vs. HMM, a closer look
If we convert an HMM to a CRF, all of the CRF parameters θ will be logs of probabilities. Therefore, they will all be between −∞ and 0.
Notice: CRF parameters in general can be anywhere between −∞ and +∞.
So, how do HMMs and CRFs compare in terms of bias and variance (as sequence labelers)?– HMMs have more bias– CRFs have more variance
Comparing feature functions
The biggest advantage of CRFs over HMMs is that they can handle overlapping features.
For example, for POS tagging, using words as features (like xi = “the” or xi = “jogging”) is quite useful.
However, it’s often also useful to use “orthographic” features, like “the word ends in –ing” or “the word starts with a capital letter.”
These features overlap: the word “jogging”, for instance, fires both the word feature xi = “jogging” and the suffix feature “ends in –ing”.
• Generative models have to include parameters in the model for predicting when features will overlap.
• Discriminative models don’t: they can simply use the features.
CRF Example
A CRF POS Tagger for English
Vocabulary
We need to determine the set of possible word types V. Let
V = {all types in 1 million tokens of Wall Street Journal text, which we’ll use for training} ∪ {UNKNOWN}
where UNKNOWN stands in for word types we haven’t seen.
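A sketch of how V might be built (added for illustration; the function names are mine):

```python
def build_vocab(tokens):
    """V = all word types in the training tokens, plus UNKNOWN for unseen types."""
    return set(tokens) | {"UNKNOWN"}

def lookup(word, vocab):
    """Map any word to a type in V."""
    return word if word in vocab else "UNKNOWN"

vocab = build_vocab(["the", "cat", "sat", "on", "the", "mat"])
print(lookup("dog", vocab))  # -> UNKNOWN
```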
L = Label Set
Standard Penn Treebank tagset (Number. Tag — Description):
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
CRF Features (Feature Type — Description):
• Prior: for each label k — yi = k
• Transition: for each pair k, k′ — yi = k and yi+1 = k′
• Word: for each k, w — yi = k and xi = w
  for each k, w — yi = k and xi−1 = w
  for each k, w — yi = k and xi+1 = w
  for each k, w, w′ — yi = k and xi = w and xi−1 = w′
  for each k, w, w′ — yi = k and xi = w and xi+1 = w′
• Orthography, suffix: for each s in {“ing”, “ed”, “ogy”, “s”, “ly”, “ion”, “tion”, “ity”, …} and each k — yi = k and xi ends with s
• Orthography, punctuation/shape: for each k —
  yi = k and xi is capitalized
  yi = k and xi is hyphenated
  yi = k and xi contains a period
  yi = k and xi is ALL CAPS
  yi = k and xi contains a digit (0–9)
  …
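A sketch of how these feature templates might be instantiated in code (added; the template names and function signature are invented):

```python
import re

def pos_features(x, i, yi, yprev=None):
    """Instantiate the feature templates above for position i.
    Returns the names of the (binary) features that fire."""
    w = x[i]
    feats = [f"prior:{yi}",
             f"word:{yi}:{w}",
             f"word+1:{yi}:{x[i+1]}" if i + 1 < len(x) else None,
             f"word-1:{yi}:{x[i-1]}" if i > 0 else None]
    if yprev is not None:
        feats.append(f"trans:{yprev}:{yi}")           # transition template
    for s in ("ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity"):
        if w.endswith(s):
            feats.append(f"suffix:{yi}:{s}")          # orthographic suffix
    if w[0].isupper():
        feats.append(f"cap:{yi}")                     # capitalized
    if "-" in w:
        feats.append(f"hyphen:{yi}")                  # hyphenated
    if w.isupper():
        feats.append(f"allcaps:{yi}")                 # ALL CAPS
    if re.search(r"[0-9]", w):
        feats.append(f"digit:{yi}")                   # contains a digit
    return [f for f in feats if f is not None]

print(pos_features(["The", "cat", "sat"], 1, "NN", yprev="DT"))
```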