
Introduction to Conditional Random Fields

Jan 3rd, 2012

Imagine you have a sequence of snapshots from a day in Justin Bieber’s life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this?

One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month’s worth of labeled snapshots, you might learn that dark images taken at 6am tend to be about sleeping, images with lots of bright colors tend to be about dancing, images of cars are about driving, and so on.

By ignoring this sequential aspect, however, you lose a lot of information. For example, what happens if you see a close-up picture of a mouth – is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it’s more likely this picture is about eating; if, however, the previous image contains Justin Bieber singing or dancing, then this one probably shows him singing as well.

Thus, to increase the accuracy of our labeler, we should incorporate the labels of nearby photos, and this is precisely what a conditional random field does.

Part-of-Speech Tagging

Let’s go into some more detail, using the more common example of part-of-speech tagging.

In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.


For example, given the sentence “Bob drank coffee at Starbucks”, the labeling might be “Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)”.

So let’s build a conditional random field to label sentences with their parts of speech. Just like any classifier, we’ll first need to decide on a set of feature functions $f_i$.

Feature Functions in a CRF

In a CRF, each feature function is a function that takes in as input:

- a sentence $s$
- the position $i$ of a word in the sentence
- the label $l_i$ of the current word
- the label $l_{i-1}$ of the previous word

and outputs a real-valued number (though the numbers are often just either 0 or 1).

(Note: by restricting our features to depend on only the current and previous labels, rather than arbitrary labels throughout the sentence, I’m actually building the special case of a linear-chain CRF. For simplicity, I’m going to ignore general CRFs in this post.)

For example, one possible feature function could measure how much we suspect that the current word should be labeled as an adjective given that the previous word is “very”.
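To make this concrete, here is a minimal sketch of such a feature function in Python (the function and argument names are mine, not from the post; positions are 0-indexed):

```python
def f_prev_very_adjective(sentence, i, label, prev_label):
    """1 if the word before position i is "very" and the current label is ADJECTIVE, else 0.

    sentence   : list of word tokens
    i          : position of the current word
    label      : label l_i of the current word
    prev_label : label l_{i-1} of the previous word (None at the start of the sentence)
    """
    if i > 0 and sentence[i - 1].lower() == "very" and label == "ADJECTIVE":
        return 1
    return 0
```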

Features to Probabilities

Next, assign each feature function $f_j$ a weight $\lambda_j$ (I’ll talk below about how to learn these weights from the data). Given a sentence $s$, we can now score a labeling $l$ of $s$ by adding up the weighted features over all words in the sentence:

$$\text{score}(l|s) = \sum_{j=1}^m \sum_{i=1}^n \lambda_j f_j(s, i, l_i, l_{i-1})$$

(The first sum runs over each feature function $j$, and the inner sum runs over each position $i$ of the sentence.)

Finally, we can transform these scores into probabilities $p(l|s)$ between 0 and 1 by exponentiating and normalizing:

$$p(l|s) = \frac{\exp[\text{score}(l|s)]}{\sum_{l'} \exp[\text{score}(l'|s)]} = \frac{\exp\left[\sum_{j=1}^m \sum_{i=1}^n \lambda_j f_j(s, i, l_i, l_{i-1})\right]}{\sum_{l'} \exp\left[\sum_{j=1}^m \sum_{i=1}^n \lambda_j f_j(s, i, l'_i, l'_{i-1})\right]}$$
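As a rough illustration of these two formulas (not code from the post), here is a brute-force sketch that scores a labeling and normalizes over every possible labeling; it assumes feature functions with the signature above and is only workable for tiny examples:

```python
import itertools
import math

def score(labeling, sentence, feature_functions, weights):
    """score(l|s): weighted sum of every feature function over every position."""
    total = 0.0
    for weight, f in zip(weights, feature_functions):
        for i in range(len(sentence)):
            prev_label = labeling[i - 1] if i > 0 else None
            total += weight * f(sentence, i, labeling[i], prev_label)
    return total

def probability(labeling, sentence, feature_functions, weights, label_set):
    """p(l|s): exponentiate the score and normalize over all candidate labelings."""
    numerator = math.exp(score(labeling, sentence, feature_functions, weights))
    denominator = sum(
        math.exp(score(l_prime, sentence, feature_functions, weights))
        for l_prime in itertools.product(label_set, repeat=len(sentence))
    )
    return numerator / denominator
```

The exhaustive sum in the denominator is exactly the exponential blow-up that the “Finding the Optimal Labeling” section below sidesteps with dynamic programming.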

Example Feature Functions



So what do these feature functions look like? Examples of POS tagging features could include:

- $f_1(s, i, l_i, l_{i-1}) = 1$ if $l_i =$ ADVERB and the $i$th word ends in “-ly”; 0 otherwise. If the weight $\lambda_1$ associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.

- $f_2(s, i, l_i, l_{i-1}) = 1$ if $i = 1$, $l_i =$ VERB, and the sentence ends in a question mark; 0 otherwise. Again, if the weight $\lambda_2$ associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., “Is this a sentence beginning with a verb?”) are preferred.

- $f_3(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ ADJECTIVE and $l_i =$ NOUN; 0 otherwise. Again, a positive weight for this feature means that adjectives tend to be followed by nouns.

- $f_4(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ PREPOSITION and $l_i =$ PREPOSITION. A negative weight $\lambda_4$ for this function would mean that prepositions don’t tend to follow prepositions, so we should avoid labelings where this happens.

And that’s it! To sum up: to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary.

Now let’s step back and compare CRFs to some other common machine learning techniques.

Smells like Logistic Regression…

The form of the CRF probabilities

$$p(l|s) = \frac{\exp\left[\sum_{j=1}^m \sum_{i=1}^n \lambda_j f_j(s, i, l_i, l_{i-1})\right]}{\sum_{l'} \exp\left[\sum_{j=1}^m \sum_{i=1}^n \lambda_j f_j(s, i, l'_i, l'_{i-1})\right]}$$

might look familiar.

That’s because CRFs are indeed basically the sequential version of logistic regression: whereas logistic regression is a log-linear model for classification, CRFs are a log-linear model for sequential labels.

Looks like HMMs…

Recall that Hidden Markov Models are another model for part-of-speech tagging (and sequential labeling in general). Whereas CRFs throw any bunch of functions together to get a label score, HMMs take a generative approach to labeling, defining

$$p(l, s) = p(l_1) \prod_i p(l_i | l_{i-1}) p(w_i | l_i)$$

where


- $p(l_i | l_{i-1})$ are transition probabilities (e.g., the probability that a preposition is followed by a noun);
- $p(w_i | l_i)$ are emission probabilities (e.g., the probability that a noun emits the word “dad”).

So how do HMMs compare to CRFs? CRFs are more powerful – they can model everything HMMs can and more. One way of seeing this is as follows.

Note that the log of the HMM probability is

$$\log p(l, s) = \log p(l_1) + \sum_i \log p(l_i | l_{i-1}) + \sum_i \log p(w_i | l_i).$$

This has exactly the log-linear form of a CRF if we consider these log-probabilities to be the weights associated with binary transition and emission indicator features.

That is, we can build a CRF equivalent to any HMM by…

- For each HMM transition probability $p(l_i = y | l_{i-1} = x)$, define a set of CRF transition features of the form $f_{x,y}(s, i, l_i, l_{i-1}) = 1$ if $l_i = y$ and $l_{i-1} = x$. Give each feature a weight of $w_{x,y} = \log p(l_i = y | l_{i-1} = x)$.
- Similarly, for each HMM emission probability $p(w_i = z | l_i = x)$, define a set of CRF emission features of the form $g_{x,z}(s, i, l_i, l_{i-1}) = 1$ if $w_i = z$ and $l_i = x$. Give each feature a weight of $w_{x,z} = \log p(w_i = z | l_i = x)$.

Thus, the score $p(l|s)$ computed by a CRF using these feature functions is precisely proportional to the score computed by the associated HMM, and so every HMM is equivalent to some CRF.
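Here is a sketch of that construction in code, assuming (my assumption, not the post's) that the HMM's probabilities are handed to us as plain dictionaries; each indicator feature simply gets the corresponding log-probability as its weight:

```python
import math

def hmm_to_crf_features(transition_probs, emission_probs):
    """Build CRF (feature function, weight) pairs from an HMM's probabilities.

    transition_probs : dict mapping (x, y) -> p(l_i = y | l_{i-1} = x)
    emission_probs   : dict mapping (x, z) -> p(w_i = z | l_i = x)
    """
    features = []

    # Transition indicator: fires when l_{i-1} = x and l_i = y; weight = log p(y | x).
    for (x, y), p in transition_probs.items():
        def f(sentence, i, label, prev_label, x=x, y=y):
            return 1 if label == y and prev_label == x else 0
        features.append((f, math.log(p)))

    # Emission indicator: fires when l_i = x and w_i = z; weight = log p(z | x).
    for (x, z), p in emission_probs.items():
        def g(sentence, i, label, prev_label, x=x, z=z):
            return 1 if label == x and sentence[i] == z else 0
        features.append((g, math.log(p)))

    return features
```

Summing these weighted features over a sentence recovers $\sum_i \log p(l_i | l_{i-1}) + \sum_i \log p(w_i | l_i)$; an extra feature for the initial term $\log p(l_1)$ would make the match exact.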

However, CRFs can model a much richer set of label distributions as well, for two main reasons:

- CRFs can define a much larger set of features. Whereas HMMs are necessarily local in nature (because they’re constrained to binary transition and emission feature functions, which force each word to depend only on the current label and each label to depend only on the previous label), CRFs can use more global features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the first word of a sentence as a VERB if the end of the sentence contained a question mark.
- CRFs can have arbitrary weights. Whereas the probabilities of an HMM must satisfy certain constraints (e.g., $0 \le p(w_i | l_i) \le 1$ and $\sum_w p(w_i = w | l_i) = 1$), the weights of a CRF are unrestricted (e.g., $\log p(w_i | l_i)$ can be anything it wants).

Learning Weights

Let’s go back to the question of how to learn the feature weights in a CRF. One way is (surprise) to use gradient ascent.

Assume we have a bunch of training examples (sentences and associated part-of-speech labels). Randomly initialize the weights of our CRF model. To shift these randomly initialized weights to



the correct ones, for each training example…

- Go through each feature function $f_i$, and calculate the gradient of the log probability of the training example with respect to $\lambda_i$:

$$\frac{\partial}{\partial \lambda_i} \log p(l|s) = \sum_{j=1}^n f_i(s, j, l_j, l_{j-1}) - \sum_{l'} p(l'|s) \sum_{j=1}^n f_i(s, j, l'_j, l'_{j-1})$$

Note that the first term in the gradient is the contribution of feature $f_i$ under the true label, and the second term in the gradient is the expected contribution of feature $f_i$ under the current model. This is exactly the form you’d expect gradient ascent to take.

- Move $\lambda_i$ in the direction of the gradient:

$$\lambda_i = \lambda_i + \alpha \left[ \sum_{j=1}^n f_i(s, j, l_j, l_{j-1}) - \sum_{l'} p(l'|s) \sum_{j=1}^n f_i(s, j, l'_j, l'_{j-1}) \right]$$

where $\alpha$ is some learning rate.

- Repeat the previous steps until some stopping condition is reached (e.g., the updates fall below some threshold).

In other words, every step takes the difference between what we want the model to learn and the model’s current state, and moves in the direction of this difference.
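A brute-force sketch of one such update (again my own illustrative code, with the same hypothetical feature-function signature as above): the model expectation is computed by enumerating every candidate labeling, which is only feasible for toy examples; real implementations compute it with dynamic programming.

```python
import itertools
import math

def gradient_ascent_step(sentence, true_labeling, feature_functions, weights,
                         label_set, learning_rate=0.1):
    """One gradient-ascent update of all the weights on a single training example."""
    n = len(sentence)

    def feature_totals(labeling):
        # Value of each feature function summed over every position of the sentence.
        return [sum(f(sentence, j, labeling[j], labeling[j - 1] if j > 0 else None)
                    for j in range(n))
                for f in feature_functions]

    # Current model distribution p(l'|s) over every candidate labeling.
    candidates = list(itertools.product(label_set, repeat=n))
    totals = {l: feature_totals(l) for l in candidates}
    exp_scores = {l: math.exp(sum(w * t for w, t in zip(weights, totals[l])))
                  for l in candidates}
    z = sum(exp_scores.values())

    empirical = feature_totals(true_labeling)  # contribution under the true labels
    expected = [sum((exp_scores[l] / z) * totals[l][k] for l in candidates)
                for k in range(len(feature_functions))]  # expected contribution under the model

    return [w + learning_rate * (emp - exp)
            for w, emp, exp in zip(weights, empirical, expected)]
```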

Finding the Optimal Labeling

Suppose we’ve trained our CRF model, and now a new sentence comes in. How do we label it?

The naive way is to calculate $p(l|s)$ for every possible labeling $l$, and then choose the labeling that maximizes this probability. However, since there are $k^n$ possible labelings for a tag set of size $k$ and a sentence of length $n$, this approach would have to check an exponential number of labelings.

A better way is to realize that (linear-chain) CRFs satisfy an optimal substructure property that allows us to use a (polynomial-time) dynamic programming algorithm to find the optimal labeling, similar to the Viterbi algorithm for HMMs.
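A sketch of that dynamic program (a Viterbi-style recursion over the CRF scores; the helper names are mine):

```python
def best_labeling(sentence, feature_functions, weights, label_set):
    """argmax over labelings of score(l|s), via Viterbi-style dynamic programming."""
    n = len(sentence)
    labels = list(label_set)

    def local_score(i, label, prev_label):
        # Weighted feature total contributed at position i.
        return sum(w * f(sentence, i, label, prev_label)
                   for f, w in zip(feature_functions, weights))

    # best[i][y] = (best score of any labeling of words 0..i that ends in y, backpointer)
    best = [dict() for _ in range(n)]
    for y in labels:
        best[0][y] = (local_score(0, y, None), None)
    for i in range(1, n):
        for y in labels:
            best[i][y] = max((best[i - 1][prev][0] + local_score(i, y, prev), prev)
                             for prev in labels)

    # Trace the highest-scoring path back from the last position.
    path = [max(labels, key=lambda y: best[n - 1][y][0])]
    for i in range(n - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

This runs in roughly O(n·k²·m) time for m feature functions, instead of scoring all k^n labelings.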

A More Interesting Application

Okay, so part-of-speech tagging is kind of boring, and there are plenty of existing POS taggers out there. When might you use a CRF in real life?

Suppose you want to mine Twitter for the types of presents people received for Christmas:



[Embedded tweet from Edwin Chen (@echen), 8:48 PM - 1 Jan 2012: “What people on Twitter wanted for Christmas, and what they got:”, followed by the graphs.]

(Yes, I just embedded a tweet. BOOM.)

How can you figure out which words refer to gifts?

To gather data for the graphs above, I simply looked for phrases of the form “I want XXX for Christmas” and “I got XXX for Christmas”. However, a more sophisticated CRF variant could use a GIFT part-of-speech-like tag (even adding other tags like GIFT-GIVER and GIFT-RECEIVER, to get even more information on who got what from whom) and treat this like a POS tagging problem. Features could be based around things like “this word is a GIFT if the previous word was a GIFT-RECEIVER and the word before that was ‘gave’” or “this word is a GIFT if the next two words are ‘for Christmas’”.
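For the simple phrase-matching baseline described above, a rough sketch might look like the following (the exact patterns used for the tweet data aren't given in the post, so these regular expressions are purely illustrative):

```python
import re

WANT = re.compile(r"i want (?:a |an |the )?(.+?) for christmas", re.IGNORECASE)
GOT = re.compile(r"i got (?:a |an |the )?(.+?) for christmas", re.IGNORECASE)

def extract_gifts(tweets):
    """Collect the XXX in "I want XXX for Christmas" and "I got XXX for Christmas"."""
    wanted, received = [], []
    for tweet in tweets:
        wanted.extend(m.group(1) for m in WANT.finditer(tweet))
        received.extend(m.group(1) for m in GOT.finditer(tweet))
    return wanted, received
```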

Fin

I’ll end with some more random thoughts:

- I explicitly skipped over the graphical models framework that conditional random fields sit in, because I don’t think they add much to an initial understanding of CRFs. But if you’re interested in learning more, Daphne Koller is teaching a free, online course on graphical models starting in January.
- Or, if you’re more interested in the many NLP applications of CRFs (like part-of-speech tagging or named entity extraction), Manning and Jurafsky are teaching an NLP class in the same spirit.


- I also glossed a bit over the analogy between CRFs:HMMs and Logistic Regression:Naive Bayes. This image (from Sutton and McCallum’s introduction to conditional random fields) sums it up, and shows the graphical model nature of CRFs as well:

[Figure 1.2 from Sutton and McCallum, relating Naive Bayes, logistic regression, HMMs, and linear-chain CRFs.]

Posted by Edwin Chen Jan 3rd, 2012


Comments

20 comments



Peter Charles • a year ago

I've got a problem with CRFs: You have added a figure which shows the analogy between different models (Figure 1.2 from Sutton and McCallum). Please look at the graph of linear-chain CRFs, which is the model you have described perfectly in this tutorial. This is actually a factor graph and shows the cliques (and thus the arguments of the local (feature) functions). As you can see, these cliques do not match the feature function definition you have given in this tutorial. According to your definition (which I have seen in many papers), each feature function is a function of the whole observation sequence and two consecutive labels. But the graph says that in linear-chain CRFs, each feature function is a function of two consecutive labels and the corresponding observations of those labels (not all the observation sequence), which is like HMMs. I expect a graph like Figure 6 of the Klinger and Tomanek CRF tutorial, which I have attached to this comment. It's really making me confused and nervous. Could you help please? Thanks so much.


Fabrizio • 2 months ago

Nice, simple and useful introduction to CRF, thanks a lot.


Sagara • 3 months ago

Thanks..!!


Aya • 8 months ago

Very useful, thank you!


Boyi Shafie • 11 months ago

Simple enough to get a basic understanding of CRF. Thank You.


Manisha verma • a year ago

Simply awesome!!! I haven't come across a simpler explanation, thanks a ton.


Hossein Hadian • a year ago

That was a great tutorial. Thank you so much.


Ousanee S • a year ago

Thank you so much. I'm going to present about this topic soon.


Futurecrew • 2 years ago

Amazing!! I like this post


Buy Essays • 2 years ago

Terrific material. Thanks to you for the fascinating debate. I like the points reviewed.


Non-Guest • 2 years ago

Great post! 1 typo: it should be p(w_i|l_i), not p(w_i|l_{i-1}), in the HMM emission probability.


Edwin • 2 years ago (in reply to Non-Guest)

Good catch, thanks!


Chris • 2 years ago

Awesome post. I have a question. In 'Features to Probabilities', you make the score into probabilities by exponentiating and normalizing. What is the purpose of exponentiating everything? Wouldn't simple normalization suffice?


rrenaud • 2 years ago (in reply to Chris)

Exponentiating the scores has the nice property that the ratio of probability that you assign to score1 and score2 depends only on their ratio, and this ratio is independent of all other scores. Further, it gracefully handles negative scores.

Consider score1 = 1, score2 = 2, score3 = -5.

How do you turn these scores into a probability? You could shift each score by some number X - say, subtract the minimum.

Then, score1' = 6, score2' = 7, and score3' = 0. Now the ratio between score1' and score2' is 6/7. If score3 had initially been 0, the ratio would have been 1/2, which is considerably different.


Edwin Chen • 2 years ago (in reply to Chris)

+1 to what rrenaud said. (Robert, what do you mean by 'the ratio of probability that you assign to score1 and score2 depends only on their ratio', though?)

Before the transformation, the weights and features can be any real-valued numbers (i.e., from negative infinity to positive infinity). So first we make them positive (exponentiating is one way of doing this), and then we normalize to restrict them between 0 and 1 (thereby getting probabilities).


medcl • 2 years ago

great post!


Edwin Chen • 2 years ago (in reply to medcl)

Thanks!


Joni • 2 years ago

There are a lot of confusing elements in this post.
1. How do the feature functions come to be?
2. Even the computation of a single p(l|s) is a hard computational task, because the summation over the l' in the denominator has k^m elements.
3. That is why the gradient step you wrote there is not feasible.
4. Also, what is that w_j in that step?
5. Saying that CRFs are better than HMMs because they have arbitrary weights is weird. I can represent my probabilities in the HMM in log scale too and boom - now I increased the domain of my parameters. Also, why does it make one model better than the other? So I think this point is not relevant.
6. In my understanding, what you described is not a linear-chain CRF. You are confusing two aspects of the problem: the model and the objective. You can have two different objectives for the same model - the generative objective p(l,s) and the conditional objective p(l|s). That is the only difference between logistic regression and Naive Bayes, and that is the only difference between an HMM and a linear-chain CRF. What you described is (to my understanding) not a linear-chain CRF, because the factors you used for the CRF do not exist in the HMM.


Edwin Chen • 2 years ago (in reply to Joni)

Sorry, I was trying to keep this post both 1) understandable and easy-to-read, yet while being 2) not entirely devoid of content, and probably made the wrong tradeoffs in deciding what to skip over.

1. Yeah, this is something I didn't detail. Obviously, one way is to just define all the features yourself (e.g., throw all binary indicator features of the form "1 if this label is foo and this word is bar" into the model, or look at examples to figure out what features to use). Other ways, I guess, could include the feature induction stuff that McCallum works on.

2, 3. Sorry, I skipped over this. You can avoid the exponential summation by computing this like a matrix multiplication instead. I'll try adding a note.

4. Not sure what step you're referring to, but probably "word i" (using the POS tagging example where the visible states are words).

5. It's not so much that each individual weight is arbitrary, but rather that the weights together are unrestricted. In HMMs, each weight is explicitly a probability, and they have to satisfy probabilistic constraints. (Yes, you can represent your probabilities in a different scale, but the constraints are still there, just in a transformed version.)

6. Not sure what you mean. Yes, in some sense that's the only difference between HMMs and CRFs, but now that you don't have to explicitly include p(s) in your model, don't you want to add features that you couldn't before? I guess you can imagine what I described as an extension of a super basic linear-chain CRF in which labels depend only on the current word (and not the entire sentence), though I think this is still commonly called a linear-chain CRF.
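The matrix-multiplication trick mentioned in point 2, 3 isn't spelled out in the post; as an illustrative sketch (my code, with the same hypothetical feature-function signature as above), the normalizing constant of a linear-chain CRF can be folded up one position at a time:

```python
import numpy as np

def partition_function(sentence, feature_functions, weights, label_set):
    """Z(s) = sum over all labelings of exp(score(l|s)), without enumerating labelings."""
    labels = list(label_set)
    n = len(sentence)

    def potential(i, prev, cur):
        # exp of the weighted feature total at position i for (previous label, current label).
        return np.exp(sum(w * f(sentence, i, cur, prev)
                          for f, w in zip(feature_functions, weights)))

    # alpha[y] = total exp-score of all partial labelings of words 0..i ending in label y.
    alpha = np.array([potential(0, None, y) for y in labels])
    for i in range(1, n):
        M = np.array([[potential(i, prev, cur) for cur in labels] for prev in labels])
        alpha = alpha @ M  # one matrix multiplication per position

    return float(alpha.sum())
```

Each step is a length-k vector times a k×k matrix, so the whole sum takes O(n·k²) products rather than k^n terms.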


Pio • a month ago

Clean and simple explanation without unnecessary math.

