Conditional Random Fields
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• For example, POS tagging:
The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Another example, partial parsing (aka chunking):
The/B-NP cat/I-NP sat/B-VP on/B-PP the/B-NP mat/I-NP
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Another example, relation extraction:
The/B-Arg cat/I-Arg sat/B-Rel on/I-Rel the/B-Arg mat/I-Arg
The CRF Equation
• A CRF model consists of:
  – F = <f1, …, fk>, a vector of “feature functions”
  – θ = <θ1, …, θk>, a vector of weights for each feature function.
• Let O = <o1, …, oT> be an observed sentence.
• Let A = <a1, …, aT> be the latent variables.
$$P(A = y \mid O) = \frac{\exp\left(\theta \cdot F(y, O)\right)}{\sum_{y'} \exp\left(\theta \cdot F(y', O)\right)}$$

• This is the same as the Maximum Entropy equation!
• Note that the denominator depends on O, but not on y (it marginalizes over all possible label sequences y′).
CRF Equation, standard format

• Typically, we write

$$P(A = y \mid O) = \frac{1}{Z(O)} \exp\left(\theta \cdot F(y, O)\right)$$

where

$$Z(O) = \sum_{y'} \exp\left(\theta \cdot F(y', O)\right)$$
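To make Z(O) concrete, here is a brute-force sketch (my addition, not from the slides) that computes P(A = y | O) for a toy problem by enumerating every label sequence; the two feature functions and their weights are invented for illustration.

```python
import itertools
import math

# Toy setup: two labels, a three-word sentence, and two hypothetical
# feature functions with hand-picked weights (all illustrative only).
LABELS = ["N", "V"]
O = ["the", "cat", "sat"]

def F(y, O):
    """Global feature vector: counts of two illustrative features."""
    f1 = sum(1 for i in range(len(y)) if y[i] == "N")        # count of N labels
    f2 = sum(1 for i in range(len(y) - 1)                    # count of N -> V transitions
             if y[i] == "N" and y[i + 1] == "V")
    return [f1, f2]

theta = [0.5, 1.2]  # hypothetical weights

def score(y, O):
    return sum(t * f for t, f in zip(theta, F(y, O)))

def prob(y, O):
    # Z(O): sum of exp(score) over ALL |LABELS|^T label sequences (brute force).
    Z = sum(math.exp(score(yp, O))
            for yp in itertools.product(LABELS, repeat=len(O)))
    return math.exp(score(y, O)) / Z

print(prob(("N", "N", "V"), O))  # probability of one particular labeling
```

The brute-force normalizer is exactly what makes naive computation infeasible for long sentences, which is the motivation for the dynamic programs below.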
Making Structured Predictions
Aside: Structured prediction vs. Text Classification
Recall: max. ent. for text classification:

$$\hat{c} = \arg\max_c P(c \mid doc) = \arg\max_c \frac{1}{Z(doc)} \exp\left(\theta \cdot F(c, doc)\right) = \arg\max_c \theta \cdot F(c, doc)$$

CRFs for sequence labeling:

$$\hat{y} = \arg\max_y P(A = y \mid O) = \arg\max_y \frac{1}{Z(O)} \exp\left(\theta \cdot F(y, O)\right) = \arg\max_y \theta \cdot F(y, O)$$

(In both cases, Z does not depend on the variable being maximized over, and exp is monotonic, so the argmax reduces to maximizing the linear score.)

What’s the difference?
Aside: Structured prediction vs. Text Classification
Two (related) differences, both for the sake of efficiency:
1) Feature functions in CRFs are restricted to graph parts (described later).
2) We can’t do brute force to compute the argmax. Instead, we do Viterbi.
Finding the Best Sequence
Best sequence is

$$\hat{y} = \arg\max_y P(A = y \mid O) = \arg\max_y \frac{1}{Z(O)} \exp\left(\theta \cdot F(y, O)\right) = \arg\max_y \theta \cdot F(y, O)$$

Recall from the HMM discussion: if there are K possible states for each y_i variable, and N total y_i variables, then there are K^N possible settings for y (e.g., with K = 45 tags and N = 20 words, that is 45^20 ≈ 10^33 sequences). So brute force can’t find the best sequence. Instead, we resort to a Viterbi-like dynamic program.
Viterbi Algorithm
$$\delta_t(j) = \max_{y_1, \ldots, y_{t-1}} \theta \cdot F\left(\langle y_1, \ldots, y_{t-1}, y_t = j \rangle, \langle o_1, \ldots, o_t \rangle\right)$$

δ_t(j) is the score of the state sequence which maximizes the score of seeing the observations to time t−1, landing in state j at time t, and seeing the observation at time t.

[Trellis figure: states A1, …, At−1, At = j over observations o1, …, ot−1, ot, ot+1, …, oT]
Viterbi Algorithm
Compute the most likely state sequence by working backwards:

$$\hat{X}_T = \arg\max_i \delta_T(i)$$

$$\hat{X}_t = \psi_{t+1}(\hat{X}_{t+1})$$

$$P(\hat{X}) = \max_i \delta_T(i)$$

[Trellis figure: x1, …, xt−1, xt, xt+1, …, xT]
Viterbi Algorithm
Recursive Computation (the HMM version):

$$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

$$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

[Trellis figure: states A1, …, At−1, At = j, At+1 over observations o1, …, oT]

But the CRF quantity we want is

$$\delta_t(j) = \max_{y_1, \ldots, y_{t-1}} \theta \cdot F\left(\langle y_1, \ldots, y_{t-1}, y_t = j \rangle, \langle o_1, \ldots, o_t \rangle\right)$$

??! A CRF has no transition probabilities a_ij or emission probabilities b_jo to plug in, so the HMM recursion does not directly apply. The fix, below, is to restrict feature functions to graph parts.
Feature functions and Graph parts
To make efficient computation (dynamic programs) possible, we restrict the feature functions to:
Graph parts (or just parts): A feature function that counts how often a particular configuration occurs for a clique in the CRF graph.
Clique: a set of completely connected nodes in a graph. That is, each node in the clique has an edge connecting it to every other node in the clique.
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: linear-chain CRF with label nodes 1–6 over observations x1–x6]
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: the same chain with the individual-node cliques highlighted]
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: the same chain with the pair-of-consecutive-node cliques highlighted]
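A tiny sketch (added for illustration, not from the slides) that enumerates these cliques for a chain of length 6:

```python
# Cliques of a linear-chain CRF over labels y1..y6:
# every individual node, plus every pair of consecutive nodes.
n = 6
node_cliques = [(i,) for i in range(1, n + 1)]
pair_cliques = [(i, i + 1) for i in range(1, n)]
print(node_cliques)  # [(1,), (2,), ..., (6,)]
print(pair_cliques)  # [(1, 2), (2, 3), ..., (5, 6)]
```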
Clique Example
For non-linear-chain CRFs (something we won’t normally consider in this class), you can get larger cliques:
[Figure: a non-linear CRF over x1–x6 with an extra node 5′, forming larger cliques]
Graph part as Feature Function Example
Graph parts are feature functions p(y,x) that count how many cliques have a particular configuration.
For example, p(y,x) = count of [yi = Noun].
Here, y2 and y6 are both Nouns, so p(y,x) = 2.
[Figure: linear-chain CRF labeled y1=D, y2=N, y3=V, y4=D, y5=A, y6=N over x1–x6]
Graph part as Feature Function Example
For a pair-of-nodes example, p(y,x) = count of [yi = Noun, yi+1 = Verb].
Here, y2 is a Noun and y3 is a Verb, so p(y,x) = 1.
[Figure: linear-chain CRF labeled y1=D, y2=N, y3=V, y4=D, y5=A, y6=N over x1–x6]
Features can depend on the whole observation
In a CRF, each feature function can depend on x, in addition to a clique in y
Normally, we draw a CRF like this:
[Figure: an HMM and a CRF, each drawn over label nodes 1–6 and observations x1–x6]
Features can depend on the whole observation
In a CRF, each feature function can depend on x, in addition to a clique in y
But really, it’s more like this:
This would cause problems for a generative model, but in a conditional model, x is always a fixed constant. So we can still run the relevant algorithms, like Viterbi, efficiently.
[Figure: the HMM and CRF again, now drawn with every label node connected to the whole observation sequence x]
Graph part as Feature Function Example
An example part including x: p(y,x) = count of [yi = A or D, yi+1 = N, x2 = cat].
Here, y1 is a D and y2 is an N, plus y5 is an A and y6 is an N, and x2 = cat, so p(y,x) = 2.
Notice that the clique y5–y6 is allowed to depend on x2.
[Figure: y1=D The, y2=N cat, y3=V chased, y4=D the, y5=A tiny, y6=N fly]
Graph part as Feature Function Example
A more usual example including x: p(y,x) = count of [yi = A or D, yi+1 = N, xi+1 = cat].
Here, y1 is a D, y2 is an N, and x2 = cat, so p(y,x) = 1.
[Figure: y1=D The, y2=N cat, y3=V chased, y4=D the, y5=A tiny, y6=N fly]
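As a concrete sketch (added here; the part definitions mirror the slides’ examples, and the function names are mine), these counting feature functions might look like:

```python
# Sentence and labeling from the slides' running example.
x = ["The", "cat", "chased", "the", "tiny", "fly"]
y = ["D", "N", "V", "D", "A", "N"]

def p_node_noun(y, x):
    """Individual-node part: count of [y_i = N]."""
    return sum(1 for yi in y if yi == "N")

def p_pair_noun_verb(y, x):
    """Pair-of-nodes part: count of [y_i = N, y_{i+1} = V]."""
    return sum(1 for i in range(len(y) - 1)
               if y[i] == "N" and y[i + 1] == "V")

def p_pair_with_x(y, x):
    """Pair part that also looks at x:
    count of [y_i in {A, D}, y_{i+1} = N, x_{i+1} = 'cat']."""
    return sum(1 for i in range(len(y) - 1)
               if y[i] in ("A", "D") and y[i + 1] == "N" and x[i + 1] == "cat")

print(p_node_noun(y, x))       # 2  (y2 and y6)
print(p_pair_noun_verb(y, x))  # 1  (y2-y3)
print(p_pair_with_x(y, x))     # 1  (y1-y2, since x2 = cat)
```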
The CRF Equation, with Parts
• A CRF model consists of:
  – P = <p1, …, pk>, a vector of parts
  – θ = <θ1, …, θk>, a vector of weights for each part.
• Let O = <o1, …, oT> be an observed sentence.
• Let A = <a1, …, aT> be the latent variables.

$$P(A = y \mid O) = \frac{1}{Z(O)} \exp\left(\theta \cdot P(y, O)\right)$$

(Here the P inside the exp denotes the vector of parts, not a probability.)
Viterbi Algorithm – 2nd Try
$$\delta_t(j) = \max_{y_1, \ldots, y_{t-1}} \theta \cdot P\left(\langle y_1, \ldots, y_{t-1}, y_t = j \rangle, o\right)$$

Recursive Computation:

$$\delta_j(t+1) = \max_i \left[ \delta_i(t) + \theta_{one} \cdot p_{one}(y_{t+1} = j, o) + \theta_{pair} \cdot p_{pair}(y_t = i, y_{t+1} = j, o) \right]$$

$$\psi_j(t+1) = \arg\max_i \left[ \delta_i(t) + \theta_{one} \cdot p_{one}(y_{t+1} = j, o) + \theta_{pair} \cdot p_{pair}(y_t = i, y_{t+1} = j, o) \right]$$

Here θ_one · p_one sums the weighted individual-node parts that fire for y_{t+1} = j, and θ_pair · p_pair sums the weighted pair-of-node parts for the transition from y_t = i to y_{t+1} = j.

[Trellis figure: states A1, …, At−1, At = j, At+1 over observations o1, …, oT]
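A minimal runnable sketch of this recursion (my own illustration; node_score and pair_score are hypothetical stand-ins for θ_one · p_one and θ_pair · p_pair):

```python
import numpy as np

def viterbi(n_states, T, node_score, pair_score):
    """Viterbi for a linear-chain CRF.
    node_score(t, j): score of label j at position t (theta_one . p_one).
    pair_score(t, i, j): score of label i at t followed by j at t+1
                         (theta_pair . p_pair).
    Returns the highest-scoring label sequence."""
    delta = np.full((T, n_states), -np.inf)
    psi = np.zeros((T, n_states), dtype=int)   # backpointers
    delta[0] = [node_score(0, j) for j in range(n_states)]
    for t in range(T - 1):
        for j in range(n_states):
            scores = [delta[t, i] + pair_score(t, i, j) + node_score(t + 1, j)
                      for i in range(n_states)]
            psi[t + 1, j] = int(np.argmax(scores))
            delta[t + 1, j] = max(scores)
    # Backtrace from the best final state.
    y = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        y.append(int(psi[t, y[-1]]))
    return y[::-1]

# Toy usage with arbitrary illustrative scores:
print(viterbi(3, 4,
              node_score=lambda t, j: float((t + j) % 3),
              pair_score=lambda t, i, j: 0.5 if i != j else 0.0))
```

Because x (here folded into the score functions) is a fixed constant, both score tables can be precomputed once per sentence, keeping the whole procedure linear in sentence length.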
Supervised Parameter Estimation
Conditional Training
• Given a set of observations o and the correct labels y for each, determine the best θ:

$$\theta^* = \arg\max_\theta P(y \mid o, \theta)$$

• Because the CRF equation is just a special form of the maximum entropy equation, we can train it exactly the same way:
  – Determine the gradient.
  – Step in the direction of the gradient.
  – Repeat until convergence.
Recall: Training a ME model
Training is an optimization problem: find the value for λ that maximizes the conditional log-likelihood of the training data:

$$CLL(Train) = \sum_{(c,d) \in Train} \log P(c \mid d) = \sum_{(c,d) \in Train} \left[ \sum_i \lambda_i f_i(c, d) - \log Z(d) \right]$$
Recall: Training a ME model
Optimization is normally performed using some form of gradient ascent (equivalently, gradient descent on −CLL):
0) Initialize λ^0 to 0.
1) Compute the gradient: ∇CLL.
2) Take a step in the direction of the gradient: λ^{i+1} = λ^i + α ∇CLL.
3) Repeat until CLL doesn’t improve: stop when |CLL(λ^{i+1}) − CLL(λ^i)| < ε.
Recall: Training a ME model
Computing the gradient:

$$\frac{\partial}{\partial \lambda_i} CLL(Train) = \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in Train} \left[ \sum_j \lambda_j f_j(c, d) - \log Z(d) \right]$$

$$= \sum_{(c,d) \in Train} \left[ f_i(c, d) - \frac{\partial}{\partial \lambda_i} \log \sum_{c'} \exp\left( \sum_j \lambda_j f_j(c', d) \right) \right]$$

$$= \sum_{(c,d) \in Train} \left[ f_i(c, d) - \frac{\sum_{c'} f_i(c', d) \exp\left( \sum_j \lambda_j f_j(c', d) \right)}{\sum_{c'} \exp\left( \sum_j \lambda_j f_j(c', d) \right)} \right]$$

$$= \sum_{(c,d) \in Train} \left[ f_i(c, d) - \mathrm{E}_P\left[ f_i(c, d) \right] \right]$$

That is, the gradient is the empirical count of each feature minus its expected count under the model.
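To ground the derivation, here is a minimal NumPy sketch (added; the featurized data is synthetic and all dimensions are made up) of this gradient-ascent loop for a toy maximum entropy classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_feats, n_docs = 3, 5, 20

# Hypothetical featurized training data: f[c, d] is the feature vector f(c, d).
f = rng.random((n_classes, n_docs, n_feats))
labels = rng.integers(0, n_classes, n_docs)  # correct class for each doc

lam = np.zeros(n_feats)
alpha, eps = 0.1, 1e-6

def cll_and_grad(lam):
    scores = f @ lam                            # shape (n_classes, n_docs)
    logZ = np.logaddexp.reduce(scores, axis=0)  # log Z(d) per doc
    p = np.exp(scores - logZ)                   # P(c | d)
    cll = np.sum(scores[labels, np.arange(n_docs)] - logZ)
    # Gradient: empirical feature counts minus expected feature counts.
    empirical = f[labels, np.arange(n_docs)].sum(axis=0)
    expected = np.einsum("cd,cdk->k", p, f)
    return cll, empirical - expected

prev = -np.inf
for _ in range(10_000):               # cap iterations for safety
    cll, grad = cll_and_grad(lam)
    if abs(cll - prev) < eps:         # stop when CLL no longer improves
        break
    prev = cll
    lam += alpha * grad               # step in the direction of the gradient
print(lam)
```

For a CRF, the same empirical-minus-expected structure holds; the hard part, as the next slide notes, is computing the expected counts.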
The hard part for CRFs
Training a CRF: Expected feature counts
• … (sorry, ran out of time)
• (The missing piece: expected part counts require marginal probabilities of cliques under the model, which for linear chains can be computed with a forward–backward style dynamic program.)
CRFs vs. HMMs
Generative (Joint Probability) Models
• HMMs are generative models: that is, they can compute the joint probability P(sentence, hidden-states).
• From a generative model, one can compute:
  – Conditional models P(sentence | hidden-states) and P(hidden-states | sentence)
  – Marginal models P(sentence) and P(hidden-states)
• For sequence labeling, we want P(hidden-states | sentence).
Discriminative (Conditional) Models
• Most often, people are interested in the conditional probability P(hidden-states | sentence). For example, this is the distribution needed for sequence labeling.
• Discriminative (also called conditional) models directly represent the conditional distribution P(hidden-states | sentence).
  – These models cannot tell you the joint distribution, marginals, or other conditionals.
  – But they’re quite good at this particular conditional distribution.
Discriminative vs. Generative

• Marginal, or language model, P(sentence):
  – HMM (generative): forward algorithm or backward algorithm, linear in length of sentence.
  – CRF (discriminative): can’t do it.
• Find optimal label sequence:
  – HMM: Viterbi, linear in length of sentence.
  – CRF: Viterbi, linear in length of sentence.
• Supervised parameter estimation:
  – HMM: Bayesian learning, easy and fast.
  – CRF: convex optimization, can be quite slow.
• Unsupervised parameter estimation:
  – HMM: Baum-Welch (non-convex optimization), slow but doable.
  – CRF: very difficult, and requires making extra assumptions.
• Feature functions:
  – HMM: parents and children in the graph. Restrictive!
  – CRF: arbitrary functions of a latent state and any portion of the observed nodes.
CRFs vs. HMMs, a closer look
It’s possible to convert an HMM into a CRF:
• Set p_prior,state(y,x) = count[y1 = state]; set θ_prior,state = log P_HMM(y1 = state) = log π_state
• Set p_trans,state1,state2(y,x) = count[yi = state1, yi+1 = state2]; set θ_trans,state1,state2 = log P_HMM(yi+1 = state2 | yi = state1) = log A_state1,state2
• Set p_obs,state,word(y,x) = count[yi = state, xi = word]; set θ_obs,state,word = log P_HMM(xi = word | yi = state) = log B_state,word
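A small sketch of this conversion (added for illustration; π, A, and B follow the slides’ HMM notation):

```python
import numpy as np

def hmm_to_crf_weights(pi, A, B):
    """Convert HMM parameters to CRF weights for the corresponding parts.
    pi[s]: prior probability of starting in state s     -> theta_prior[s]
    A[s1, s2]: transition probability s1 -> s2          -> theta_trans[s1, s2]
    B[s, w]: emission probability of word w in state s  -> theta_obs[s, w]
    Each CRF weight is just the log of the HMM probability."""
    return np.log(pi), np.log(A), np.log(B)

# Toy 2-state, 3-word HMM (rows sum to 1):
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
theta_prior, theta_trans, theta_obs = hmm_to_crf_weights(pi, A, B)
print(theta_prior)  # all weights are logs of probabilities, hence <= 0
```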
CRF vs. HMM, a closer look
If we convert an HMM to a CRF, all of the CRF parameters θ will be logs of probabilities. Therefore, they will all be between −∞ and 0.
Notice: CRF parameters in general can be anywhere between −∞ and +∞.
So, how do HMMs and CRFs compare in terms of bias and variance (as sequence labelers)?– HMMs have more bias– CRFs have more variance
Comparing feature functions
The biggest advantage of CRFs over HMMs is that they can handle overlapping features.
For example, for POS tagging, using words as features (like xi = “the” or xi = “jogging”) is quite useful.
However, it’s often also useful to use “orthographic” features, like “the word ends in –ing” or “the word starts with a capital letter.”
These features overlap: the word “jogging”, for instance, fires both the word feature xi = “jogging” and the suffix feature “ends in –ing”.
• Generative models have to include parameters in the model for predicting when features will overlap.
• Discriminative models don’t: they can simply use the features.
CRF Example
A CRF POS Tagger for English
Vocabulary
We need to determine the set of possible word types V. Let
V = {all types in 1 million tokens of Wall Street Journal text, which we’ll use for training} ∪ {UNKNOWN}
where UNKNOWN stands in for word types we haven’t seen.
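A sketch of how V might be built (added for illustration; the function names are mine):

```python
def build_vocab(tokens):
    """V = all word types in the training tokens, plus UNKNOWN for unseen types."""
    return set(tokens) | {"UNKNOWN"}

def lookup(word, vocab):
    """Map any word to a type in V."""
    return word if word in vocab else "UNKNOWN"

vocab = build_vocab(["the", "cat", "sat", "on", "the", "mat"])
print(lookup("dog", vocab))  # -> UNKNOWN
```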
L = Label Set
Standard Penn Treebank tagset (Number. Tag — Description):
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
CRF Features (Feature Type — Description):
• Prior: for each label k — yi = k
• Transition: for each pair k, k′ — yi = k and yi+1 = k′
• Word: for each k, w — yi = k and xi = w
  for each k, w — yi = k and xi−1 = w
  for each k, w — yi = k and xi+1 = w
  for each k, w, w′ — yi = k and xi = w and xi−1 = w′
  for each k, w, w′ — yi = k and xi = w and xi+1 = w′
• Orthography, suffix: for each s in {“ing”, “ed”, “ogy”, “s”, “ly”, “ion”, “tion”, “ity”, …} and each k — yi = k and xi ends with s
• Orthography, punctuation/shape: for each k —
  yi = k and xi is capitalized
  yi = k and xi is hyphenated
  yi = k and xi contains a period
  yi = k and xi is ALL CAPS
  yi = k and xi contains a digit (0–9)
  …
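A sketch of how these feature templates might be instantiated in code (added; the template names and function signature are invented):

```python
import re

def pos_features(x, i, yi, yprev=None):
    """Instantiate the feature templates above for position i.
    Returns the names of the (binary) features that fire."""
    w = x[i]
    feats = [f"prior:{yi}",
             f"word:{yi}:{w}",
             f"word+1:{yi}:{x[i+1]}" if i + 1 < len(x) else None,
             f"word-1:{yi}:{x[i-1]}" if i > 0 else None]
    if yprev is not None:
        feats.append(f"trans:{yprev}:{yi}")           # transition template
    for s in ("ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity"):
        if w.endswith(s):
            feats.append(f"suffix:{yi}:{s}")          # orthographic suffix
    if w[0].isupper():
        feats.append(f"cap:{yi}")                     # capitalized
    if "-" in w:
        feats.append(f"hyphen:{yi}")                  # hyphenated
    if w.isupper():
        feats.append(f"allcaps:{yi}")                 # ALL CAPS
    if re.search(r"[0-9]", w):
        feats.append(f"digit:{yi}")                   # contains a digit
    return [f for f in feats if f is not None]

print(pos_features(["The", "cat", "sat"], 1, "NN", yprev="DT"))
```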