Statistical Natural Language Parsing
Mausam
(Based on slides of Michael Collins, Dan Jurafsky, Dan Klein, Chris Manning, Ray Mooney, Luke Zettlemoyer)
Two views of linguistic structure: 1. Constituency (phrase structure)
• Phrase structure organizes words into nested constituents.
• How do we know what is a constituent? (Not that linguists don't argue about some cases.)
  • Distribution: a constituent behaves as a unit that can appear in different places:
    • John talked [to the children] [about drugs]
    • John talked [about drugs] [to the children]
    • *John talked drugs to the children about
  • Substitution/expansion/pro-forms:
    • I sat [on the box / right on top of the box / there]
  • Coordination, regular internal structure, no intrusion, fragments, semantics, …
Two views of linguistic structure: 2. Dependency structure
• Dependency structure shows which words depend on (modify or are arguments of) which other words
The boy put the tortoise on the rug
[Dependency tree: "put" is the head, with dependents "boy" (determiner "The"), "tortoise" (determiner "the"), and "on", which heads "rug" (determiner "the")]
Why Parse?
• Part-of-speech information
• Phrase information
• Useful relationships
8
The rise of annotated data
The Penn Treebank
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by)
              (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in)
              (NP (DT that) (NN market)))))))
    (. .)))
[Marcus et al. 1993, Computational Linguistics]
Penn Treebank Non-terminals
The rise of annotated data
• Starting off, building a treebank seems a lot slower and less useful than building a grammar
• But a treebank gives us many things:
  • Reusability of the labor
    • Many parsers, POS taggers, etc.
    • Valuable resource for linguistics
  • Broad coverage
  • Frequencies and distributional information
  • A way to evaluate systems
Statistical parsing applications
Statistical parsers are now robust and widely used in larger NLP applications
• High precision question answering [Pasca and Harabagiu, SIGIR 2001]
• Improving biological named entity finding [Finkel et al., JNLPBA 2004]
• Syntactically based sentence compression [Lin and Wilbur, 2007]
• Extracting opinions about products [Bloom et al., NAACL 2007]
• Improved interaction in computer games [Gorniak and Roy, 2005]
• Helping linguists find data [Resnik et al., BLS 2005]
• Source sentence analysis for machine translation [Xu et al., 2009]
• Relation extraction systems [Fundel et al., Bioinformatics 2006]
Example Application: Machine Translation
• The boy put the tortoise on the rug
• लड़के ने रखा कछुआ ऊपर कालीन ("the boy put the tortoise on the rug", word for word)
• SVO vs. SOV; preposition vs. post-position
[Paired phrase-structure trees for the two sentences: S → NP VP PP, with NP → DT NN ("the boy"), VP → V NP ("put the tortoise"), PP → IN NP ("on the rug"); successive animation frames reorder the English constituents into the Hindi (SOV, post-positional) order and substitute the Hindi words]
Pre-1990 ("Classical") NLP Parsing
• Goes back to Chomsky's PhD thesis in the 1950s
• Wrote symbolic grammar (CFG or often richer) and lexicon:
  S → NP VP          NN → interest
  NP → (DT) NN       NNS → rates
  NP → NN NNS        NNS → raises
  NP → NNP           VBP → interest
  VP → V NP          VBZ → rates
• Used grammar/proof systems to prove parses from words
• This scaled very badly and didn't give coverage. For the sentence
  "Fed raises interest rates 0.5% in effort to control inflation":
  • Minimal grammar: 36 parses
  • Simple 10-rule grammar: 592 parses
  • Real-size broad-coverage grammar: millions of parses
Classical NLP Parsing: The problem and its solution
• Categorical constraints can be added to grammars to limit unlikely/weird parses for sentences
  • But the attempt makes the grammars not robust: in traditional systems, commonly 30% of sentences in even an edited text would have no parse
• A less constrained grammar can parse more sentences
  • But simple sentences end up with ever more parses, with no way to choose between them
• We need mechanisms that allow us to find the most likely parse(s) for a sentence
  • Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences, but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
• A context-free grammar G in NLP = (N, C, Σ, S, L, R)
  • Σ is a set of terminal symbols
  • C is a set of preterminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • L is the lexicon, a set of items of the form X → x
    • where X ∈ C and x ∈ Σ
  • R is the grammar, a set of items of the form X → γ
    • where X ∈ N and γ ∈ (N ∪ C)*
• By usual convention, S is the start symbol, but in statistical NLP we usually have an extra node at the top (ROOT, TOP)
• We usually write e for an empty sequence, rather than nothing
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
• A key parsing decision is how we 'attach' various constituents
  • PPs, adverbial or participial phrases, infinitives, coordinations, etc.
• Catalan numbers: Cn = (2n)! / [(n+1)! n!]
  • An exponentially growing series, which arises in many tree-like contexts
  • E.g., the number of possible triangulations of a polygon with n+2 sides
  • Turns up in triangulation of probabilistic graphical models…
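As a quick check on this growth (an added illustration, not from the original slides), a few lines of Python compute Cn directly from the formula above:

from math import comb

def catalan(n: int) -> int:
    # C_n = (2n)! / ((n+1)! n!) = comb(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

for n in range(1, 9):
    print(n, catalan(n))
# prints: 1 1, 2 2, 3 5, 4 14, 5 42, 6 132, 7 429, 8 1430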
Attachments
• I cleaned the dishes from dinner
• I cleaned the dishes with detergent
• I cleaned the dishes in my pajamas
• I cleaned the dishes in the sink
Syntactic Ambiguities I
• Prepositional phrases: They cooked the beans in the pot on the stove with handles
• Particle vs. preposition: The lady dressed up the staircase
• Complement structures: The tourists objected to the guide that they couldn't hear / She knows you like the back of her hand
• Gerund vs. participial adjective: Visiting relatives can be boring / Changing schedules frequently confused passengers
Syntactic Ambiguities II
• Modifier scope within NPs: impractical design requirements / plastic cup holder
• Multiple gap constructions: The chicken is ready to eat / The contractors are rich enough to sue
• Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
• Dislocation / gapping:
  • Which book should Peter buy?
  • A debate arose which continued until the election
• Binding:
  • Reference: The IRS audits itself
• Control:
  • I want to go
  • I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing: Two problems to solve
1. Repeated work…
2. Choosing the correct parse
• How do we work out the correct attachment?
  • She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but…
• Words are good predictors of attachment, even absent full understanding:
  • Moscow sent more than 100,000 soldiers into Afghanistan…
  • Sydney Water breached an agreement with NSW Health…
• Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic – or stochastic – context-free grammars (PCFGs)
• G = (Σ, N, S, R, P)
  • Σ is a set of terminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • R is a set of rules/productions of the form X → γ
  • P is a probability function
    • P: R → [0,1]
    • such that for all X ∈ N: ∑_{X→γ ∈ R} P(X → γ) = 1
• A grammar G generates a language model L:
  ∑_{γ ∈ Σ*} P(γ) = 1
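As an added illustration, such a grammar might be represented in Python as a map from rules to probabilities, with a check that the expansions of each nonterminal sum to 1 (the rules are those of the PCFG example that follows; the lexicon is omitted for brevity):

from collections import defaultdict

rules = {
    ("S",  ("NP", "VP")): 1.0,
    ("VP", ("Vi",)):      0.4,
    ("VP", ("Vt", "NP")): 0.4,
    ("VP", ("VP", "PP")): 0.2,
    ("NP", ("DT", "NN")): 0.3,
    ("NP", ("NP", "PP")): 0.7,
    ("PP", ("P", "NP")):  1.0,
}

def check_normalized(rules, tol=1e-9):
    # For every left-hand side X, the sum of P(X -> gamma) must be 1.
    totals = defaultdict(float)
    for (lhs, _), p in rules.items():
        totals[lhs] += p
    return all(abs(t - 1.0) < tol for t in totals.values())

assert check_normalized(rules)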
PCFG Example: A Probabilistic Context-Free Grammar (PCFG)

S → NP VP      1.0        Vi → sleeps     1.0
VP → Vi        0.4        Vt → saw        1.0
VP → Vt NP     0.4        NN → man        0.7
VP → VP PP     0.2        NN → woman      0.2
NP → DT NN     0.3        NN → telescope  0.1
NP → NP PP     0.7        DT → the        1.0
PP → P NP      1.0        IN → with       0.5
                          IN → in         0.5

• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is
  p(t) = ∏_{i=1}^{n} q(αi → βi)
  where q(α → β) is the probability for rule α → β
Example of a PCFG
48
Probability of a Parse (using the same PCFG and tree-probability formula as above)
t1 = the parse of "The man sleeps":
  [S [NP [DT The] [NN man]] [VP [Vi sleeps]]]
p(t1) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084

t2 = the parse of "The man saw the woman with the telescope" (PP attached to the VP):
  [S [NP [DT The] [NN man]]
     [VP [VP [Vt saw] [NP [DT the] [NN woman]]]
         [PP [IN with] [NP [DT the] [NN telescope]]]]]
p(t2) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1 ≈ 1.5 × 10^-5
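A small added sketch of this product, using a hypothetical (label, children...) tuple encoding of t1 (the encoding is an assumption for illustration, not the slides' notation):

import math

t1 = ("S",
      ("NP", ("DT", "The"), ("NN", "man")),
      ("VP", ("Vi", "sleeps")))

q = {  # rule probabilities from the grammar above; the word is lowercased for lookup
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.3,
    ("VP", ("Vi",)): 0.4,
    ("DT", ("the",)): 1.0,
    ("NN", ("man",)): 0.7,
    ("Vi", ("sleeps",)): 1.0,
}

def tree_prob(node):
    label, *children = node
    if isinstance(children[0], str):          # preterminal: lexical rule X -> w
        return q[(label, (children[0].lower(),))]
    rule = (label, tuple(c[0] for c in children))
    return q[rule] * math.prod(tree_prob(c) for c in children)

print(tree_prob(t1))   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084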
PCFGs: Learning and Inference
• Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1}^{n} q(αi → βi)
• Learning: read the rules off of labeled sentences, use maximum-likelihood estimates for the probabilities, and use all of our standard smoothing tricks
• Inference: for an input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s)
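A minimal sketch of the learning step, assuming the same tuple tree encoding as in the snippet above; smoothing, which the slide calls for, is omitted here:

from collections import Counter

def mle_rule_probs(trees):
    # Maximum-likelihood estimate: q(alpha -> beta) = count(alpha -> beta) / count(alpha)
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        label, *children = node
        if isinstance(children[0], str):               # lexical rule X -> w
            rhs = (children[0],)
        else:
            rhs = tuple(c[0] for c in children)
            for c in children:
                visit(c)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
    for t in trees:
        visit(t)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}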
Grammar Transforms
51
Chomsky Normal Form
• All rules are of the form X → Y Z or X → w
  • X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
  • That is, it recognizes the same language
  • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules (n > 2) are divided by introducing new nonterminals
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (after removing empties)
S → NP VP
S → VP
VP → V NP
VP → V
VP → V NP PP
VP → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (removing the unary S → VP)
S → NP VP
VP → V NP
S → V NP
VP → V
S → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (removing the unary S → V)
S → NP VP
VP → V NP
S → V NP
VP → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
V → fish
S → fish
V → tanks
S → tanks
P → with
Chomsky Normal Form steps (removing the unary VP → V)
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
Chomsky Normal Form steps (removing the remaining unaries NP → NP, NP → PP, NP → N, PP → P)
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form steps (binarizing the ternary rules)
S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
  • With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
  • In practice full Chomsky Normal Form is a pain
    • Reconstructing n-aries is easy
    • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
  • The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
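A possible sketch of the n-ary step, producing underscore-history symbols in the style of the VP_V seen in the following slides (the exact naming scheme here is illustrative):

def binarize(lhs, rhs):
    # Right-binarize an n-ary rule (n > 2) by introducing new nonterminals,
    # e.g. VP -> V NP PP becomes VP -> V VP_V and VP_V -> NP PP.
    new_rules = []
    left = lhs
    while len(rhs) > 2:
        new_sym = f"{left.split('_')[0]}_{rhs[0]}"
        new_rules.append((left, (rhs[0], new_sym)))
        left, rhs = new_sym, rhs[1:]
    new_rules.append((left, tuple(rhs)))
    return new_rules

print(binarize("VP", ("V", "NP", "PP")))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]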
An example: before binarization…
[Tree: ROOT → S; S → NP VP; NP → N (people); VP → V (fish) NP (N tanks) PP; PP → P (with) NP (N rods)]

After binarization…
[The same tree, but with VP → V VP_V and VP_V → NP PP]
Treebank: empties and unaries
[Trees for the one-word imperative "Atone", from most to least annotated:
 PTB Tree:    ROOT → S-HLN → (NP-SUBJ → -NONE- → e) (VP → VB → Atone)
 NoFuncTags:  ROOT → S → (NP → -NONE- → e) (VP → VB → Atone)
 NoEmpties:   ROOT → S → VP → VB → Atone
 NoUnaries:   high attachment ROOT → S → Atone, or low attachment ROOT → VB → Atone]
Parsing
66
Constituency Parsing
Sentence: fish people fish tanks

PCFG rules with probabilities:
Rule            Prob θi
S → NP VP       θ0
NP → NP NP      θ1
…
N → fish        θ42
N → people      θ43
V → fish        θ44
…

[Parse tree: S → NP VP, with NP → NP NP over "fish people" (each NP → N) and VP → V NP over "fish tanks"]
Cocke-Kasami-Younger (CKY) Constituency Parsing
Sentence: fish people fish tanks

Viterbi (max) scores, e.g. for the words "people fish":
  cell(people) = {NP 0.35, V 0.1, N 0.5}    cell(fish) = {VP 0.06, NP 0.14, V 0.6, N 0.2}

Grammar:
S → NP VP      0.9
S → VP         0.1
VP → V NP      0.5
VP → V         0.1
VP → V VP_V    0.3
VP → V PP      0.1
VP_V → NP PP   1.0
NP → NP NP     0.1
NP → NP PP     0.2
NP → N         0.7
PP → P NP      1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing that is cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s):
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries
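For concreteness, here is one runnable Python rendering of the pseudocode above (an added sketch, not the slides' code): it uses dictionaries instead of dense arrays, and the toy grammar from the next slide.

from collections import defaultdict

binary  = {("S", ("NP", "VP")): 0.9, ("VP", ("V", "NP")): 0.5,
           ("VP", ("V", "VP_V")): 0.3, ("VP", ("V", "PP")): 0.1,
           ("VP_V", ("NP", "PP")): 1.0, ("NP", ("NP", "NP")): 0.1,
           ("NP", ("NP", "PP")): 0.2, ("PP", ("P", "NP")): 1.0}
unary   = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def cky(words):
    score = defaultdict(float)       # (begin, end, label) -> best probability
    back  = {}                       # (begin, end, label) -> backpointer

    def handle_unaries(b, e):        # unary closure, as in the pseudocode
        added = True
        while added:
            added = False
            for (a, child), p in unary.items():
                prob = p * score[(b, e, child)]
                if prob > score[(b, e, a)]:
                    score[(b, e, a)], back[(b, e, a)] = prob, child
                    added = True

    for i, w in enumerate(words):    # lexical step
        for (tag, word), p in lexicon.items():
            if word == w:
                score[(i, i + 1, tag)] = p
        handle_unaries(i, i + 1)

    n = len(words)
    for span in range(2, n + 1):     # binary step over increasing span sizes
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (a, (b, c)), p in binary.items():
                    prob = score[(begin, split, b)] * score[(split, end, c)] * p
                    if prob > score[(begin, end, a)]:
                        score[(begin, end, a)] = prob
                        back[(begin, end, a)] = (split, b, c)
            handle_unaries(begin, end)
    return score, back

score, back = cky("fish people fish tanks".split())
print(score[(0, 4, "S")])            # 0.00018522, matching the worked chart below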
The grammar: binary, no epsilons
S → NP VP      0.9
S → VP         0.1
VP → V NP      0.5
VP → V         0.1
VP → V VP_V    0.3
VP → V PP      0.1
VP_V → NP PP   1.0
NP → NP NP     0.1
NP → NP PP     0.2
NP → N         0.7
PP → P NP      1.0
N → people     0.5
N → fish       0.2
N → tanks      0.2
N → rods       0.1
V → people     0.1
V → fish       0.6
V → tanks      0.3
P → with       1.0
[Chart layout: cells score[begin][end] for all spans 0 ≤ begin < end ≤ 4 over "fish people fish tanks"]

Filling the chart for "fish people fish tanks" (word positions 0–4), using the grammar above:

Lexical step (for i = 0; i < #(words); i++) fills the width-1 cells:
  [0,1] fish:   N 0.2, V 0.6
  [1,2] people: N 0.5, V 0.1
  [2,3] fish:   N 0.2, V 0.6
  [3,4] tanks:  N 0.2, V 0.3

Handling unaries (NP → N, VP → V, S → VP) in the width-1 cells adds:
  [0,1]: NP 0.14, VP 0.06, S 0.006
  [1,2]: NP 0.35, VP 0.01, S 0.001
  [2,3]: NP 0.14, VP 0.06, S 0.006
  [3,4]: NP 0.14, VP 0.03, S 0.003

Span 2 — binary rules, then unaries again:
  [0,2] fish people: NP 0.0049, VP 0.105, S (via NP VP) 0.00126; the unary S → VP then raises S to 0.0105
  [1,3] people fish: NP 0.0049, VP 0.007, S 0.0189
  [2,4] fish tanks:  NP 0.00196, VP 0.042, S (via NP VP) 0.00378; the unary S → VP raises S to 0.0042

Span 3:
  [0,3]: NP 0.0000686, VP 0.00147, S 0.000882
  [1,4]: NP 0.0000686, VP 0.0000098, S 0.01323

Span 4:
  [0,4]: NP 0.0000009604, VP 0.00002058, S 0.00018522

Call buildTree(score, back) to get the best parse.
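The slides do not spell out buildTree itself; a sketch consistent with the cky() code above:

def build_tree(back, words, begin, end, label):
    # Follow backpointers: none -> preterminal, a string -> unary, a triple -> binary.
    bp = back.get((begin, end, label))
    if bp is None:
        return (label, words[begin])
    if isinstance(bp, str):
        return (label, build_tree(back, words, begin, end, bp))
    split, b, c = bp
    return (label,
            build_tree(back, words, begin, split, b),
            build_tree(back, words, split, end, c))

words = "fish people fish tanks".split()
print(build_tree(back, words, 0, len(words), "S"))
# ('S', ('NP', ('NP', ('N', 'fish')), ('NP', ('N', 'people'))),
#       ('VP', ('V', 'fish'), ('NP', ('N', 'tanks'))))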
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
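These metrics are simple set comparisons; an added illustrative computation over (label, start, end) bracket triples:

def prf1(gold, guess):
    matched = len(gold & guess)
    p = matched / len(guess)               # labeled precision
    r = matched / len(gold)                # labeled recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold  = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9),  ("VP", 3, 9),
         ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9),  ("NP", 9, 10)}
guess = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
         ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(prf1(gold, guess))   # (0.4285…, 0.375, 0.4)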
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Gives a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
• Example: "John was believed to have been shot by Bill"
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability
94
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Diagram: in S → NP VP, the NP expands to DT NN independently of the S above it]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

Expansion     All NPs   NPs under S   NPs under VP
NP PP           11%          9%           23%
DT NN            9%          9%            7%
PRP              6%         21%            4%
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
[Figure caption: in the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing
99
Horizontal Markovization
• Horizontal Markovization merges states
[Two plots over horizontal Markov order {0, 1, 2v, 2, inf}: parsing accuracy (F1 roughly 70–74) and grammar size in symbols (up to ~12,000)]
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
• Order 1 vs. Order 2
[Two plots over vertical Markov order {1, 2v, 2, 3v, 3}: parsing accuracy (F1 roughly 72–79) and grammar size in symbols (up to ~25,000)]

Model      F1     Size
v=h=2v    77.8   7.5K
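Parent annotation (vertical order 2) is easy to state in code; an added sketch using the tuple tree encoding from earlier:

def parent_annotate(node, parent="ROOT"):
    # Split each nonterminal by its parent label: an NP under S becomes NP^S.
    label, *children = node
    if isinstance(children[0], str):       # leave preterminal tags alone
        return node
    return (f"{label}^{parent}",
            *(parent_annotate(c, label) for c in children))

t = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('DT', 'the'), ('NN', 'man')), ('VP^S', ('Vi', 'sleeps')))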
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base        77.8   7.5K
UNARY       78.3   8.0K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1     Size
Previous    78.3   8.0K
SPLIT-IN    80.3   8.1K
Other Tag Splits

Annotation                                                              F1     Size
UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")            80.4   8.1K
UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")          80.5   8.1K
TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)       81.2   8.5K
SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]            81.6   9.0K
SPLIT-CC: separate "but" and "&" from other conjunctions               81.7   9.1K
SPLIT-%: "%" gets its own tag                                          81.8   9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits  82.3    9.7K
POSS-NP     83.1    9.8K
SPLIT-VP    85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non)-recursive
    • Base NPs [cf. Collins 99]
    • Right-recursive NPs

Annotation      F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

[Diagram: a PP attachment site under VP marked v (dominates a verb) vs. under NP marked -v]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser                LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs
109
Heads in Context-Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY

[Diagram: building X[h] over span (i, j) from Y[h] over (i, k) and Z[h'] over (k, j)]

Example: (VP → VBD[saw] NP[her]) is the rule (VP → VBD NP)[saw]

bestScore(X, i, j, h):
  if (j == i)
    return score(X, s[i])
  else
    return max of:
      max over k, w, X → Y Z of score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, X → Y Z of score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs
119
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
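A sketch of one such per-cell beam (the constants are illustrative; Collins 99's actual thresholds are not reproduced here):

def prune_cell(cell, K=10, ratio=1e-4):
    # cell maps (label, head) -> score for one span <i, j>.
    # Keep at most K hypotheses and drop anything far below the cell's best.
    if not cell:
        return cell
    best = max(cell.values())
    kept = sorted(cell.items(), key=lambda kv: -kv[1])[:K]
    return {k: v for k, v in kept if v >= ratio * best}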
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results

Parser                LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Analysis / Evaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering?
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Latent annotations:
  • Brackets are known
  • Base categories are known
  • Hidden variables for subcategories
[Diagram: tree over "He was right" with latent subcategory labels X1 … X7 at each node]
• Can learn with EM, like Forward-Backward for HMMs:
  Forward ↔ Outside, Backward ↔ Inside
Automatic Annotation Induction
• Label all nodes with latent variables; the same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit, while others are undersplit

Model                  F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7
Refinement of the DT tag
DT → DT-1, DT-2, DT-3, DT-4
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • Start with two subcategories per non-terminal, then keep splitting
  • Initialize each EM run with the output of the last
Adaptive Splitting [Petrov et al. 06]
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with the split) − (data likelihood with the split reversed)
• There is no loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results
Model               F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results

Parser                     F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03             86.3              85.7
Matsuzaki et al. '05            86.7              86.1
Collins '99                     88.6              88.2
Charniak & Johnson '05          90.1              89.6
Petrov et al. '06               90.2              89.7
Hierarchical Pruning
• Parse multiple times, with grammars at different levels of granularity:
  coarse:          … QP NP VP …
  split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight:  …
Bracket Posteriors
[Coarse-to-fine parsing times as the grammar is refined: 1621 min → 111 min → 35 min → 15 min, at 91.2 F1 (no search error)]
Two views of linguistic structure 1 Constituency (phrase structure)
bull Phrase structure organizes words into nested constituents
bull How do we know what is a constituent (Not that linguists donrsquot argue about some cases)bull Distribution a constituent behaves as a unit that can appear in different
places
bull John talked [to the children] [about drugs]
bull John talked [about drugs] [to the children]
bull John talked drugs to the children about
bull Substitutionexpansionpro-forms
bull I sat [on the boxright on top of the boxthere]
bull Coordination regular internal structure no intrusion fragments semantics hellip
Two views of linguistic structure 2 Dependency structure
bull Dependency structure shows which words depend on (modify or are arguments of) which other words
The boy put the tortoise on the rug
rug
the
the
ontortoise
put
boy
The
Why Parse
bull Part of speech information
bull Phrase information
bull Useful relationships
8
The rise of annotated data
The Penn Treebank
( (S(NP-SBJ (DT The) (NN move))(VP (VBD followed)
(NP(NP (DT a) (NN round))(PP (IN of)(NP(NP (JJ similar) (NNS increases))(PP (IN by)
(NP (JJ other) (NNS lenders)))(PP (IN against)
(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))( )(S-ADV
(NP-SBJ (-NONE- ))(VP (VBG reflecting)(NP(NP (DT a) (VBG continuing) (NN decline))(PP-LOC (IN in)
(NP (DT that) (NN market)))))))( )))
[Marcus et al 1993 Computational Linguistics]
Penn Treebank Non-terminals
The rise of annotated data
bull Starting off building a treebank seems a lot slower and less useful than building a grammar
bull But a treebank gives us many thingsbull Reusability of the labor
bull Many parsers POS taggers etc
bull Valuable resource for linguistics
bull Broad coverage
bull Frequencies and distributional information
bull A way to evaluate systems
Statistical parsing applications
Statistical parsers are now robust and widely used in larger NLP applications
bull High precision question answering [Pasca and Harabagiu SIGIR 2001]
bull Improving biological named entity finding [Finkel et al JNLPBA 2004]
bull Syntactically based sentence compression [Lin and Wilbur 2007]
bull Extracting opinions about products [Bloom et al NAACL 2007]
bull Improved interaction in computer games [Gorniak and Roy 2005]
bull Helping linguists find data [Resnik et al BLS 2005]
bull Source sentence analysis for machine translation [Xu et al 2009]
bull Relation extraction systems [Fundel et al Bioinformatics 2006]
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPIN NP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPINNP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNthe
boy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNलड़क न रखाकछआकालीन
ऊपर
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites (a DOMINATES-V sketch follows):
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation    F1    Size
Previous      85.7  10.5K
BASE-NP       86.0  11.7K
DOMINATES-V   86.9  14.1K
RIGHT-REC-NP  87.0  15.2K

[Diagram: a PP attaching under VP vs. NP, with sites marked v / -v by whether they dominate a verb]
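The DOMINATES-V split is a single bottom-up tree walk. A sketch using the same tuple encoding as the -U example above, with an assumed set of verb tags:

VERB_TAGS = {"V", "VB", "VBD", "VBZ", "VBG", "VBN", "VBP", "MD"}

def mark_dominates_v(tree):
    """Return (annotated_tree, dominates_verb); phrasal nodes that
    dominate a verb tag get a ^v annotation, preterminals stay as-is."""
    if isinstance(tree, str):
        return tree, False
    label = tree[0]
    dom = label in VERB_TAGS
    kids = []
    for k in tree[1:]:
        ak, kdom = mark_dominates_v(k)
        kids.append(ak)
        dom = dom or kdom
    if dom and label not in VERB_TAGS and kids and not isinstance(kids[0], str):
        label += "^v"        # only phrasal (non-preterminal) nodes get ^v
    return (label, *kids), dom

tree = ("VP", ("V", "saw"), ("NP", ("DT", "the"), ("NN", "man")))
print(mark_dominates_v(tree)[0])
# ('VP^v', ('V', 'saw'), ('NP', ('DT', 'the'), ('NN', 'man')))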
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers
Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
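The head tables themselves did not survive extraction here, so the following is only a simplified head-finder in the general spirit of Collins-style rules; the priority lists and scan directions are illustrative, not the exact tables from the slides:

# Simplified head rules: for each category, a scan direction and a
# priority list of child labels to look for.
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NNPS", "NP", "PRP"]),
    "VP": ("left",  ["VBD", "VBZ", "VBP", "VBG", "VBN", "VB", "VP"]),
    "S":  ("left",  ["VP", "S"]),
    "PP": ("left",  ["IN", "TO"]),
}

def find_head(label, child_labels):
    """Return the index of the head child of a rule label -> child_labels."""
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    order = (range(len(child_labels)) if direction == "left"
             else range(len(child_labels) - 1, -1, -1))
    for tag in priorities:              # higher-priority tags first
        for i in order:
            if child_labels[i] == tag:
                return i
    return 0 if direction == "left" else len(child_labels) - 1  # fallback

print(find_head("NP", ["DT", "JJ", "NN"]))   # 2
print(find_head("VP", ["VBD", "NP", "PP"]))  # 0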
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
Lexicalized CKY
(VP → VBD[saw] NP[her])   vs.   (VP → VBD NP)[saw]

[Diagram: X[h] over span (i, j) splits at fencepost k into Y[h] over (i, k) and Z[h'] over (k, j)]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max of:
      max over k and rules X -> Y Z of
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k and rules X -> Y Z of
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
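A direct, naive Python rendering of this recurrence, memoized with lru_cache and run on a hypothetical toy lexicalized grammar; note one indexing difference (spans here are inclusive word ranges rather than fenceposts), and that each rule records which child supplies the head:

from functools import lru_cache

words = ["the", "boy", "saw", "her"]

# Hypothetical toy grammar. Binary rules are (X, Y, Z, head_side, prob);
# head_side "L" means X inherits its head word from Y, "R" from Z.
rules = [
    ("S",  "NP", "VP", "R", 1.0),
    ("NP", "DT", "NN", "R", 1.0),
    ("VP", "V",  "NP", "L", 1.0),
]
lexicon = {("DT", "the"): 1.0, ("NN", "boy"): 1.0,
           ("V", "saw"): 1.0, ("NP", "her"): 1.0}

@lru_cache(maxsize=None)
def best_score(X, i, j, h):
    """Best probability of an X over words[i..j] (inclusive) headed at h."""
    if i == j:
        return lexicon.get((X, words[i]), 0.0) if h == i else 0.0
    best = 0.0
    for (A, Y, Z, side, p) in rules:
        if A != X:
            continue
        for k in range(i, j):                      # Y spans i..k, Z spans k+1..j
            if side == "L" and i <= h <= k:        # head child is Y
                for w in range(k + 1, j + 1):      # head of the non-head child
                    best = max(best, p * best_score(Y, i, k, h)
                                       * best_score(Z, k + 1, j, w))
            elif side == "R" and k + 1 <= h <= j:  # head child is Z
                for w in range(i, k + 1):
                    best = max(best, p * best_score(Y, i, k, w)
                                       * best_score(Z, k + 1, j, h))
    return best

print(max(best_score("S", 0, 3, h) for h in range(4)))  # 1.0, headed by "saw"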
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j> (see the sketch below)
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
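A per-cell beam is just a bounded best-K list. A minimal sketch, assuming chart items for one span are (score, label, head) triples:

import heapq

def prune_cell(items, K=10):
    """Keep only the K highest-scoring hypotheses for one span <i,j>."""
    return heapq.nlargest(K, items)  # tuples compare on score first

cell = [(0.12, "NP", 1), (0.03, "VP", 2), (0.20, "S", 2), (0.001, "NP", 0)]
print(prune_cell(cell, K=2))  # [(0.2, 'S', 2), (0.12, 'NP', 1)]
# Why O(nK^2) per span: each of the O(n) split points pairs at most
# K x K surviving hypotheses from the two child spans.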
Parameter Estimation
A Model from Charniak (1997)
Other Details
Final Test Set Results
Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis / Evaluation (Method 2)
Dependency Accuracies
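Dependency accuracy is just the fraction of words whose predicted head index matches the gold head. A minimal sketch, with head indices written out for the tortoise sentence (the root is marked -1; the single error below is a hypothetical misattachment of "on"):

def dep_accuracy(gold_heads, pred_heads):
    assert len(gold_heads) == len(pred_heads)
    return sum(g == p for g, p in zip(gold_heads, pred_heads)) / len(gold_heads)

#       The boy put the tortoise on the rug
gold = [1,  2,  -1,  4,  2,       2,  7,  5]
pred = [1,  2,  -1,  4,  2,       4,  7,  5]   # "on" attached to "tortoise"
print(dep_accuracy(gold, pred))  # 0.875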
Strengths and Weaknesses of Modern Parsers
Modern Parsers
The Game of Designing a Grammar
Annotation refines base treebank symbols to improve the statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Diagram: a tree over "He was right" with latent subcategory symbols X1 … X7 at its nodes]
Can learn with EM, like Forward–Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
Automatic Annotation Induction
• Label all nodes with latent variables, with the same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7
Refinement of the DT tag
[Diagram: DT refined into DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories
• Start with two subcategories (per non-terminal), then keep splitting
• Initialize each EM run with the output of the last (see the skeleton below)
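A skeleton of the split step under these conventions; em_train is a stub standing in for inside–outside EM, and only left-hand sides are split here, whereas a full implementation also splits right-hand-side occurrences of each nonterminal:

import random

def split_grammar(rule_probs):
    """rule_probs: {(lhs, rhs): p}. Split every lhs X into X-0 and X-1,
    perturbing probabilities slightly so EM can break the symmetry."""
    new = {}
    for (lhs, rhs), p in rule_probs.items():
        for i in (0, 1):
            new[(f"{lhs}-{i}", rhs)] = p * (1 + random.uniform(-0.01, 0.01))
    return new

def em_train(grammar, treebank):
    return grammar  # stand-in for inside-outside EM over the treebank

grammar = {("NP", ("DT", "NN")): 0.3, ("DT", ("the",)): 1.0}
for _ in range(3):   # 2, 4, 8 subcategories per symbol...
    grammar = em_train(split_grammar(grammar), treebank=[])
print(len(grammar))  # 16 rules grown from the original 2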
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, roll back the splits which were least useful [Petrov et al. 06]
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with split reversed) / (data likelihood with split)
• There is no loss in accuracy when 50% of the splits are reversed
Adaptive Splitting Results
Model             F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories
Final Results
Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al. '05    86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al. 06        90.2             89.7
Hierarchical Pruning
Parse multiple times with grammars at different levels of granularity (sketch below):
coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … …
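A sketch of the pruning handoff between two levels, assuming the coarse pass yields span posteriors and each fine symbol projects to a coarse one (both dictionaries below are made up):

def allowed_cells(coarse_posteriors, projection, threshold=1e-4):
    """coarse_posteriors: {(i, j, coarse_label): p}.
    projection: {fine_label: coarse_label}.
    Returns the set of (i, j, fine_label) the fine pass may build."""
    keep = {cell for cell, p in coarse_posteriors.items() if p >= threshold}
    return {(i, j, f) for f, c in projection.items()
                      for (i, j, cc) in keep if cc == c}

posteriors = {(0, 2, "NP"): 0.9, (1, 3, "VP"): 0.4, (0, 3, "QP"): 1e-7}
projection = {"NP-1": "NP", "NP-2": "NP", "VP-1": "VP", "QP-1": "QP"}
print(sorted(allowed_cells(posteriors, projection)))
# [(0, 2, 'NP-1'), (0, 2, 'NP-2'), (1, 3, 'VP-1')]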
Bracket Posteriors
[Charts: bracket posteriors under successively refined grammars; parsing time drops from 1621 min to 111 min to 35 min to 15 min, at 91.2 F1 with no search error]
Why Parse
bull Part of speech information
bull Phrase information
bull Useful relationships
8
The rise of annotated data
The Penn Treebank
( (S(NP-SBJ (DT The) (NN move))(VP (VBD followed)
(NP(NP (DT a) (NN round))(PP (IN of)(NP(NP (JJ similar) (NNS increases))(PP (IN by)
(NP (JJ other) (NNS lenders)))(PP (IN against)
(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))( )(S-ADV
(NP-SBJ (-NONE- ))(VP (VBG reflecting)(NP(NP (DT a) (VBG continuing) (NN decline))(PP-LOC (IN in)
(NP (DT that) (NN market)))))))( )))
[Marcus et al 1993 Computational Linguistics]
Penn Treebank Non-terminals
The rise of annotated data
bull Starting off building a treebank seems a lot slower and less useful than building a grammar
bull But a treebank gives us many thingsbull Reusability of the labor
bull Many parsers POS taggers etc
bull Valuable resource for linguistics
bull Broad coverage
bull Frequencies and distributional information
bull A way to evaluate systems
Statistical parsing applications
Statistical parsers are now robust and widely used in larger NLP applications
bull High precision question answering [Pasca and Harabagiu SIGIR 2001]
bull Improving biological named entity finding [Finkel et al JNLPBA 2004]
bull Syntactically based sentence compression [Lin and Wilbur 2007]
bull Extracting opinions about products [Bloom et al NAACL 2007]
bull Improved interaction in computer games [Gorniak and Roy 2005]
bull Helping linguists find data [Resnik et al BLS 2005]
bull Source sentence analysis for machine translation [Xu et al 2009]
bull Relation extraction systems [Fundel et al Bioinformatics 2006]
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPIN NP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPINNP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNthe
boy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNलड़क न रखाकछआकालीन
ऊपर
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label.

[Tree fragment illustrating the point, with rules S → NP VP and NP → DT NN]
Non-Independence I
• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

                 NP PP   DT NN   PRP
  All NPs         11%      9%     6%
  NPs under S      9%      9%    21%
  NPs under VP    23%      7%     4%
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong.

[Example tree omitted; in the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states: an intermediate (binarized) symbol remembers only the previous k sisters rather than the whole rewrite generated so far.

[Plots: parsing F1 (y-axis 70–74) and number of grammar symbols (y-axis 0–12,000) as a function of horizontal Markov order: 0, 1, 2v, 2, ∞]
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation). Order 1 vs. Order 2.

[Plots: parsing F1 (y-axis 72–79) and number of grammar symbols (y-axis 0–25,000) as a function of vertical Markov order: 1, 2v, 2, 3v, 3]
Model     F1     Size
v=h=2v   77.8    7.5K
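To make the two knobs concrete, here is a minimal Python sketch of Markovized binarization under stated assumptions (left-to-right binarization; vertical order 2 simulated by annotating a category with its own parent; the @-style symbol naming is illustrative, not the lecture's convention):

    def markovize(parent, children, h=1, grandparent=None):
        # Left-to-right binarization of an n-ary rule (len(children) >= 2, h >= 1).
        # Intermediate symbols remember only the last h generated sisters;
        # vertical order 2 annotates the parent category with its own parent.
        label = parent + ("^" + grandparent if grandparent else "")
        rules, lhs = [], label
        for i in range(len(children) - 1):
            if i == len(children) - 2:
                rhs = [children[i], children[i + 1]]          # attach the last two
            else:
                memory = children[max(0, i - h + 1): i + 1]   # last h sisters
                rhs = [children[i], "@%s|%s" % (label, "..".join(memory))]
            rules.append((lhs, rhs))
            lhs = rhs[-1]
        return rules

    # VP -> VBD NP PP PP under an S parent, horizontal order 1:
    for rule in markovize("VP", ["VBD", "NP", "PP", "PP"], h=1, grandparent="S"):
        print(rule)

With h = 1, histories that end in the same sister map to the same intermediate symbol, which is exactly the state merging plotted above; h = ∞ keeps the whole history and merges nothing.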
Unary Splits
• Problem: unary rewrites are used to transmute categories so that a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation   F1     Size
Base        77.8    7.5K
UNARY       78.3    8.0K
Tag Splits
• Problem: treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation   F1     Size
Previous    78.3    8.0K
SPLIT-IN    80.3    8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT    80.4    8.1K
UNARY-RB    80.5    8.1K
TAG-PA      81.2    8.5K
SPLIT-AUX   81.6    9.0K
SPLIT-CC    81.7    9.1K
SPLIT-%     81.8    9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes.

Annotation    F1     Size
tag splits   82.3    9.7K
POSS-NP      83.1    9.8K
SPLIT-VP     85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights.
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation      F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

[Diagram: NP attachment sites under VP vs. PP, marked v / -v]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers.

Parser                LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY
• Each constituent now carries a head word: (VP → VBD[saw] NP[her]) is the binarized rule (VP → VBD NP)[saw]; the parent inherits its head from one designated child.

[Diagram: Y[h] over (i,k) combines with Z[h'] over (k,j) to form X[h] over (i,j)]

    bestScore(X, i, j, h)
      if (j == i)
        return score(X -> s[i])
      else
        return max over split points k and rules X -> Y Z of:
          score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)   // head from left child
          score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)   // head from right child
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (n split points, and at most K × K child-hypothesis pairs per split), which keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
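A minimal sketch of the keep-top-K step (one cell's hypotheses as a dict from (label, head) to Viterbi score; names and the data layout are illustrative, and a real parser would also keep backpointers):

    import heapq

    def prune_cell(hypotheses, K):
        # Keep only the K best-scoring hypotheses for one span <i,j>.
        if len(hypotheses) <= K:
            return hypotheses
        kept = heapq.nlargest(K, hypotheses.items(), key=lambda kv: kv[1])
        return dict(kept)

    # With per-cell beams, combining span <i,k> with span <k,j> touches at
    # most K x K hypothesis pairs per split point, so each of the O(n^2)
    # spans costs O(n K^2) -- more or less cubic overall.
    cell = {("NP", "fish"): 0.14, ("VP", "fish"): 0.06, ("S", "fish"): 0.006}
    print(prune_cell(cell, K=2))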
Parameter Estimation

A Model from Charniak (1997)

Other Details
Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
• Can learn with EM, like Forward-Backward for HMMs (Forward = Outside, Backward = Inside)

[Diagram: tree over "He was right" with latent subcategory variables X1…X7 at each node]
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories.
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7
Refinement of the DT Tag
• Example: the DT tag is refined into DT-1, DT-2, DT-3, DT-4.
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • start with two (per non-terminal), then keep splitting
  • initialize each EM run with the output of the last
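A sketch of the symbol-splitting step under simple assumptions (binary rules only; each rule's probability is spread uniformly over the child-subsymbol combinations, with a small random perturbation so EM can break the symmetry between the halves; the EM re-estimation itself, which needs inside-outside counts, is omitted):

    import itertools
    import random

    def subsymbols(sym):
        # "NP" -> ["NP-0", "NP-1"]; applied again: "NP-0" -> ["NP-0-0", "NP-0-1"].
        return [sym + "-0", sym + "-1"]

    def split_grammar(binary_rules, noise=0.01, seed=0):
        # Split every category in two. Per parent subsymbol, each rule's mass
        # is divided over the 4 child combinations, then lightly perturbed.
        rng = random.Random(seed)
        out = {}
        for (a, b, c), p in binary_rules.items():
            for a2 in subsymbols(a):
                for b2, c2 in itertools.product(subsymbols(b), subsymbols(c)):
                    out[(a2, b2, c2)] = p / 4 * (1 + rng.uniform(-noise, noise))
        return out

    toy = {("S", "NP", "VP"): 0.9, ("NP", "NP", "NP"): 0.1}
    g1 = split_grammar(toy)           # 2 rules -> 16 subcategory rules
    g2 = split_grammar(g1, seed=1)    # split again: 16 -> 128
    print(len(g1), len(g2))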
Adaptive Splitting
• Want to split complex categories more.
• Idea: split everything, then roll back the splits which were least useful. [Petrov et al. 06]
• Evaluate the loss in likelihood from removing each split:

    loss(split) = (data likelihood with split reversed) / (data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed.
Adaptive Splitting Results

Model              F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories
Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03            86.3              85.7
Matsuzaki et al. '05           86.7              86.1
Collins '99                    88.6              88.2
Charniak & Johnson '05         90.1              89.6
Petrov et al. 06               90.2              89.7
Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity (see the sketch below):
  coarse:          … QP NP VP …
  split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight:  … … …
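A minimal sketch of the coarse-to-fine idea under simple assumptions (a fine symbol projects to its coarse ancestor by stripping the split suffix; a fine hypothesis is pruned when its coarse projection scored below a threshold in the previous pass; names and the threshold are illustrative):

    def project(fine_sym):
        # "NP1", "NP2", ... project back to the coarse symbol "NP".
        return fine_sym.rstrip("0123456789")

    def allowed(fine_sym, coarse_chart, span, threshold=1e-5):
        # Prune a fine hypothesis if its coarse projection was improbable.
        return coarse_chart.get((project(fine_sym), span), 0.0) >= threshold

    # After a coarse pass, only NP survived over span (0, 2):
    coarse_chart = {("NP", (0, 2)): 0.0049, ("VP", (0, 2)): 1e-9}
    for sym in ["NP1", "NP2", "VP1", "VP2"]:
        print(sym, allowed(sym, coarse_chart, (0, 2)))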
Bracket Posteriors

[Bracket posterior plots; parsing times at increasing pruning levels: 1621 min, 111 min, 35 min, 15 min — 91.2 F1, no search error]
The rise of annotated data
The Penn Treebank
( (S(NP-SBJ (DT The) (NN move))(VP (VBD followed)
(NP(NP (DT a) (NN round))(PP (IN of)(NP(NP (JJ similar) (NNS increases))(PP (IN by)
(NP (JJ other) (NNS lenders)))(PP (IN against)
(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))( )(S-ADV
(NP-SBJ (-NONE- ))(VP (VBG reflecting)(NP(NP (DT a) (VBG continuing) (NN decline))(PP-LOC (IN in)
(NP (DT that) (NN market)))))))( )))
[Marcus et al 1993 Computational Linguistics]
Penn Treebank Non-terminals
The rise of annotated data
bull Starting off building a treebank seems a lot slower and less useful than building a grammar
bull But a treebank gives us many thingsbull Reusability of the labor
bull Many parsers POS taggers etc
bull Valuable resource for linguistics
bull Broad coverage
bull Frequencies and distributional information
bull A way to evaluate systems
Statistical parsing applications
Statistical parsers are now robust and widely used in larger NLP applications
bull High precision question answering [Pasca and Harabagiu SIGIR 2001]
bull Improving biological named entity finding [Finkel et al JNLPBA 2004]
bull Syntactically based sentence compression [Lin and Wilbur 2007]
bull Extracting opinions about products [Bloom et al NAACL 2007]
bull Improved interaction in computer games [Gorniak and Roy 2005]
bull Helping linguists find data [Resnik et al BLS 2005]
bull Source sentence analysis for machine translation [Xu et al 2009]
bull Relation extraction systems [Fundel et al Bioinformatics 2006]
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPIN NP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPINNP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNthe
boy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNलड़क न रखाकछआकालीन
ऊपर
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Penn Treebank Non-terminals
The rise of annotated data
bull Starting off building a treebank seems a lot slower and less useful than building a grammar
bull But a treebank gives us many thingsbull Reusability of the labor
bull Many parsers POS taggers etc
bull Valuable resource for linguistics
bull Broad coverage
bull Frequencies and distributional information
bull A way to evaluate systems
Statistical parsing applications
Statistical parsers are now robust and widely used in larger NLP applications
bull High precision question answering [Pasca and Harabagiu SIGIR 2001]
bull Improving biological named entity finding [Finkel et al JNLPBA 2004]
bull Syntactically based sentence compression [Lin and Wilbur 2007]
bull Extracting opinions about products [Bloom et al NAACL 2007]
bull Improved interaction in computer games [Gorniak and Roy 2005]
bull Helping linguists find data [Resnik et al BLS 2005]
bull Source sentence analysis for machine translation [Xu et al 2009]
bull Relation extraction systems [Fundel et al Bioinformatics 2006]
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPIN NP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPINNP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNthe
boy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNलड़क न रखाकछआकालीन
ऊपर
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
Finally the span-4 step fills score[0][4]:

score[0][4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522
Call buildTree(score, back) to get the best parse.
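To make the walkthrough concrete, here is a small self-contained sketch of CKY with unary handling on the toy grammar above; Python and the helper names are my choice, not from the slides, but it reproduces the chart values shown.

    from collections import defaultdict

    # The toy grammar and lexicon from the walkthrough above.
    unary  = {('S', 'VP'): 0.1, ('NP', 'N'): 0.7, ('VP', 'V'): 0.1}
    binary = {('S', 'NP', 'VP'): 0.9, ('VP', 'V', 'NP'): 0.5,
              ('VP', 'V', 'VP_V'): 0.3, ('VP', 'V', 'PP'): 0.1,
              ('VP_V', 'NP', 'PP'): 1.0, ('NP', 'NP', 'NP'): 0.1,
              ('NP', 'NP', 'PP'): 0.2, ('PP', 'P', 'NP'): 1.0}
    lexicon = {('N', 'people'): 0.5, ('N', 'fish'): 0.2, ('N', 'tanks'): 0.2,
               ('N', 'rods'): 0.1, ('V', 'people'): 0.1, ('V', 'fish'): 0.6,
               ('V', 'tanks'): 0.3, ('P', 'with'): 1.0}

    def cky(words):
        score = defaultdict(float)   # (begin, end, symbol) -> best probability
        back = {}                    # back-pointers, for buildTree

        def handle_unaries(b, e):
            added = True
            while added:             # take the closure of unary rules over this cell
                added = False
                for (a, child), p in unary.items():
                    prob = p * score[b, e, child]
                    if prob > score[b, e, a]:
                        score[b, e, a], back[b, e, a] = prob, child
                        added = True

        n = len(words)
        for i, w in enumerate(words):                # lexical step, then unaries
            for (tag, word), p in lexicon.items():
                if word == w:
                    score[i, i + 1, tag] = p
            handle_unaries(i, i + 1)
        for span in range(2, n + 1):                 # binary step over growing spans
            for begin in range(n - span + 1):
                end = begin + span
                for split in range(begin + 1, end):
                    for (a, b, c), p in binary.items():
                        prob = score[begin, split, b] * score[split, end, c] * p
                        if prob > score[begin, end, a]:
                            score[begin, end, a] = prob
                            back[begin, end, a] = (split, b, c)
                handle_unaries(begin, end)
        return score, back

    score, back = cky('fish people fish tanks'.split())
    print(round(score[0, 4, 'S'], 8))   # 0.00018522, matching the final chart cell

Run on "fish people fish tanks", this reproduces every cell above, e.g. score[0, 2, 'S'] = 0.0105 after the unary pass.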
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
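Those percentages are just multiset precision and recall over labeled spans; a quick Python check (representing brackets as (label, start, end) tuples is my choice):

    from collections import Counter

    gold = Counter([('S', 0, 11), ('NP', 0, 2), ('VP', 2, 9), ('VP', 3, 9),
                    ('NP', 4, 6), ('PP', 6, 9), ('NP', 7, 9), ('NP', 9, 10)])
    cand = Counter([('S', 0, 11), ('NP', 0, 2), ('VP', 2, 10), ('VP', 3, 10),
                    ('NP', 4, 6), ('PP', 6, 10), ('NP', 7, 10)])

    matched = sum((gold & cand).values())      # multiset intersection: 3 correct brackets
    precision = matched / sum(cand.values())   # 3/7
    recall    = matched / sum(gold.values())   # 3/8
    f1 = 2 * precision * recall / (precision + recall)
    print(f'LP={precision:.1%}  LR={recall:.1%}  F1={f1:.1%}')

This prints LP=42.9%, LR=37.5%, F1=40.0%, as above.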
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)
A Case of PP Attachment Ambiguity
[Two candidate parse trees, omitted in this text version]
A Case of Coordination Ambiguity
[Two candidate parse trees, omitted]
Structural Preferences: Close Attachment
• Example: "John was believed to have been shot by Bill"
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive exactly the same probability: p(t) is a product over the same multiset of rules.
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Tree figure: S → NP VP with NP → DT NN; only the NP label mediates between inside and outside. Omitted.]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)
[Histogram of NP expansion frequencies (NP PP / DT NN / PRP): all NPs roughly 11% / 9% / 6%; NPs under S roughly 9% / 9% / 21%; NPs under VP roughly 23% / 7% / 4%]
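Such distributions are easy to measure from a treebank; a sketch in Python, over nested-list trees of my own representation:

    from collections import Counter, defaultdict

    def np_expansions(tree, parent=None, counts=None):
        """Tally how NP rewrites, keyed by the NP's parent category.
        Trees are nested lists [label, child1, ...]; leaves are strings."""
        if counts is None:
            counts = defaultdict(Counter)
        label, kids = tree[0], tree[1:]
        if label == 'NP' and parent is not None:
            rhs = ' '.join(k[0] for k in kids if not isinstance(k, str))
            counts[parent][rhs] += 1
        for k in kids:
            if not isinstance(k, str):
                np_expansions(k, label, counts)
        return counts

    t = ['S', ['NP', ['PRP', 'they']],
              ['VP', ['V', 'fish'],
                     ['NP', ['NP', ['N', 'tanks']],
                            ['PP', ['P', 'with'], ['NP', ['N', 'rods']]]]]]
    for parent, dist in np_expansions(t).items():
        print(parent, dict(dist))   # e.g. S {'PRP': 1}, VP {'NP PP': 1}, ...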
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
[Example figure omitted; its caption notes that in the PTB this construction is for possessives]
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states
[Charts: parsing F1 (roughly 70-74) and number of grammar symbols (0-12,000) as a function of horizontal Markov order ∈ {0, 1, 2v, 2, ∞}]
Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation)
• Order 1 vs. Order 2
[Charts: parsing F1 (roughly 72-79) and number of grammar symbols (0-25,000) as a function of vertical Markov order ∈ {1, 2v, 2, 3v, 3}]

Model     F1    Size
v=h=2v    77.8  7.5K
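As a rough illustration of what these two annotations do to a tree (a sketch under my own conventions, not Klein & Manning's exact transform), the snippet below parent-annotates every node (vertical order 2) and right-binarizes n-ary rules while remembering only the last sibling (h = 1); trees are nested lists as in the earlier sketch.

    def parent_annotate(tree, parent='ROOT'):
        """Vertical order 2: append the parent category to every node label."""
        if isinstance(tree, str):
            return tree
        label = tree[0]
        return [f'{label}^{parent}'] + [parent_annotate(c, label) for c in tree[1:]]

    def binarize(tree, h=1):
        """Right-factor n-ary rules; intermediate symbols keep the parent plus
        the last h already-generated sibling labels (horizontal Markovization)."""
        if isinstance(tree, str):
            return tree
        parent, kids = tree[0], [binarize(c, h) for c in tree[1:]]
        if len(kids) <= 2:
            return [parent] + kids
        def label_of(node):
            return node if isinstance(node, str) else node[0]
        def chain(seen, rest):
            sym = f'@{parent}->..{"_".join(seen[-h:])}'
            if len(rest) == 2:
                return [sym, rest[0], rest[1]]
            return [sym, rest[0], chain(seen + [label_of(rest[0])], rest[1:])]
        return [parent, kids[0], chain([label_of(kids[0])], kids[1:])]

    t = ['S', ['NP', ['N', 'people']],
              ['VP', ['V', 'fish'], ['NP', ['N', 'tanks']],
                     ['PP', ['P', 'with'], ['NP', ['N', 'rods']]]]]
    print(binarize(parent_annotate(t)))

The ternary VP^S comes out as VP^S → V^VP @VP^S->..V^VP, with the intermediate symbol recording the one sibling of history that h = 1 allows.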
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1    Size
Base         77.8  7.5K
UNARY        78.3  8.0K
Tag Splits
• Problem: treebank tags are too coarse
• Example: sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1    Size
Previous     78.3  8.0K
SPLIT-IN     80.3  8.1K
Other Tag Splits

Annotation  Description                                               F1    Size
UNARY-DT    mark demonstratives as DT^U ("the X" vs. "those")         80.4  8.1K
UNARY-RB    mark phrasal adverbs as RB^U ("quickly" vs. "very")       80.5  8.1K
TAG-PA      mark tags with non-canonical parents ("not" is an RB^VP)  81.2  8.5K
SPLIT-AUX   mark auxiliary verbs with -AUX [cf. Charniak 97]          81.6  9.0K
SPLIT-CC    separate "but" and "&" from other conjunctions            81.7  9.1K
SPLIT-%     "%" gets its own tag                                      81.8  9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes (a sketch of the VP case follows the table below)

Annotation   F1    Size
tag splits   82.3  9.7K
POSS-NP      83.1  9.8K
SPLIT-VP     85.7  10.5K
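For instance, a SPLIT-VP-style mark can be sketched as below; the naming and the tag set are illustrative, not the exact Klein & Manning scheme, and trees are the same nested lists as in the earlier sketches.

    FINITE_TAGS = {'VBZ', 'VBD', 'VBP', 'MD'}   # an assumed set of finite verb tags

    def split_vp(tree):
        """Mark each VP by whether it directly dominates a finite verb tag."""
        if isinstance(tree, str):
            return tree
        label, kids = tree[0], [split_vp(c) for c in tree[1:]]
        child_labels = {k[0] for k in kids if not isinstance(k, str)}
        if label == 'VP':
            label = 'VP-VBF' if child_labels & FINITE_TAGS else 'VP-VBNF'
        return [label] + kids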
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower attachment sites:
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
[Tree figure: an NP attachment site inside a VP (dominates a verb, v) vs. inside an NP (-v). Omitted.]

Annotation      F1    Size
Previous        85.7  10.5K
BASE-NP         86.0  11.7K
DOMINATES-V     86.9  14.1K
RIGHT-REC-NP    87.0  15.2K
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
[These slides walk through head-finding rules and head-annotated trees; the figures are omitted in this text version]
Lexicalized CKY
[Figure: Y[h] and Z[h′] combine into X[h] over span (i, j) with split point k and head positions h, h′; the lexicalized rule can be written (VP → VBD[saw] NP[her]), i.e. (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
    if (j == i)
        return score(X → s[i])
    else
        return max over k, w, X → Y Z of:
            score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
            score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n⁵) CKY
  • Remember only a few hypotheses for each span ⟨i, j⟩
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
  • Keeps things more or less cubic
• Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
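A minimal sketch of the per-cell beam idea (my own simplification, not the Collins parser's actual bookkeeping):

    import heapq

    def prune_cell(cell, K=10):
        """cell: dict mapping (symbol, head) -> score. Keep only the K best items."""
        return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

    # Usage inside a lexicalized CKY loop (hypothetical names):
    # chart[i][j] = prune_cell(chart[i][j], K=10)
    # With at most K hypotheses per span, combining two cells costs O(K^2)
    # per split point, which is what keeps the overall work near-cubic.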
Parameter Estimation
A Model from Charniak (1997)
Other Details
[The model-definition slides are equation figures, omitted in this text version]
Final Test Set Results

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis/Evaluation (Method 2)
Dependency Accuracies
Strengths and Weaknesses of Modern Parsers
Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories: label all nodes with latent variables, with the same number k of subcategories for all categories
• Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
[Figure: a tree over "He was right" with latent subcategory variables X1 … X7 at its nodes]

Automatic Annotation Induction
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7
Refinement of the DT tag
[Figure: DT hierarchically split into DT-1, DT-2, DT-3, DT-4]
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • start with two subcategories per non-terminal, then keep splitting
  • initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al. 06]
• Evaluate the loss in likelihood from removing each split:
      loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed (merged)
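In sketch form, the merge pass just ranks splits by this ratio; the numbers below are made up for illustration (the real quantities come from inside-outside expected counts):

    # Hypothetical log-likelihoods, for illustration only.
    loglik_with_split = {'NP-1/NP-2': -10021.3, 'DT-1/DT-2': -10020.9, 'VP-1/VP-2': -10019.8}
    loglik_merged     = {'NP-1/NP-2': -10021.4, 'DT-1/DT-2': -10035.2, 'VP-1/VP-2': -10019.9}

    # The likelihood ratio becomes a difference in log space; small loss => safe to merge.
    loss = {s: loglik_with_split[s] - loglik_merged[s] for s in loglik_with_split}
    for split, l in sorted(loss.items(), key=lambda kv: kv[1]):
        print(f'{split}: log-likelihood loss {l:.1f}')
    # Reverse (merge) the 50% of splits with the smallest loss.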
Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results

Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al. '05    86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al. 06        90.2             89.7
Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity:
  coarse:          … QP NP VP …
  split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight:  … and so on …
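The bookkeeping behind this coarse-to-fine pruning can be sketched as follows (a simplification under my own naming; the span posteriors would come from inside-outside with the coarser grammar, which is not shown here):

    def project(symbol):
        """Map a refined symbol (e.g. 'NP3') to its coarser parent (e.g. 'NP')."""
        return symbol.rstrip('0123456789')

    def allowed(span_posteriors, begin, end, refined_symbol, threshold=1e-4):
        """Prune a refined chart item if its coarse projection was improbable."""
        coarse = project(refined_symbol)
        return span_posteriors.get((begin, end, coarse), 0.0) >= threshold

    # Usage in the finer parsing pass (hypothetical):
    # if allowed(posteriors_from_coarse_pass, b, e, 'NP3'):
    #     ... score the refined item as usual ...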
Bracket Posteriors
[Figure: bracket posterior plots at successive refinement levels, omitted]
• Parsing time drops from 1621 min to 111 min to 35 min to 15 min [91.2 F1] (no search error)
The rise of annotated data
bull Starting off building a treebank seems a lot slower and less useful than building a grammar
bull But a treebank gives us many thingsbull Reusability of the labor
bull Many parsers POS taggers etc
bull Valuable resource for linguistics
bull Broad coverage
bull Frequencies and distributional information
bull A way to evaluate systems
Statistical parsing applications
Statistical parsers are now robust and widely used in larger NLP applications
bull High precision question answering [Pasca and Harabagiu SIGIR 2001]
bull Improving biological named entity finding [Finkel et al JNLPBA 2004]
bull Syntactically based sentence compression [Lin and Wilbur 2007]
bull Extracting opinions about products [Bloom et al NAACL 2007]
bull Improved interaction in computer games [Gorniak and Roy 2005]
bull Helping linguists find data [Resnik et al BLS 2005]
bull Source sentence analysis for machine translation [Xu et al 2009]
bull Relation extraction systems [Fundel et al Bioinformatics 2006]
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPIN NP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPINNP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNthe
boy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNलड़क न रखाकछआकालीन
ऊपर
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Statistical parsing applications
Statistical parsers are now robust and widely used in larger NLP applications
bull High precision question answering [Pasca and Harabagiu SIGIR 2001]
bull Improving biological named entity finding [Finkel et al JNLPBA 2004]
bull Syntactically based sentence compression [Lin and Wilbur 2007]
bull Extracting opinions about products [Bloom et al NAACL 2007]
bull Improved interaction in computer games [Gorniak and Roy 2005]
bull Helping linguists find data [Resnik et al BLS 2005]
bull Source sentence analysis for machine translation [Xu et al 2009]
bull Relation extraction systems [Fundel et al Bioinformatics 2006]
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPIN NP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPINNP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNthe
boy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNलड़क न रखाकछआकालीन
ऊपर
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
The CKY algorithm in action: parsing "fish people fish tanks"

The grammar (binary rules plus unaries; no epsilons):
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

Step 1: the lexicon fills the diagonal cells

for i = 0; i < #(words); i++
  for A in nonterms
    if A → words[i] in grammar
      score[i][i+1][A] = P(A → words[i])

[0,1] fish:   N 0.2, V 0.6
[1,2] people: N 0.5, V 0.1
[2,3] fish:   N 0.2, V 0.6
[3,4] tanks:  N 0.2, V 0.1

Step 2: unary closure on the diagonal (NP → N, VP → V, S → VP)

// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A → B in grammar
      prob = P(A → B) * score[i][i+1][B]
      if prob > score[i][i+1][A]
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true

[0,1] fish:   N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
[1,2] people: N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001
[2,3] fish:   N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
[3,4] tanks:  N 0.2, V 0.1, NP 0.14, VP 0.03, S 0.003

Step 3: binary rules fill the span-2 cells

prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
if prob > score[begin][end][A]
  score[begin][end][A] = prob
  back[begin][end][A] = new Triple(split, B, C)

[0,2] fish people: NP → NP NP 0.0049, VP → V NP 0.105, S → NP VP 0.00126
[1,3] people fish: NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
[2,4] fish tanks:  NP → NP NP 0.00196, VP → V NP 0.042, S → NP VP 0.00378

Step 4: unary closure over the span-2 cells; S → VP now beats S → NP VP in two of them

[0,2]: NP 0.0049, VP 0.105, S → VP 0.0105
[1,3]: NP 0.0049, VP 0.007, S → NP VP 0.0189 (unchanged)
[2,4]: NP 0.00196, VP 0.042, S → VP 0.0042

Step 5: longer spans maximize over all split points

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

[0,3] fish people fish:  NP 0.0000686, VP 0.00147, S 0.000882
[1,4] people fish tanks: NP 0.0000686, VP 0.000098, S → NP VP 0.01323

Step 6: the full sentence, cell [0,4]

NP → NP NP 0.0000009604
VP → V NP 0.00002058
S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse.
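To make the walkthrough concrete, here is a small Python sketch of probabilistic CKY with unary closure over the toy grammar above. It is illustrative code written for these slides (not any particular parser's implementation); running it reproduces the chart values shown, e.g. S over the whole sentence scores 0.00018522.

from collections import defaultdict

# Toy PCFG from the slides: binary rules, unary rules, and a lexicon.
binary = {  # (A, B, C): P(A -> B C)
    ("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
    ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
    ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
    ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0,
}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexicon = {
    ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
    ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
    ("V", "tanks"): 0.3, ("P", "with"): 1.0,
}

def cky(words):
    n = len(words)
    score = defaultdict(float)   # (i, j, A) -> best probability
    back = {}                    # (i, j, A) -> backpointer

    def close_unaries(i, j):
        added = True             # apply A -> B until nothing improves
        while added:
            added = False
            for (A, B), p in unary.items():
                prob = p * score[i, j, B]
                if prob > score[i, j, A]:
                    score[i, j, A] = prob
                    back[i, j, A] = B
                    added = True

    for i, w in enumerate(words):            # lexicon: cells [i, i+1]
        for (A, word), p in lexicon.items():
            if word == w:
                score[i, i + 1, A] = p
        close_unaries(i, i + 1)

    for span in range(2, n + 1):             # binary step, then unaries
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
            close_unaries(begin, end)
    return score, back

score, back = cky("fish people fish tanks".split())
print(score[0, 4, "S"])   # 0.00018522, matching the chart above

A buildTree step would simply follow the back entries from (0, 4, "S") downward, expanding unary backpointers (a label) and binary ones (a split point plus two child labels).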
Evaluating constituency parsing

Gold standard brackets:
S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets:
S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
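As a sanity check, the metrics above can be computed directly from the two bracket sets. This is a small illustrative Python snippet; encoding brackets as (label, start, end) triples is just an assumption for the example.

# Brackets from the slide, as (label, start, end) triples.
gold = {("S",0,11), ("NP",0,2), ("VP",2,9), ("VP",3,9),
        ("NP",4,6), ("PP",6,9), ("NP",7,9), ("NP",9,10)}
cand = {("S",0,11), ("NP",0,2), ("VP",2,10), ("VP",3,10),
        ("NP",4,6), ("PP",6,10), ("NP",7,10)}

matched = len(gold & cand)                  # 3 correct brackets
precision = matched / len(cand)             # 3/7 = 42.9%
recall = matched / len(gold)                # 3/8 = 37.5%
f1 = 2 * precision * recall / (precision + recall)   # 40.0%
print(f"LP={precision:.1%} LR={recall:.1%} F1={f1:.1%}")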
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust: they usually admit everything, but with low probability
• Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse, but not a very good one, because the independence assumptions are too strong
• They give a probabilistic language model, but in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
• (A word is independent of the rest of the tree given its POS)
A Case of PP Attachment Ambiguity

A Case of Coordination Ambiguity
Structural Preferences: Close Attachment
• Example: "John was believed to have been shot by Bill"
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Figure: an S expanded by S → NP VP, with the NP further expanded by NP → DT NN]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%
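This kind of context-dependence can be checked directly on treebank data. Here is a rough Python sketch using NLTK's bundled 10% Penn Treebank sample (assuming nltk and its treebank corpus are installed; the percentages above come from the full PTB, so the sample will only approximate them).

from collections import Counter
from nltk.corpus import treebank   # assumes NLTK's PTB sample is installed

def base(label):
    # Strip PTB function tags and indices, e.g. NP-SBJ-1 -> NP
    return label.split("-")[0].split("=")[0]

counts = Counter()
for tree in treebank.parsed_sents():
    for pos in tree.treepositions():
        node = tree[pos]
        if isinstance(node, str) or not pos or base(node.label()) != "NP":
            continue                      # skip leaves, the root, and non-NPs
        parent = base(tree[pos[:-1]].label())
        if parent in ("S", "VP"):
            rhs = " ".join(c if isinstance(c, str) else base(c.label())
                           for c in node)
            counts[(parent, rhs)] += 1

for ctx in ("S", "VP"):
    total = sum(n for (c, _), n in counts.items() if c == ctx)
    best = [(r, round(100 * n / total)) for (c, r), n in counts.most_common()
            if c == ctx][:3]
    print(f"NPs under {ctx}:", best)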
Non-Independence II
• Symptoms of overly strong assumptions: rewrites get used where they don't belong
• In the PTB, this construction is for possessives
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states: an intermediate (binarized) symbol remembers only the last few sisters already generated, not the whole rule prefix.

[Two plots vs. horizontal Markov order (0, 1, 2v, 2, ∞): parsing accuracy rises from roughly 71% to 74%, while the number of grammar symbols grows from under 3,000 to about 12,000.]
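As an illustration, here is a right-branching binarization in which each intermediate symbol keeps only the last h sisters already generated. This is our own sketch of one common convention, not any specific parser's code.

def binarize(lhs, rhs, h=1):
    # Right-branching binarization with horizontal Markov order h:
    # intermediate symbols record only the last h generated sisters.
    rules, prev, current = [], [], lhs
    while len(rhs) > 2:
        first, rhs = rhs[0], rhs[1:]
        prev = (prev + [first])[-h:]          # forget all but h sisters
        new_sym = f"{lhs}|<{'-'.join(prev)}>"
        rules.append((current, (first, new_sym)))
        current = new_sym
    rules.append((current, tuple(rhs)))
    return rules

print(binarize("VP", ["VBZ", "NP", "PP", "PP"], h=1))
# [('VP', ('VBZ', 'VP|<VBZ>')), ('VP|<VBZ>', ('NP', 'VP|<NP>')),
#  ('VP|<NP>', ('PP', 'PP'))]

With small h, intermediate states generated from different rules collide (every rule that has just emitted an NP shares VP|<NP>), which is exactly the state merging the plots quantify.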
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
• Order 1 is the plain treebank grammar; Order 2 additionally conditions each symbol on its parent.

[Two plots vs. vertical Markov order (1, 2v, 2, 3v, 3): parsing accuracy rises from roughly 72% to 79%, while the number of grammar symbols grows to about 25,000.]
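Order-2 vertical Markovization is just parent annotation, which is easy to express as a tree transform. A minimal Python sketch, under the assumption that trees are (label, children) tuples with word strings at the leaves:

def annotate_parent(tree, parent="ROOT"):
    # Split each phrasal label by its parent, e.g. NP -> NP^S or NP^VP.
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return (label, children)          # leave preterminal tags unsplit here
    return (f"{label}^{parent}",
            [annotate_parent(c, label) for c in children])

t = ("S", [("NP", [("PRP", ["He"])]),
           ("VP", [("VBD", ["was"]), ("ADJP", [("JJ", ["right"])])])])
print(annotate_parent(t))
# S^ROOT with children NP^S, VP^S, and ADJP^VP inside the VP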
Model     F1     Size
v=h=2v    77.8   7.5K
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
Tag Splits
• Problem: treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
Other Tag Splits

Annotation   Description                                                F1     Size
UNARY-DT     mark demonstratives as DT^U ("the X" vs. "those")          80.4   8.1K
UNARY-RB     mark phrasal adverbs as RB^U ("quickly" vs. "very")        80.5   8.1K
TAG-PA       mark tags with non-canonical parents ("not" is an RB^VP)   81.2   8.5K
SPLIT-AUX    mark auxiliary verbs with -AUX [cf. Charniak 97]           81.6   9.0K
SPLIT-CC     separate "but" and "&" from other conjunctions             81.7   9.1K
SPLIT-%      "%" gets its own tag                                       81.8   9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples: possessive NPs; finite vs. infinite VPs; lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower attachment sites: contains a verb; is (non-)recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs

Annotation     F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

[Figure: a PP attachment site under VP marked v (dominates a verb) vs. one under NP marked -v]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY

A lexicalized binary rule carries head words: (VP → VBD[saw] NP[her]) is an instance of the rule (VP → VBD NP)[saw], building X[h] over span [i, j] from Y[h] over [i, k] and Z[h'] over [k, j], with the head h inherited from one of the children:

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return the max over splits k, rules X → Y Z, and other heads w of:
      score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)   // head from left child
      score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)   // head from right child
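A quick way to see where the O(n^5) cost quoted on the next slide comes from: there are O(n^2) spans (i, j), and within each span the recurrence considers O(n) split points k, O(n) positions for the head h, and O(n) positions for the other head w, so naive lexicalized CKY does O(n^2) * O(n^3) = O(n^5) work (times a grammar-dependent constant), versus O(n^3) for unlexicalized CKY.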
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n^5) CKY
• Remember only a few hypotheses for each span <i, j>
• If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
• Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
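A per-cell beam is simple to state in code. This illustrative sketch (our own, not Collins's actual implementation) keeps the K highest-probability labeled hypotheses for a chart cell:

import heapq

def prune_cell(cell, K=10):
    # cell: list of (prob, label, backpointer) hypotheses for one span <i, j>;
    # keep only the K most probable.
    return heapq.nlargest(K, cell, key=lambda h: h[0])

cell = [(0.105, "VP", None), (0.0049, "NP", None), (0.0105, "S", None)]
print(prune_cell(cell, K=2))   # [(0.105, 'VP', None), (0.0105, 'S', None)]

With at most K hypotheses surviving in each child cell, combining the two children of a span across its O(n) split points costs at most O(nK^2) rule applications, which is the bound quoted above.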
Parameter Estimation

A Model from Charniak (1997)

Other Details
Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve the statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering
Manual Splits
• Manually split categories:
• NP: subject vs. object
• DT: determiners vs. demonstratives
• IN: sentential vs. prepositional
• Advantages: fairly compact grammar; linguistic motivations
• Disadvantages: performance leveled out; manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
• Can learn with EM, like Forward-Backward for HMMs: Forward ↔ Outside, Backward ↔ Inside

[Figure: a bracketed tree over "He was right" with latent subcategory variables X1–X7 at its nodes]
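For the record, the EM updates replace the HMM backward and forward scores with inside and outside scores; a standard formulation (sketched here, since the slide only gestures at it) for a sentence $w = w_1 \dots w_n$:

$$\beta_A(i,j) = P\!\left(A \Rightarrow^* w_{i+1}\cdots w_j\right) \quad \text{(inside, like backward)}$$
$$\alpha_A(i,j) = P\!\left(S \Rightarrow^* w_1\cdots w_i \; A \; w_{j+1}\cdots w_n\right) \quad \text{(outside, like forward)}$$
$$\mathbb{E}\big[\mathrm{count}(A\to B\,C)\big] = \frac{1}{P(w)} \sum_{0 \le i < k < j \le n} \alpha_A(i,j)\; q(A\to B\,C)\; \beta_B(i,k)\; \beta_C(k,j)$$

The expected counts are then renormalized to give new rule probabilities, exactly as in Forward-Backward.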
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages: automatically learned
• Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7
Refinement of the DT tag

[Figure: DT split into DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories; start with two per non-terminal, then keep splitting; initialize each EM run with the output of the last.
Adaptive Splitting [Petrov et al. 06]
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split:

loss(split) = (data likelihood with split reversed) / (data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed
Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al '05      86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al. 06         90.2              89.7
Hierarchical Pruning
Parse multiple times with grammars at different levels of granularity:
coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … … …
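The mechanism behind each pass is a projection from fine symbols back to coarse ones. A tiny illustrative Python sketch (names and the threshold are our assumptions, not the actual Berkeley parser API):

def allowed_fine_symbols(coarse_posteriors, projection, threshold=1e-4):
    # projection: fine symbol -> coarse symbol, e.g. "NP-3" -> "NP".
    # A fine symbol survives in a chart cell only if its coarse projection
    # received enough posterior mass there during the previous, coarser pass.
    return {fine for fine, coarse in projection.items()
            if coarse_posteriors.get(coarse, 0.0) > threshold}

proj = {"NP-1": "NP", "NP-2": "NP", "VP-1": "VP", "VP-2": "VP"}
print(allowed_fine_symbols({"NP": 0.9, "VP": 1e-6}, proj))   # {'NP-1', 'NP-2'}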
Bracket Posteriors

Parsing time with hierarchical coarse-to-fine pruning: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error)
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPIN NP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN V NPINNP
DT NNDT NNthe
boyput
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNthe
boy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNलड़क न रखाकछआकालीन
ऊपर
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

Binary step (for each span, each split point, and each rule A → B C):

prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
if (prob > score[begin][end][A])
    score[begin][end][A] = prob
    back[begin][end][A] = new Triple(split, B, C)

Unary step (close each cell under the unary rules):

boolean added = true
while added
    added = false
    for A, B in nonterms
        prob = P(A → B) * score[begin][end][B]
        if (prob > score[begin][end][A])
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true

Filling the chart for "fish people fish tanks" span by span gives the following cells (each cell lists its best score per label, after the unary step; probabilities restored with decimal points):

Span 1:
[0,1] fish:    N → fish 0.2    V → fish 0.6    NP → N 0.14   VP → V 0.06   S → VP 0.006
[1,2] people:  N → people 0.5  V → people 0.1  NP → N 0.35   VP → V 0.01   S → VP 0.001
[2,3] fish:    N → fish 0.2    V → fish 0.6    NP → N 0.14   VP → V 0.06   S → VP 0.006
[3,4] tanks:   N → tanks 0.2   V → tanks 0.3   NP → N 0.14   VP → V 0.03   S → VP 0.003

Span 2:
[0,2] fish people:  NP → NP NP 0.0049   VP → V NP 0.105   S → VP 0.0105
[1,3] people fish:  NP → NP NP 0.0049   VP → V NP 0.007   S → NP VP 0.0189
[2,4] fish tanks:   NP → NP NP 0.00196  VP → V NP 0.042   S → VP 0.0042

(Before the unary step, [0,2] and [2,4] hold S → NP VP 0.00126 and 0.00378; the unary S → VP then improves them to 0.0105 and 0.0042, while in [1,3] the binary S → NP VP 0.0189 survives.)

Span 3:
[0,3] fish people fish:   NP → NP NP 0.0000686   VP → V NP 0.00147    S → NP VP 0.000882
[1,4] people fish tanks:  NP → NP NP 0.0000686   VP → V NP 0.000098   S → NP VP 0.01323

Span 4:
[0,4] fish people fish tanks:  NP → NP NP 0.0000009604   VP → V NP 0.00002058   S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse
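The pseudocode above translates almost line for line into a runnable program. Below is a minimal Python sketch; the grammar and the buildTree idea are from the slides, while the data layout, function names, and tuple-encoded trees are assumptions of this sketch:

from collections import defaultdict

# Toy grammar from the slides (probabilities restored with decimal points).
binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
          ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
          ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexical = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def cky(words):
    n = len(words)
    score = defaultdict(float)      # (begin, end, label) -> best probability
    back = {}                       # (begin, end, label) -> backpointer

    def apply_unaries(b, e):        # close the cell under unary rules
        added = True
        while added:
            added = False
            for (A, B), p in unary.items():
                prob = p * score[b, e, B]
                if prob > score[b, e, A]:
                    score[b, e, A], back[b, e, A] = prob, B
                    added = True

    for i, w in enumerate(words):   # lexical step
        for (A, word), p in lexical.items():
            if word == w:
                score[i, i + 1, A] = p
        apply_unaries(i, i + 1)

    for span in range(2, n + 1):    # binary step, smallest spans first
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
            apply_unaries(begin, end)
    return score, back

def build_tree(back, words, begin, end, A):
    bp = back.get((begin, end, A))
    if bp is None:                  # lexical cell: A covers one word
        return (A, words[begin])
    if isinstance(bp, str):         # unary backpointer
        return (A, build_tree(back, words, begin, end, bp))
    split, B, C = bp
    return (A, build_tree(back, words, begin, split, B),
               build_tree(back, words, split, end, C))

words = "fish people fish tanks".split()
score, back = cky(words)
print(score[0, len(words), "S"])    # ~0.00018522, matching the chart
print(build_tree(back, words, 0, len(words), "S"))

Running it reproduces the chart's top cell and recovers the best parse, (S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks)))).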
Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
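These metrics are mechanical to compute. Here is a small Python sketch, assuming brackets are (label, start, end) triples and that duplicate brackets are matched as a multiset:

from collections import Counter

def parseval(gold, cand):
    # Count brackets appearing in both lists (duplicates count separately).
    matched = sum((Counter(gold) & Counter(cand)).values())
    precision = matched / len(cand)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

gold = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)]
cand = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)]
p, r, f = parseval(gold, cand)
print(f"LP {p:.1%}  LR {r:.1%}  F1 {f:.1%}")   # LP 42.9%  LR 37.5%  F1 40.0%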
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But, in the simple case, it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree, given its POS)
A Case of PP Attachment Ambiguity
[figure: the two candidate parse trees; not recoverable from the extraction]

A Case of Coordination Ambiguity
[figure: the two candidate parse trees; not recoverable from the extraction]

Structural Preferences: Close Attachment
[figure: low- vs. high-attachment trees]
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label
[figure: a tree with an NP node inside an S, with rules S → NP VP and NP → DT NN]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
[figure; caption: in the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing

Horizontal Markovization
• Horizontal Markovization merges states
[plots: parsing F1 (roughly 70 to 74) and number of grammar symbols (0 to 12,000), each as a function of horizontal Markov order 0, 1, 2v, 2, inf]

Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation)
[figure: order-1 vs. order-2 (parent-annotated) trees]
[plots: parsing F1 (roughly 72 to 79) and number of grammar symbols (0 to 25,000), each as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model    F1     Size
v=h=2v   77.8   7.5K
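Parent annotation (vertical Markov order 2) is a simple tree transform. A minimal Python sketch, assuming trees are nested (label, children...) tuples with string leaves:

def parent_annotate(tree, parent="ROOT"):
    # Tree = (label, child, child, ...); leaves are plain word strings.
    if isinstance(tree, str):
        return tree
    label, *children = tree
    new_label = f"{label}^{parent}"          # e.g. NP under S becomes NP^S
    return (new_label, *[parent_annotate(c, label) for c in children])

t = ("S", ("NP", ("PRP", "he")),
          ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('PRP^NP', 'he')), ('VP^S', ('VBD^VP', 'was'), ...))

Reading rules off such transformed trees gives the order-2 vertical grammar; higher orders just annotate with longer ancestor strings.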
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
Tag Splits
• Problem: treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation     F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

[figure: an NP attaching into a VP, with sites marked v vs. -v]
A Fully Annotated Tree

Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example

[the bodies of these slides were figures and tables that did not survive extraction]
Lexicalized CKY

[figure: Y[h] over (i, k) combines with Z[h'] over (k, j) to form X[h] over (i, j), with h inside the left span and h' inside the right span]

(VP → VBD[saw] NP[her])  vs.  (VP → VBD NP)[saw]

bestScore(X, i, j, h)
    if (j == i)
        return score(X → s[i])
    else
        return max over split points k, head words w, and rules X → Y Z of:
            score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)   // head from the left child
            score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)   // head from the right child
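The two extra head indices are what make this expensive: plain CKY ranges over (begin, end, split), while lexicalized CKY also ranges over the head position of each child. A quick illustrative Python count (the function name is mine) of the index tuples the max above ranges over shows the Θ(n⁵) growth that motivates the pruning discussed below:

def lexicalized_cky_tuples(n):
    """Count (begin, end, split, left head, right head) combinations for a
    sentence of n words -- the index space lexicalized CKY searches,
    ignoring the (constant) number of grammar rules."""
    work = 0
    for begin in range(n):
        for end in range(begin + 2, n + 1):           # spans of length >= 2
            for split in range(begin + 1, end):
                work += (split - begin) * (end - split)   # head choices
    return work

for n in (8, 16, 32):
    print(n, lexicalized_cky_tuples(n))
# Doubling n multiplies the count by roughly 2**5 = 32: Theta(n^5).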
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[figure: the same Y[h] + Z[h'] → X[h] combination diagram]
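A per-cell beam is easy to sketch. A minimal version, assuming hypotheses in a cell are keyed by (label, head index); K and the names are illustrative:

import heapq

def prune_cell(cell, k=10):
    """cell: dict mapping (label, head_index) -> score; keep only the K best
    hypotheses for this span, as in per-cell beam pruning."""
    if len(cell) <= k:
        return cell
    return dict(heapq.nlargest(k, cell.items(), key=lambda kv: kv[1]))

With at most K hypotheses surviving in each child cell, combining one span costs O(nK^2) (n split points, K × K child pairs), which is what keeps the lexicalized parser roughly cubic overall.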
Parameter Estimation

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers

The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
• Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)

[figure: a tree over "He was right" with latent subcategory variables X1 … X7 at its nodes]
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit, while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7
Refinement of the DT tag
[figure: DT split into DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• Start with two subcategories per non-terminal, then keep splitting
• Initialize each EM run with the output of the last
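A minimal sketch of the split step alone (the EM re-estimation between splits is omitted; the grammar representation and names are assumptions of this sketch):

import itertools, random

def split_grammar(rules, nonterms):
    # rules: (lhs, rhs_tuple) -> prob, e.g. ("NP", ("DT", "NN")) -> 0.3.
    # Each nonterminal A becomes A-0 and A-1; the mass of a rule is shared
    # over all subsymbol combinations, with a little noise so the EM runs
    # that follow can break the symmetry between the new subcategories.
    def variants(X):
        return [f"{X}-0", f"{X}-1"] if X in nonterms else [X]
    new_rules = {}
    for (A, rhs), p in rules.items():
        for a in variants(A):
            combos = list(itertools.product(*(variants(X) for X in rhs)))
            for combo in combos:
                noise = 1 + random.uniform(-0.01, 0.01)
                new_rules[(a, combo)] = p / len(combos) * noise
    return new_rules

g = {("NP", ("DT", "NN")): 0.3, ("DT", ("the",)): 1.0, ("NN", ("man",)): 0.7}
g2 = split_grammar(g, {"NP", "DT", "NN"})      # two subsymbols per category
g4 = split_grammar(g2, {a for (a, _) in g2})   # split again: four subsymbols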
Adaptive Splitting [Petrov et al. 06]
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al. '05     86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al. 06         90.2              89.7
Hierarchical Pruning
• Parse multiple times, with grammars at different levels of granularity:

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … (and so on)

Bracket Posteriors
• 1621 min
• 111 min
• 35 min
• 15 min [91.2 F1] (no search error)
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNthe
boy
put
tortoisethethe rug
on
Example Application Machine Translation
bull The boy put the tortoise on the rug
bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position
S
NP VP PP
DT NN V NP IN NP
DT NN DT NNtheboy
put
tortoisethethe rug
on
S
NP VPPP
DT NN VNPINNP
DT NNDT NNलड़क न रखाकछआकालीन
ऊपर
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Figure: a tree with S → NP VP and NP → DT NN, illustrating that the NP's expansion is chosen independently of the surrounding S and VP]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%
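Counts like these are easy to replicate. A hedged sketch using NLTK's bundled 10% Penn Treebank sample (assuming nltk is installed and its 'treebank' corpus downloaded; exact percentages will differ from the full-WSJ numbers above):

from collections import Counter
import nltk
from nltk.corpus import treebank   # 10% PTB sample bundled with NLTK

def base(label):
    # strip function tags and indices, e.g. NP-SBJ-1 -> NP
    return label.split('-')[0].split('=')[0]

counts = {'ALL': Counter(), 'S': Counter(), 'VP': Counter()}
for tree in treebank.parsed_sents():
    for sub in tree.subtrees():
        for child in sub:
            if isinstance(child, nltk.Tree) and base(child.label()) == 'NP':
                expansion = ' '.join(base(c.label()) for c in child
                                     if isinstance(c, nltk.Tree))
                counts['ALL'][expansion] += 1
                if base(sub.label()) in ('S', 'VP'):
                    counts[base(sub.label())][expansion] += 1

for parent, c in counts.items():
    total = sum(c.values())
    print(parent, [(e, f"{100*n/total:.0f}%") for e, n in c.most_common(5)])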
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
[Figure lost; its caption: "In the PTB, this construction is for possessives"]
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states: the intermediate symbols created during binarization remember only a limited number of preceding siblings (see the binarization sketch below)
[Charts: parsing accuracy (F1 roughly 70–74) and grammar size (symbols, roughly 0–12,000) as a function of horizontal Markov order 0, 1, 2v, 2, ∞]
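A minimal sketch of what order-h horizontal Markovization does during binarization (our own function; real implementations such as NLTK's chomsky_normal_form differ in detail): intermediate symbols keep only the last h sibling labels, so states from different long rules collapse together.

def binarize(parent, children, h=2):
    # Right-binarize parent -> child_1 ... child_n. Intermediate symbols
    # record the parent plus only the last h siblings already generated,
    # so e.g. @VP[NP,PP] is shared by every VP rule whose recent history
    # is "... NP PP" (horizontal Markov order h).
    if len(children) <= 2:
        return [(parent, tuple(children))]
    rules = []
    history = [children[0]]
    prev = f"@{parent}[{','.join(history[-h:])}]"
    rules.append((parent, (children[0], prev)))
    for child in children[1:-2]:
        history.append(child)
        cur = f"@{parent}[{','.join(history[-h:])}]"
        rules.append((prev, (child, cur)))
        prev = cur
    rules.append((prev, (children[-2], children[-1])))
    return rules

print(binarize('VP', ['V', 'NP', 'PP', 'PP'], h=1))
# [('VP', ('V', '@VP[V]')), ('@VP[V]', ('NP', '@VP[NP]')), ('@VP[NP]', ('PP', 'PP'))]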
Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation; a relabeling sketch follows the charts)
[Figure: the same tree at Order 1 vs. Order 2]
[Charts: parsing accuracy (F1 roughly 72–79) and grammar size (symbols, roughly 0–25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]
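Parent annotation (vertical order 2) is just a relabeling pass over the treebank before rules are read off. A self-contained sketch on NLTK-style trees (parent_annotate is our own helper):

import nltk

def parent_annotate(tree, parent_label='ROOT'):
    # Vertical Markov order 2: each nonterminal also records its parent,
    # so subject and object NPs become NP^S and NP^VP, distinct symbols.
    # (Real systems often leave POS tags unannotated; this annotates everything.)
    if not isinstance(tree, nltk.Tree):
        return tree                      # a leaf (word): leave unchanged
    children = [parent_annotate(child, tree.label()) for child in tree]
    return nltk.Tree(f"{tree.label()}^{parent_label}", children)

t = nltk.Tree.fromstring("(S (NP (N people)) (VP (V fish) (NP (N tanks))))")
print(parent_annotate(t))
# (S^ROOT (NP^S (N^NP people)) (VP^S (V^VP fish) (NP^VP (N^NP tanks))))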
Model     F1     Size
v=h=2v    77.8   7.5K
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
Tag Splits
• Problem: treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
[Figure: NP vs. VP/PP attachment sites, marked v / -v by whether the intervening material dominates a verb]

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY
[Figure: combining Y[h] spanning (i,k) with Z[h'] spanning (k,j) into X[h]; e.g. (VP → VBD[saw] NP[her]) is written (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max over split points k and rules X -> Y Z of:
      max score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

With O(n³) cells (span times head position) and O(n²) work per cell (split point times the other head w), this is O(n⁵) overall.
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99] (a small sketch follows)
  • Essentially, run the O(n⁵) CKY
  • Remember only a few hypotheses for each span ⟨i,j⟩
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
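A hedged sketch of the per-cell beam idea (our own code, not the Collins parser's actual data structures): each span keeps only its K best-scored hypotheses, so the cross-products of child hypotheses stay bounded.

import heapq

def prune_cell(hypotheses, K=10):
    # hypotheses: list of (score, item) pairs built for one span <i,j>.
    # Keep only the K highest-scoring ones; everything else is dropped
    # before this span is used to build larger spans.
    return heapq.nlargest(K, hypotheses, key=lambda h: h[0])

cell = [(0.105, ('VP', 'saw')), (0.0049, ('NP', 'man')), (0.0105, ('S', 'saw'))]
print(prune_cell(cell, K=2))   # [(0.105, ('VP', 'saw')), (0.0105, ('S', 'saw'))]

Since each span contributes at most K items, combining two child spans costs at most K × K hypothesis pairs per split point, which is where the O(nK²) per-span bound comes from.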
Parameter Estimation

A Model from Charniak (1997)

Other Details
Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Analysis/Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering?
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Figure: a tree over "He was right" with latent symbols X1 … X7 at its nodes]
• Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside scores)
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7
Refinement of the DT tag
[Figure: DT split into DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two subcategories per non-terminal, then keep splitting
• initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al 06]
• Evaluate the loss in likelihood from removing each split (a small ranking sketch follows):
  loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
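A sketch of the merge ranking, with hypothetical inputs (per-split log-likelihoods computed on the training trees): reverse the fraction of splits whose removal costs the least likelihood.

def splits_to_merge(loglik_with, loglik_merged, fraction=0.5):
    # loglik_with[s]: data log-likelihood keeping split s;
    # loglik_merged[s]: log-likelihood with s reversed (its two subcategories
    # collapsed). The loss of a split is the log of the ratio defined above;
    # merge the smallest-loss half of the splits.
    losses = {s: loglik_with[s] - loglik_merged[s] for s in loglik_with}
    ranked = sorted(losses, key=losses.get)
    return ranked[:int(fraction * len(ranked))]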
Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al '05       86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al 06           90.2              89.7
Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity (a pruning sketch follows):
  coarse:         … QP NP VP …
  split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight: … … … … … … … …
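A sketch of the coarse-to-fine pruning step between two levels, under assumed inputs (posterior maps each span to the posterior probability of each coarse symbol, computed from the cheaper coarse parse; refinements maps each coarse symbol to its split subcategories):

def allowed_symbols(posterior, refinements, threshold=1e-4):
    # A refined symbol (e.g. NP3) is considered for a span at the next level
    # only if its coarse projection (NP) had posterior mass above the
    # threshold in the previous, cheaper parse of that span.
    allowed = {}
    for span, coarse_post in posterior.items():
        allowed[span] = {fine
                         for coarse, p in coarse_post.items() if p >= threshold
                         for fine in refinements[coarse]}
    return allowed

posterior = {(0, 2): {'NP': 0.91, 'VP': 0.00002}}
refinements = {'NP': ['NP1', 'NP2'], 'VP': ['VP1', 'VP2']}
print(allowed_symbols(posterior, refinements))   # {(0, 2): {'NP1', 'NP2'}}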
Bracket Posteriors
[Figure: bracket posterior charts as coarse-to-fine pruning proceeds; parsing time drops from 1621 min to 111 min to 35 min to 15 min, at 91.2 F1 with no search error]
Pre 1990 (ldquoClassicalrdquo) NLP Parsing
bull Goes back to Chomskyrsquos PhD thesis in 1950s
bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest
NP (DT) NN NNS rates
NP NN NNS NNS raises
NP NNP VBP interest
VP V NP VBZ rates
bull Used grammarproof systems to prove parses from words
bull This scaled very badly and didnrsquot give coverage For sentence
Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses
bull Simple 10 rule grammar 592 parses
bull Real-size broad-coverage grammar millions of parses
Classical NLP ParsingThe problem and its solution
bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust
bull In traditional systems commonly 30 of sentences in even an edited text would have no parse
bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to
choose between them
bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit
millions of parses for sentences but still quickly find the best parse(s)
Context Free Grammars and Ambiguities
20
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY
[Diagram: X[h] over span i…j is built from Y[h] over i…k and Z[h'] over k…j]
Notation: (VP → VBD[saw] NP[her]) is generated by the lexicalized rule (VP → VBD NP)[saw]
bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max over split points k and rules X -> Y Z of:
      max over w of  score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)    (head word h from the left child)
      max over w of  score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)    (head word h from the right child)
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n⁵) lexicalized CKY
  • Remember only a few hypotheses for each span ⟨i, j⟩
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
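A minimal sketch of the per-cell beam idea (toy scoring and hypothetical names, not Collins' actual implementation): each chart cell keeps only the K best (score, symbol, head) hypotheses, so combining two cells costs at most K² pairs per split point:

import heapq
from collections import defaultdict

K = 5   # beam width per span

def combine(cell_left, cell_right, binary_rules):
    """Combine two beams of (score, symbol, head) hypotheses.

    binary_rules maps (Ysym, Zsym) -> list of (Xsym, rule_prob, head_from_left).
    The work is bounded by K * K pairs, independent of sentence length.
    """
    out = []
    for sL, Y, hL in cell_left:
        for sR, Z, hR in cell_right:
            for X, p, head_left in binary_rules.get((Y, Z), []):
                out.append((sL * sR * p, X, hL if head_left else hR))
    return heapq.nlargest(K, out)   # prune back down to the beam

# toy usage: build VP[saw] from V[saw] and NP[her]
rules = defaultdict(list)
rules[("V", "NP")].append(("VP", 0.5, True))
left  = [(0.6, "V", "saw")]
right = [(0.14, "NP", "her")]
print(combine(left, right, rules))   # -> [(0.042..., 'VP', 'saw')]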
Parameter Estimation

A Model from Charniak (1997)

Other Details
Final Test Set Results
Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar

Annotation refines base treebank symbols to improve the statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Diagram: parse tree with latent subcategory symbols X1 … X7 over the sentence "He was right"]
• Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
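For completeness (this is the standard inside-outside recipe, not spelled out on the slide): with latent subcategories, the E-step accumulates expected rule counts from inside scores β and outside scores α computed on each training tree's known bracketing; for a binary rule A_x → B_y C_z, summing over all spans (i, j) and split points k,

$$\mathbb{E}[\text{count}(A_x \to B_y\, C_z)] = \frac{1}{\beta(\text{ROOT}, 0, n)} \sum_{i<k<j} \alpha(A_x, i, j)\; q(A_x \to B_y\, C_z)\; \beta(B_y, i, k)\; \beta(C_z, k, j)$$

and the M-step renormalizes these counts into new rule probabilities, exactly as Forward-Backward does for HMM transitions.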
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7
Refinement of the DT tag
[Diagram: DT hierarchically split into DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories. Start with two (per non-terminal), then keep splitting; initialize each EM run with the output of the last.
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al. 06]
Adaptive Splitting
• Evaluate the loss in likelihood from removing each split:

  loss = (data likelihood with split reversed) / (data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed
Adaptive Splitting Results
Model             F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories
Final Results
Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al. '05    86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al. 06        90.2             89.7
Hierarchical Pruning
Parse multiple times with grammars at different levels of granularity:
coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … … …
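The pruning test at each pass is the standard coarse-to-fine posterior threshold (stated here for completeness): a symbol X over span (i, j) survives into the next, finer pass only if

$$\mu(X, i, j) = \frac{\alpha(X, i, j)\,\beta(X, i, j)}{\beta(\text{ROOT}, 0, n)} > \varepsilon$$

where β and α are the inside and outside scores under the coarser grammar.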
Bracket Posteriors
Parse times at successive pruning levels: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error)
Context-Free Grammars
21
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
• Modifier scope within NPs: impractical design requirements / plastic cup holder
• Multiple gap constructions: The chicken is ready to eat / The contractors are rich enough to sue
• Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
• Dislocation / gapping
• Which book should Peter buy?
• A debate arose which continued until the election
• Binding
• Reference: The IRS audits itself
• Control
• I want to go
• I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing: Two problems to solve. 1. Repeated work…
Parsing: Two problems to solve. 2. Choosing the correct parse
• How do we work out the correct attachment?
• She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but …
• Words are good predictors of attachment
• Even absent full understanding
• Moscow sent more than 100,000 soldiers into Afghanistan …
• Sydney Water breached an agreement with NSW Health …
• Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic – or stochastic – context-free grammars (PCFGs)
• G = (Σ, N, S, R, P)
• Σ is a set of terminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• R is a set of rules/productions of the form X → γ
• P is a probability function
• P: R → [0,1]
• for each X ∈ N: ∑_{X → γ ∈ R} P(X → γ) = 1
• A grammar G generates a language model L: ∑_{g ∈ T} P(g) = 1, where T is the set of trees generated by G
PCFG Example: A Probabilistic Context-Free Grammar (PCFG)
S → NP VP 1.0
VP → Vi 0.4
VP → Vt NP 0.4
VP → VP PP 0.2
NP → DT NN 0.3
NP → NP PP 0.7
PP → P NP 1.0
Vi → sleeps 1.0
Vt → saw 1.0
NN → man 0.7
NN → woman 0.2
NN → telescope 0.1
DT → the 1.0
IN → with 0.5
IN → in 0.5
• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is p(t) = ∏_{i=1}^{n} q(αi → βi), where q(α → β) is the probability for rule α → β
44
Example of a PCFG
48
Probability of a Parse (under the same PCFG as above)
The man sleeps
t1 = [S [NP [DT The] [NN man]] [VP [Vi sleeps]]]
p(t1) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084

The man saw the woman with the telescope
t2 = [S [NP [DT The] [NN man]] [VP [VP [Vt saw] [NP [DT the] [NN woman]]] [PP [IN with] [NP [DT the] [NN telescope]]]]]
p(t2) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1 ≈ 1.5 × 10⁻⁵
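To show how p(t) is computed programmatically, here is a small sketch (the nested-tuple tree encoding and helper names are illustrative, not from the slides):

from math import prod

# Rule probabilities q(alpha -> beta) for the fragment of the grammar used by t1.
q = {("S", ("NP", "VP")): 1.0, ("NP", ("DT", "NN")): 0.3,
     ("VP", ("Vi",)): 0.4, ("DT", ("the",)): 1.0,
     ("NN", ("man",)): 0.7, ("Vi", ("sleeps",)): 1.0}

def rules(tree):
    # tree = (label, child1, ..., childk); leaves are plain strings.
    if isinstance(tree, str):
        return []
    label, *kids = tree
    rhs = tuple(k if isinstance(k, str) else k[0] for k in kids)
    return [(label, rhs)] + [r for k in kids for r in rules(k)]

# t1 with "the" lower-cased to match the lexicon entry DT -> the.
t1 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
print(prod(q[r] for r in rules(t1)))  # 0.084, as computed above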
PCFGs: Learning and Inference
• Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1}^{n} q(αi → βi)
• Learning: read the rules off of labeled sentences, use ML estimates for probabilities, and use all of our standard smoothing tricks
• Inference: for input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s)
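A minimal sketch of the learning step (the toy treebank and helper names are illustrative): maximum-likelihood estimation is just normalized rule counting, q(α → β) = Count(α → β) / Count(α).

from collections import Counter

rule_counts, lhs_counts = Counter(), Counter()

def count_rules(tree):
    # Same nested-tuple tree encoding as in the sketch above.
    if isinstance(tree, str):
        return
    label, *kids = tree
    rhs = tuple(k if isinstance(k, str) else k[0] for k in kids)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for k in kids:
        count_rules(k)

treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps"))),
    ("S", ("NP", ("DT", "the"), ("NN", "woman")), ("VP", ("Vi", "sleeps"))),
]
for t in treebank:
    count_rules(t)
q = {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}
print(q[("NN", ("man",))])  # 0.5: "man" is 1 of 2 NN rewrites in this toy treebank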
Grammar Transforms
51
Chomsky Normal Form
• All rules are of the form X → Y Z or X → w
• X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
• That is, it recognizes the same language
• But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules (n > 2) are divided by introducing new nonterminals
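A sketch of the n-ary binarization step alone (empty and unary removal omitted; the intermediate-symbol naming mirrors the VP_V convention used below, but is otherwise an assumption):

def binarize(rules):
    # rules: list of (lhs, rhs_tuple); returns an equivalent binary grammar.
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            new_sym = f"{lhs}_{rhs[0]}"        # e.g. VP -> V NP PP introduces VP_V
            out.append((lhs, (rhs[0], new_sym)))
            lhs, rhs = new_sym, rhs[1:]
        out.append((lhs, rhs))
    return out

print(binarize([("VP", ("V", "NP", "PP"))]))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]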
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (after removing the empty NP → e):
S → NP VP
S → VP
VP → V NP
VP → V
VP → V NP PP
VP → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (after removing the unary S → VP):
S → NP VP
VP → V NP
S → V NP
VP → V
S → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (after removing the unary S → V):
S → NP VP
VP → V NP
S → V NP
VP → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
V → fish
S → fish
V → tanks
S → tanks
P → with
Chomsky Normal Form steps (after removing the unary VP → V):
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
Chomsky Normal Form steps (after removing the remaining unaries NP → NP, NP → PP, NP → N, PP → P):
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form steps (after binarizing the ternary rules with new symbols VP_V and S_V; the grammar is now in CNF):
S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice full Chomsky Normal Form is a pain
• Reconstructing n-aries is easy
• Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
An example: before binarization…
[ROOT [S [NP [N people]] [VP [V fish] [NP [N tanks]] [PP [P with] [NP [N rods]]]]]]

After binarization…
[ROOT [S [NP [N people]] [VP [V fish] [VP_V [NP [N tanks]] [PP [P with] [NP [N rods]]]]]]]
Treebank: empties and unaries
PTB Tree: [ROOT [S-HLN [NP-SBJ [-NONE- e]] [VP [VB Atone]]]]
NoFuncTags: [ROOT [S [NP [-NONE- e]] [VP [VB Atone]]]]
NoEmpties: [ROOT [S [VP [VB Atone]]]]
NoUnaries (high): [ROOT [S Atone]]
NoUnaries (low): [ROOT [VB Atone]]
Parsing
66
Constituency Parsing
fish people fish tanks
PCFG rules, each with a probability θi:
S → NP VP θ0
NP → NP NP θ1
…
N → fish θ42
N → people θ43
V → fish θ44
…
[Example tree: e.g. [S [NP [N fish] [N people]] [VP [V fish] [NP [N tanks]]]]]
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (max) scores, e.g. for the word cells:
people: NP 0.35, V 0.1, N 0.5
fish: VP 0.06, NP 0.14, V 0.6, N 0.2
The grammar:
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
• Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
• Use fenceposts
• Doesn't increase complexity; essentially like unaries
• Binarization is vital
• Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k and X → Y Z of
      q(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A → words[i] in grammar
        score[i][i+1][A] = P(A → words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A → B in grammar
          prob = P(A → B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
The CKY algorithm (1960/1965) … extended to unaries
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A → B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
The CKY algorithm (1960/1965) … extended to unaries
The grammar: binary, no epsilons
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
Working through the chart for "fish people fish tanks" (fencepost positions 0–4; cell [i,j] covers words i+1 … j):

Lexical scores (score[i][i+1][A] = P(A → words[i])):
[0,1] fish: N 0.2, V 0.6
[1,2] people: N 0.5, V 0.1
[2,3] fish: N 0.2, V 0.6
[3,4] tanks: N 0.2, V 0.3

After unary closure on the diagonal ("handle unaries"):
[0,1] fish: N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
[1,2] people: N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001
[2,3] fish: N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
[3,4] tanks: N 0.2, V 0.3, NP 0.14, VP 0.03, S 0.003

Span-2 cells from binary rules (prob = score[begin][split][B] · score[split][end][C] · P(A → B C)):
[0,2]: NP → NP NP 0.0049, VP → V NP 0.105, S → NP VP 0.00126
[1,3]: NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
[2,4]: NP → NP NP 0.00196, VP → V NP 0.042, S → NP VP 0.00378

After unary closure over the span-2 cells (S → VP wins in two of them):
[0,2]: NP 0.0049, VP 0.105, S 0.0105 (via S → VP)
[1,3]: NP 0.0049, VP 0.007, S 0.0189 (via S → NP VP)
[2,4]: NP 0.00196, VP 0.042, S 0.0042 (via S → VP)

Span-3 cells:
[0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
[1,4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

Span-4 cell:
[0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse.
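As a cross-check on the chart above, here is a minimal runnable transcription of the pseudocode into Python (variable names and data layout are my own; backpointers are omitted, so it computes Viterbi scores only):

from collections import defaultdict

# Toy grammar from the slides: unary rules, binary rules, and the lexicon.
unary = {("NP", "N"): 0.7, ("VP", "V"): 0.1, ("S", "VP"): 0.1}
binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
          ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
          ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def unary_closure(cell):
    # Re-apply unary rules until no score improves (the "handle unaries" loop).
    added = True
    while added:
        added = False
        for (a, b), p in unary.items():
            if cell[b] > 0 and p * cell[b] > cell[a]:
                cell[a] = p * cell[b]
                added = True

def cky(words):
    n = len(words)
    score = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (a, word), p in lexicon.items():
            if word == w:
                score[i][i + 1][a] = p
        unary_closure(score[i][i + 1])
    for span in range(2, n + 1):
        for begin in range(n - span + 1):
            end = begin + span
            cell = score[begin][end]
            for split in range(begin + 1, end):
                for (a, b, c), p in binary.items():
                    prob = score[begin][split][b] * score[split][end][c] * p
                    if prob > cell[a]:
                        cell[a] = prob
            unary_closure(cell)
    return score

chart = cky("fish people fish tanks".split())
print(chart[0][4]["S"])   # ~0.00018522, matching the final cell above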
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
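The metric itself is a few lines; a sketch assuming brackets are represented as (label, start, end) triples:

def prf1(gold, guess):
    # gold, guess: sets of labeled brackets (label, start, end).
    correct = len(gold & guess)
    p = correct / len(guess)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
guess = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
         ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(prf1(gold, guess))  # (0.4285..., 0.375, 0.4) as in the slide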
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
• Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
• A PCFG gives some idea of the plausibility of a parse
• But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
• But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
• (A word is independent of the rest of the tree, given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences: Close Attachment
93
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability
94
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Figure: an NP node inside an S; rules S → NP VP, NP → DT NN]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)
Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%
Non-Independence II
• Symptoms of overly strong assumptions:
• Rewrites get used where they don't belong
• (In the PTB, this construction is for possessives)
Advanced Unlexicalized Parsing
99
Horizontal Markovization
• Horizontal Markovization merges states
[Charts: parsing accuracy (roughly 70–74 F1) and grammar size in symbols (roughly 0–12,000) vs. horizontal Markov order 0, 1, 2v, 2, ∞]
Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation)
• Order 1 vs. Order 2
[Charts: parsing accuracy (roughly 72–79 F1) and grammar size in symbols (roughly 0–25,000) vs. vertical Markov order 1, 2v, 2, 3v, 3]
Base model (v = 2, h = 2v): F1 77.8, Size 7.5K
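Order-2 vertical Markovization (parent annotation) is a one-pass tree transform; a sketch over the nested-tuple tree encoding used earlier:

def parent_annotate(tree, parent="ROOT"):
    # Splits each nonterminal by its parent: NP under S becomes NP^S.
    if isinstance(tree, str):
        return tree  # words are left unannotated
    label, *kids = tree
    return (f"{label}^{parent}", *[parent_annotate(k, label) for k in kids])

t = ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('PRP^NP', 'He')),
#  ('VP^S', ('VBD^VP', 'was'), ('ADJP^VP', ('JJ^ADJP', 'right'))))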
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U
Annotation F1 Size
Base 77.8 7.5K
UNARY 78.3 8.0K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution:
• Subdivide the IN tag
Annotation F1 Size
Previous 78.3 8.0K
SPLIT-IN 80.3 8.1K
Other Tag Splits (cumulative F1 and grammar size):
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those"): F1 80.4, 8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very"): F1 80.5, 8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP): F1 81.2, 8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]: F1 81.6, 9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions: F1 81.7, 9.1K
• SPLIT-%: "%" gets its own tag: F1 81.8, 9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
• Possessive NPs
• Finite vs. infinitival VPs
• Lexical heads
• Solution: annotate future elements into nodes
Annotation F1 Size
tag splits 82.3 9.7K
POSS-NP 83.1 9.8K
SPLIT-VP 85.7 10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
• Contains a verb
• Is (non)-recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs
Annotation F1 Size
Previous 85.7 10.5K
BASE-NP 86.0 11.7K
DOMINATES-V 86.9 14.1K
RIGHT-REC-NP 87.0 15.2K
[Figure: a PP attaching high to a VP (marked -v) vs. low to an NP (marked v)]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers
Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Lexicalised PCFGs
109
Heads in Context-Free Rules
110
Heads
111
Rules to Recover Heads: An Example for NPs
112
Rules to Recover Heads: An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
[Diagram: building X[h] from children Y[h] over (i,k) and Z[h'] over (k,j), where h is the head of the left child and h' the head of the right]
(VP → VBD[saw] NP[her]) is the same event as (VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return max of:
      max over k, X → Y Z, and right-child head w of
        score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, X → Y Z, and left-child head w of
        score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs
119
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n⁵) CKY
• Remember only a few hypotheses for each span <i,j>
• If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
• Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
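A sketch of the per-cell beam (the data layout is an assumption): after filling a span, keep only the K best-scoring items, so combining two beamed child spans at one split point costs O(K²).

import heapq

def prune_cell(cell, k=10):
    # cell: dict mapping (category, head) -> score; keep only the K best items.
    return dict(heapq.nlargest(k, cell.items(), key=lambda kv: kv[1]))

# With K items per child span and O(n) split points, each of the O(n^2)
# spans does O(n * K^2) work, keeping the whole parse roughly cubic.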
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Analysis / Evaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
The Game of Designing a Grammar
Annotation refines base treebank symbols to improve statistical fit of the grammar:
Parent annotation [Johnson '98]
Head lexicalization [Collins '99, Charniak '00]
Automatic clustering
Manual Splits
• Manually split categories:
• NP: subject vs. object
• DT: determiners vs. demonstratives
• IN: sentential vs. prepositional
• Advantages:
• Fairly compact grammar
• Linguistic motivations
• Disadvantages:
• Performance leveled out
• Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Tree with latent subcategories X1 … X7 over "He was right"]
Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
Automatic Annotation Induction
• Advantages:
• Automatically learned:
• Label all nodes with latent variables
• Same number k of subcategories for all categories
• Disadvantages:
• Grammar gets too large
• Most categories are oversplit, while others are undersplit
Model F1
Klein & Manning '03 86.3
Matsuzaki et al '05 86.7
Refinement of the DT tag
DT → DT-1, DT-2, DT-3, DT-4
Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al 06]
• Evaluate the loss in likelihood from removing each split = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 88.4
With 50% Merging 89.5
Number of Phrasal Subcategories
Final Results
Parser F1 (≤ 40 words) F1 (all words)
Klein & Manning '03 86.3 85.7
Matsuzaki et al '05 86.7 86.1
Collins '99 88.6 88.2
Charniak & Johnson '05 90.1 89.6
Petrov et al 06 90.2 89.7
Hierarchical Pruning
coarse: … QP NP VP …
split in two: … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four: … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … … …
Parse multiple times, with grammars at different levels of granularity
Bracket Posteriors
[Coarse-to-fine timing: 1621 min → 111 min → 35 min → 15 min at 91.2 F1 (no search error)]
Context-Free Grammars in NLP
bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols
bull C is a set of preterminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull L is the lexicon a set of items of the form X x
bull X isin C and x isin Σ
bull R is the grammar a set of items of the form X
bull X isin N and isin (N cup C)
bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)
bull We usually write e for an empty sequence rather than nothing22
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
A Context Free Grammar of English
23
Left-Most Derivations
24
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank: empties and unaries
PTB Tree:    (ROOT (S-HLN (NP-SBJ (-NONE- e)) (VP (VB Atone))))
NoFuncTags:  (ROOT (S (NP (-NONE- e)) (VP (VB Atone))))
NoEmpties:   (ROOT (S (VP (VB Atone))))
NoUnaries, keeping the high label:  (ROOT (S Atone))
NoUnaries, keeping the low label:   (ROOT (VB Atone))
Parsing

Constituency Parsing
Input sentence: fish people fish tanks
PCFG rules with probabilities θᵢ:
S → NP VP (θ₀)
NP → NP NP (θ₁)
…
N → fish (θ₄₂)
N → people (θ₄₃)
V → fish (θ₄₄)
…
[Figure: one parse of "fish people fish tanks": S → NP VP, with the NP spanning "fish people" and the VP spanning "fish tanks"]
Cocke-Kasami-Younger (CKY) Constituency Parsing
Input: fish people fish tanks
Viterbi (max) scores, e.g. for the diagonal cells over "people" (NP 0.35, V 0.1, N 0.5) and "fish" (VP 0.06, NP 0.14, V 0.6, N 0.2), under the grammar:
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k, X→YZ of
      q(X → Y Z) · bestScore(Y, i, k, s) · bestScore(Z, k+1, j, s)
The CKY algorithm (1960/1965) … extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A → words[i] in grammar
        score[i][i+1][A] = P(A → words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A→B in grammar
          prob = P(A→B) · score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] · score[split][end][C] · P(A→B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A→B) · score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
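To make the pseudocode concrete, a small runnable Python version, a sketch with my own dict-based grammar encoding; it reproduces the chart values in the worked example below:

from collections import defaultdict

lexical = {  # (A, word) -> prob, the lecture's toy grammar
    ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2, ("N", "rods"): 0.1,
    ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3, ("P", "with"): 1.0,
}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
binary = {
    ("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("VP", "V", "VP_V"): 0.3,
    ("VP", "V", "PP"): 0.1, ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
    ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0,
}

def cky(words):
    n = len(words)
    score = defaultdict(float)      # (begin, end, A) -> best probability
    back = {}                       # (begin, end, A) -> backpointer

    def apply_unaries(b, e):
        added = True
        while added:                # close under unary rules, keeping max scores
            added = False
            for (A, B), p in unary.items():
                prob = p * score[b, e, B]
                if prob > score[b, e, A]:
                    score[b, e, A], back[b, e, A] = prob, B
                    added = True

    for i, w in enumerate(words):   # lexical step
        for (A, word), p in lexical.items():
            if word == w:
                score[i, i + 1, A] = p
        apply_unaries(i, i + 1)

    for span in range(2, n + 1):    # binary step, smaller spans first
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
            apply_unaries(begin, end)
    return score, back

score, back = cky("fish people fish tanks".split())
print(score[0, 4, "S"])   # 0.00018522, matching the worked example below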
The grammar: binary, no epsilons
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
Worked example: filling the CKY chart for "fish people fish tanks" (cells score[begin][end], fenceposts 0–4; the grammar and pseudocode panels repeat alongside each step in the slides).

Lexical step, score[i][i+1][A] = P(A → words[i]):
  [0,1] fish:   N 0.2, V 0.6
  [1,2] people: N 0.5, V 0.1
  [2,3] fish:   N 0.2, V 0.6
  [3,4] tanks:  N 0.2, V 0.3

After handling unaries on the diagonal:
  [0,1]: NP 0.14, VP 0.06, S 0.006
  [1,2]: NP 0.35, VP 0.01, S 0.001
  [2,3]: NP 0.14, VP 0.06, S 0.006
  [3,4]: NP 0.14, VP 0.03, S 0.003

Span 2, binary rules then unaries:
  [0,2]: NP→NP NP 0.0049, VP→V NP 0.105, S→NP VP 0.00126, improved to S→VP 0.0105
  [1,3]: NP→NP NP 0.0049, VP→V NP 0.007, S→NP VP 0.0189
  [2,4]: NP→NP NP 0.00196, VP→V NP 0.042, S→NP VP 0.00378, improved to S→VP 0.0042

Span 3:
  [0,3]: NP→NP NP 0.0000686, VP→V NP 0.00147, S→NP VP 0.000882
  [1,4]: NP→NP NP 0.0000686, VP→V NP 0.000098, S→NP VP 0.01323

Span 4:
  [0,4]: NP→NP NP 0.0000009604, VP→V NP 0.00002058, S→NP VP 0.00018522
Call buildTree(score, back) to get the best parse (a possible implementation is sketched below).
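buildTree itself is not spelled out on the slides; a plausible sketch, assuming the score/back maps produced by the Python CKY sketch above (unary backpointers are bare labels, binary ones are (split, B, C) triples):

def build_tree(score, back, begin, end, label, words):
    bp = back.get((begin, end, label))
    if bp is None:                       # lexical cell: label -> words[begin]
        return (label, words[begin])
    if isinstance(bp, str):              # unary backpointer: label -> bp
        return (label, build_tree(score, back, begin, end, bp, words))
    split, B, C = bp                     # binary backpointer: label -> B C
    return (label,
            build_tree(score, back, begin, split, B, words),
            build_tree(score, back, split, end, C, words))

words = "fish people fish tanks".split()
print(build_tree(score, back, 0, len(words), "S", words))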
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
(The sketch below reproduces these numbers.)
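The same computation as a small Python sketch, with brackets encoded as (label, start, end) triples (encoding mine):

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}

matched = len(gold & cand)                           # 3 brackets agree
precision = matched / len(cand)                      # 3/7 = 42.9%
recall = matched / len(gold)                         # 3/8 = 37.5%
f1 = 2 * precision * recall / (precision + recall)   # 40.0%
print(f"LP={precision:.1%} LR={recall:.1%} F1={f1:.1%}")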
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information (a word is independent of the rest of the tree given its POS)
A Case of PP Attachment Ambiguity
[parse tree figures]

A Case of Coordination Ambiguity
[parse tree figures]
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
  • So the two analyses receive the same probability
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Figure: an S node expanding as S → NP VP, with the NP expanding as NP → DT NN]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e. subjects vs. objects):

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%
Non-Independence II
• Symptoms of overly strong assumptions: rewrites get used where they don't belong
• In the PTB, this construction is for possessives
Advanced Unlexicalized Parsing

Horizontal Markovization
• Horizontal Markovization merges states
[Charts: parsing F1 (roughly 70–74) and number of grammar symbols (0–12,000), each as a function of horizontal Markov order ∈ {0, 1, 2v, 2, ∞}]
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e. parent annotation, sketched below)
[Figure: Order 1 vs. Order 2 annotated trees. Charts: parsing F1 (roughly 72–79) and number of grammar symbols (0–25,000), each as a function of vertical Markov order ∈ {1, 2v, 2, 3v, 3}]

Model       F1    Size
v = h = 2v  77.8  7.5K
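A minimal sketch of parent annotation (order-2 vertical markovization) over the tuple-encoded trees used earlier; the function name and the ^-separator are my own choices:

def parent_annotate(tree, parent="ROOT"):
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children[0])                  # leave POS tags unsplit here
    new_label = f"{label}^{parent}"                  # e.g. NP^S vs. NP^VP
    return (new_label,) + tuple(parent_annotate(c, label) for c in children)

t = ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"), ("NP", ("N", "tanks"))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('N', 'people')), ('VP^S', ('V', 'fish'), ('NP^VP', ('N', 'tanks'))))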
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those") (F1 80.4, Size 8.1K)
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very") (F1 80.5, Size 8.1K)
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP) (F1 81.2, Size 8.5K)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97] (F1 81.6, Size 9.0K)
• SPLIT-CC: separate "but" and "&" from other conjunctions (F1 81.7, Size 9.1K)
• SPLIT-%: "%" gets its own tag (F1 81.8, Size 9.3K)
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
[Figure: a VP-attached PP site marked v, an NP-attached site marked -v]

Annotation    F1    Size
Previous      85.7  10.5K
BASE-NP       86.0  11.7K
DOMINATES-V   86.9  14.1K
RIGHT-REC-NP  87.0  15.2K
A Fully Annotated Tree

Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY
[Figure: X[h] → Y[h] Z[h'] built over span (i, j) at split point k, with h the head of X and Y and h' the head of Z; e.g. (VP → VBD[saw] NP[her]) vs. (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return max over k, w, X→YZ of:
      score(X[h] → Y[h] Z[w]) · bestScore(Y, i, k, h) · bestScore(Z, k, j, w)
      score(X[h] → Y[w] Z[h]) · bestScore(Y, i, k, w) · bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99] (sketched below)
  • Essentially, run the O(n⁵) CKY
  • Remember only a few hypotheses for each span ⟨i, j⟩
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
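A tiny sketch of the per-cell beam idea (illustrative names and representation, not the Collins parser's actual data structures):

import heapq

def prune_cell(cell, K=10):
    """cell: dict mapping hypothesis label -> score. Keep only the K best entries."""
    best = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(best)

cell = {"NP": 0.0049, "VP": 0.105, "S": 0.0105, "VP_V": 0.0007}
print(prune_cell(cell, K=2))   # {'VP': 0.105, 'S': 0.0105}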
Parameter Estimation

A Model from Charniak (1997)

Other Details
Final Test Set Results

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve the statistical fit of the grammar
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Figure: a binary tree over "He was right" with latent node labels X1…X7]
• Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)

Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7
Refinement of the DT tag
[Figure: DT split hierarchically into DT-1, DT-2, DT-3, DT-4]
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • Start with two subcategories per non-terminal, then keep splitting
  • Initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al. 06]
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
Adaptive Splitting Results

Model             F1
Previous          88.4
With 50% Merging  89.5

Number of Phrasal Subcategories
Final Results

Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al. '05    86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al. 06        90.2             89.7
Hierarchical Pruning
Parse multiple times with grammars at different levels of granularity (the pruning step between passes is sketched below):
coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … … …
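A hedged sketch of the pruning step between passes, assuming chart posteriors from the coarser grammar and a fine-to-coarse projection map are computed elsewhere:

def build_pruning_mask(coarse_posterior, projection, fine_labels, threshold=1e-4):
    """coarse_posterior: dict (begin, end, coarse_label) -> posterior probability.
    projection: dict fine_label -> coarse_label, e.g. 'NP_3' -> 'NP'.
    Returns the set of fine chart items the next, finer pass is allowed to score."""
    allowed = set()
    for fine in fine_labels:
        coarse = projection[fine]
        for (b, e, lab), post in coarse_posterior.items():
            if lab == coarse and post >= threshold:
                allowed.add((b, e, fine))
    return allowed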
Bracket Posteriors
Parsing times as coarser passes prune the chart: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error)
Left-Most Derivations

Properties of CFGs
Attachment ambiguities
• A key parsing decision is how we 'attach' various constituents
  • PPs, adverbial or participial phrases, infinitives, coordinations, etc.
• Catalan numbers: Cₙ = (2n)! / [(n+1)! n!]
• An exponentially growing series, which arises in many tree-like contexts (computed in the sketch below)
  • E.g. the number of possible triangulations of a polygon with n+2 sides
• Turns up in triangulation of probabilistic graphical models…
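A quick Python check of the formula (the number of binary bracketings of n+1 constituents):

from math import comb

def catalan(n):
    # C_n = (2n)! / ((n+1)! n!) = comb(2n, n) / (n+1)
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(1, 9)])
# [1, 2, 5, 14, 42, 132, 429, 1430] -- attachment ambiguities blow up fast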
Attachments
• I cleaned the dishes from dinner.
• I cleaned the dishes with detergent.
• I cleaned the dishes in my pajamas.
• I cleaned the dishes in the sink.
Syntactic Ambiguities I
• Prepositional phrases: They cooked the beans in the pot on the stove with handles.
• Particle vs. preposition: The lady dressed up the staircase.
• Complement structures: The tourists objected to the guide that they couldn't hear. / She knows you like the back of her hand.
• Gerund vs. participial adjective: Visiting relatives can be boring. / Changing schedules frequently confused passengers.
Syntactic Ambiguities II
• Modifier scope within NPs: impractical design requirements / plastic cup holder
• Multiple gap constructions: The chicken is ready to eat. / The contractors are rich enough to sue.
• Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall.
Non-Local Phenomena
• Dislocation / gapping:
  • Which book should Peter buy?
  • A debate arose which continued until the election.
• Binding:
  • Reference: The IRS audits itself.
• Control:
  • I want to go.
  • I want you to go.
A Fragment of a Noun Phrase Grammar

Extended Grammar with Prepositional Phrases

Verbs, Verb Phrases, and Sentences

PPs Modifying Verb Phrases

Complementizers and SBARs

More Verbs

Coordination

Much more remains…
Parsing: Two problems to solve
1. Repeated work…
2. Choosing the correct parse
• How do we work out the correct attachment?
  • She saw the man with a telescope.
• Is the problem 'AI complete'? Yes, but…
• Words are good predictors of attachment, even absent full understanding:
  • Moscow sent more than 100,000 soldiers into Afghanistan…
  • Sydney Water breached an agreement with NSW Health…
• Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar

Probabilistic – or stochastic – context-free grammars (PCFGs)
• G = (Σ, N, S, R, P)
  • Σ is a set of terminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • R is a set of rules/productions of the form X → γ
  • P is a probability function P: R → [0,1], with the rule probabilities for each left-hand side summing to one (checked in the sketch below)
• A grammar G generates a language model L: ∑_{γ ∈ Σ*} P(γ) = 1
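A small Python sketch (rule encoding mine) that checks this well-formedness condition, i.e. that each left-hand side's rule probabilities sum to one:

from collections import defaultdict

rules = {("S", ("NP", "VP")): 1.0,
         ("VP", ("Vi",)): 0.4, ("VP", ("Vt", "NP")): 0.4, ("VP", ("VP", "PP")): 0.2,
         ("NP", ("DT", "NN")): 0.3, ("NP", ("NP", "PP")): 0.7}

totals = defaultdict(float)
for (lhs, rhs), p in rules.items():
    assert 0.0 <= p <= 1.0                       # P: R -> [0,1]
    totals[lhs] += p
for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}"
print("well-formed:", dict(totals))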
PCFG Example: A Probabilistic Context-Free Grammar (PCFG)
S → NP VP 1.0
VP → Vi 0.4
VP → Vt NP 0.4
VP → VP PP 0.2
NP → DT NN 0.3
NP → NP PP 0.7
PP → P NP 1.0
Vi → sleeps 1.0
Vt → saw 1.0
NN → man 0.7
NN → woman 0.2
NN → telescope 0.1
DT → the 1.0
IN → with 0.5
IN → in 0.5

Probability of a Parse
• The probability of a tree t with rules α₁ → β₁, α₂ → β₂, …, αₙ → βₙ is p(t) = ∏_{i=1..n} q(αᵢ → βᵢ), where q(α → β) is the probability for rule α → β (computed in the sketch below)
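A tiny Python sketch (tree encoding mine) that computes p(t) under this grammar for t1, the tree of "The man sleeps":

q = {("S", ("NP", "VP")): 1.0, ("NP", ("DT", "NN")): 0.3, ("VP", ("Vi",)): 0.4,
     ("DT", ("the",)): 1.0, ("NN", ("man",)): 0.7, ("Vi", ("sleeps",)): 1.0}

def p(tree):
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):
        return q[(label, children)]              # lexical rule, e.g. NN -> man
    prob = q[(label, tuple(c[0] for c in children))]
    for c in children:
        prob *= p(c)
    return prob

t1 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
print(p(t1))   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084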
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Properties of CFGs
25
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
Filling the CKY chart for "fish people fish tanks" (grammar as given with the CKY algorithm below; decimal points restored):

Span 1, after the lexical step and unary closure:
  [0,1] fish:   N 0.2,  V 0.6,  NP 0.14, VP 0.06, S 0.006
  [1,2] people: N 0.5,  V 0.1,  NP 0.35, VP 0.01, S 0.001
  [2,3] fish:   N 0.2,  V 0.6,  NP 0.14, VP 0.06, S 0.006
  [3,4] tanks:  N 0.2,  V 0.3,  NP 0.14, VP 0.03, S 0.003

Span 2, binary rules (prob = score[begin][split][B] · score[split][end][C] · P(A → B C)):
  [0,2]: NP → NP NP 0.0049,  VP → V NP 0.105,  S → NP VP 0.00126
  [1,3]: NP → NP NP 0.0049,  VP → V NP 0.007,  S → NP VP 0.0189
  [2,4]: NP → NP NP 0.00196, VP → V NP 0.042,  S → NP VP 0.00378

Span 2, after the unary step (S → VP 0.1 wins in two cells):
  [0,2]: S → VP 0.0105;  [2,4]: S → VP 0.0042;  [1,3]: S → NP VP 0.0189 unchanged

Span 3:
  [0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147,  S → NP VP 0.000882
  [1,4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

Span 4:
  [0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse.
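For illustration, a minimal Python sketch of what such a backtrace might look like. The layout is an assumption mirroring the CKY pseudocode later in this section: a back map holding either a nonterminal B (unary A → B) or a (split, B, C) triple (binary A → B C); all names are illustrative.

  # Sketch: reconstruct the best tree from backpointers.
  # back: dict keyed by (begin, end, label); absent entry means a lexical cell.
  def build_tree(back, words, begin, end, label):
      entry = back.get((begin, end, label))
      if entry is None:                      # lexical cell: label covers one word
          return (label, words[begin])
      if isinstance(entry, str):             # unary rule label -> entry
          return (label, build_tree(back, words, begin, end, entry))
      split, B, C = entry                    # binary rule label -> B C
      return (label,
              build_tree(back, words, begin, split, B),
              build_tree(back, words, split, end, C))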
Evaluating constituency parsing

Gold standard brackets:
  S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets:
  S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
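In code, these metrics amount to intersecting bracket sets. A small Python check using the example's brackets (a sketch, not an evaluation tool such as evalb):

  # Brackets as (label, start, end); sets suffice for this example.
  gold = {("S",0,11), ("NP",0,2), ("VP",2,9), ("VP",3,9),
          ("NP",4,6), ("PP",6,9), ("NP",7,9), ("NP",9,10)}
  cand = {("S",0,11), ("NP",0,2), ("VP",2,10), ("VP",3,10),
          ("NP",4,6), ("PP",6,10), ("NP",7,10)}

  matched = len(gold & cand)            # 3 brackets match exactly
  lp = matched / len(cand)              # 3/7 = 42.9%
  lr = matched / len(gold)              # 3/8 = 37.5%
  f1 = 2 * lp * lr / (lp + lr)          # 40.0%
  print(f"LP={lp:.1%} LR={lr:.1%} F1={f1:.1%}")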
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust: usually admit everything, but with low probability
• Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse, but not a very good idea, because the independence assumptions are too strong
• Gives a probabilistic language model, but in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information (a word is independent of the rest of the tree given its POS)

A Case of PP Attachment Ambiguity
[example trees omitted]

A Case of Coordination Ambiguity
[example trees omitted]
Structural Preferences: Close Attachment
• Example: "John was believed to have been shot by Bill"
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

                NP → NP PP   NP → DT NN   NP → PRP
  All NPs          11%           9%           6%
  NPs under S       9%           9%          21%
  NPs under VP     23%           7%           4%

Non-Independence II
• Symptoms of overly strong assumptions: rewrites get used where they don't belong
• (In the PTB, the construction shown is used for possessives)
Advanced Unlexicalized Parsing

Horizontal Markovization
• Horizontal Markovization merges states
[charts omitted: parsing accuracy (F1 roughly 70-74) and grammar size (0-12,000 symbols) as a function of horizontal Markov order 0, 1, 2v, 2, ∞]

Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
• Order 1 vs. Order 2
[charts omitted: parsing accuracy (F1 roughly 72-79) and grammar size (0-25,000 symbols) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model    F1    Size
v=h=2v   77.8  7.5K
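The state merging can be made concrete in code. Below is a small Python sketch of markovized binarization: intermediate symbols remember only the last h already-generated siblings, so with small h many states collapse, which is what shrinks the symbol counts in the charts above. The @Parent[...] naming scheme is an assumption for illustration, not the slides' notation.

  def binarize(parent, children, h=1):
      """Binarize parent -> children, remembering at most h previous siblings."""
      rules = []
      if len(children) <= 2:
          rules.append((parent, tuple(children)))
          return rules
      def state(i):
          # intermediate symbol: parent plus the last h siblings seen so far
          kept = children[max(0, i - h):i] if h else []
          return "@%s[...%s]" % (parent, ",".join(kept))
      rules.append((parent, (children[0], state(1))))
      for i in range(1, len(children) - 2):
          rules.append((state(i), (children[i], state(i + 1))))
      rules.append((state(len(children) - 2), (children[-2], children[-1])))
      return rules

  # With h=1, @VP[...NP] is shared by every VP rule whose previous child was NP:
  for rule in binarize("VP", ["V", "NP", "PP", "ADVP"], h=1):
      print(rule)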
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1    Size
Base         77.8  7.5K
UNARY        78.3  8.0K
Tag Splits
• Problem: treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1    Size
Previous     78.3  8.0K
SPLIT-IN     80.3  8.1K
Other Tag Splits                                                        F1    Size
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")           80.4  8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")         80.5  8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)      81.2  8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]           81.6  9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions              81.7  9.1K
• SPLIT-%: "%" gets its own tag                                         81.8  9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples: possessive NPs; finite vs. infinite VPs; lexical heads
• Solution: annotate future elements into nodes

Annotation   F1    Size
tag splits   82.3  9.7K
POSS-NP      83.1  9.8K
SPLIT-VP     85.7  10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower attachment sites: contains a verb; is (non-)recursive; base NPs [cf. Collins 99]; right-recursive NPs

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K
A Fully Annotated Tree
[tree omitted]

Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
[head tables and example trees omitted]
Lexicalized CKY
[diagram: Y[h] spanning (i, k) combines with Z[h'] spanning (k, j) into X[h]; e.g. VP → VBD[saw] NP[her], written (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max over k, h', X -> Y Z of
      score(X[h] -> Y[h] Z[h']) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h'),
      score(X[h] -> Y[h'] Z[h]) * bestScore(Y, i, k, h') * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n⁵) CKY
• Remember only a few hypotheses for each span ⟨i, j⟩
• If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
• Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
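A per-cell beam can be sketched in a few lines of Python. The dict-of-scores cell layout is hypothetical, not the Collins parser's actual data structures.

  import heapq

  def prune_cell(cell, K=5):
      """cell: dict mapping labeled hypotheses to scores; keep the K best."""
      if len(cell) <= K:
          return cell
      kept = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
      return dict(kept)

With K hypotheses per child cell, each of the up-to-n split points contributes at most K·K combinations, which is where the O(nK²) per-span bound comes from.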
Parameter Estimation

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6

Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve the statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering?

Manual Splits
• Manually split categories: NP (subject vs. object), DT (determiners vs. demonstratives), IN (sentential vs. prepositional)
• Advantages: fairly compact grammar; linguistic motivations
• Disadvantages: performance leveled out; manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[example tree over "He was right" with latent subcategory variables X1…X7]
• Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages: automatically learned
• Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7
Refinement of the DT tag
[DT is split into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• Start with two subcategories per non-terminal, then keep splitting
• Initialize each EM run with the output of the last
Adaptive Splitting [Petrov et al. 06]
• Want to split complex categories more
• Idea: split everything, then roll back the splits that were least useful
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results
Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
[chart omitted]

Final Results
Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al. '05     86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al. 06         90.2              89.7
Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity:
  coarse:         … QP NP VP …
  split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight: … and so on

Bracket Posteriors
[posterior charts omitted]
• Pruning reduces parsing time: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error)
Attachment ambiguities
• A key parsing decision is how we 'attach' various constituents: PPs, adverbial or participial phrases, infinitives, coordinations, etc.
• The number of possible attachment structures grows with the Catalan numbers: Cₙ = (2n)! / ((n+1)! n!) (see the sketch after this list)
• An exponentially growing series, which arises in many tree-like contexts
• E.g., the number of possible triangulations of a polygon with n+2 sides
• Turns up in triangulation of probabilistic graphical models…
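A quick check of the formula in Python (math.comb is in the standard library):

  from math import comb

  def catalan(n):
      # C_n = (2n)! / ((n+1)! n!) = C(2n, n) / (n+1)
      return comb(2 * n, n) // (n + 1)

  print([catalan(n) for n in range(10)])
  # [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]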
Attachments
• I cleaned the dishes from dinner
• I cleaned the dishes with detergent
• I cleaned the dishes in my pajamas
• I cleaned the dishes in the sink
Syntactic Ambiguities I
• Prepositional phrases: They cooked the beans in the pot on the stove with handles
• Particle vs. preposition: The lady dressed up the staircase
• Complement structures: The tourists objected to the guide that they couldn't hear / She knows you like the back of her hand
• Gerund vs. participial adjective: Visiting relatives can be boring / Changing schedules frequently confused passengers

Syntactic Ambiguities II
• Modifier scope within NPs: impractical design requirements / plastic cup holder
• Multiple gap constructions: The chicken is ready to eat / The contractors are rich enough to sue
• Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
• Dislocation / gapping: Which book should Peter buy? / A debate arose which continued until the election
• Binding: reference: The IRS audits itself
• Control: I want to go / I want you to go
A Fragment of a Noun Phrase Grammar

Extended Grammar with Prepositional Phrases

Verbs, Verb Phrases, and Sentences

PPs Modifying Verb Phrases

Complementizers and SBARs

More Verbs

Coordination

Much more remains…

Parsing: two problems to solve: 1. Repeated work…
Parsing: two problems to solve: 2. Choosing the correct parse
• How do we work out the correct attachment?
• She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but…
• Words are good predictors of attachment, even absent full understanding:
• Moscow sent more than 100,000 soldiers into Afghanistan…
• Sydney Water breached an agreement with NSW Health…
• Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar

Probabilistic – or stochastic – context-free grammars (PCFGs)
• G = (T, N, S, R, P)
• T is a set of terminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• R is a set of rules/productions of the form X → γ
• P is a probability function, P: R → [0, 1], such that for each X ∈ N, ∑_{γ : X→γ ∈ R} P(X → γ) = 1
• A grammar G generates a language model L: ∑_{γ ∈ T*} P(γ) = 1
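As a concrete illustration, the example grammar on the next slide can be stored as a map from rules to probabilities, with the sum-to-one condition checked per left-hand side. A minimal Python sketch:

  from collections import defaultdict

  # Rules as (lhs, rhs-tuple) -> probability, using the PCFG example below.
  rules = {
      ("S",  ("NP", "VP")): 1.0,
      ("VP", ("Vi",)):      0.4,
      ("VP", ("Vt", "NP")): 0.4,
      ("VP", ("VP", "PP")): 0.2,
      ("NP", ("DT", "NN")): 0.3,
      ("NP", ("NP", "PP")): 0.7,
      ("PP", ("P", "NP")):  1.0,
  }

  totals = defaultdict(float)
  for (lhs, _), p in rules.items():
      totals[lhs] += p
  # Each left-hand side's rule probabilities must sum to 1.
  assert all(abs(t - 1.0) < 1e-9 for t in totals.values())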
PCFG Example: A Probabilistic Context-Free Grammar (PCFG)

S → NP VP     1.0        Vi → sleeps      1.0
VP → Vi       0.4        Vt → saw         1.0
VP → Vt NP    0.4        NN → man         0.7
VP → VP PP    0.2        NN → woman       0.2
NP → DT NN    0.3        NN → telescope   0.1
NP → NP PP    0.7        DT → the         1.0
PP → P NP     1.0        IN → with        0.5
                         IN → in          0.5

• The probability of a tree t with rules α₁ → β₁, α₂ → β₂, …, αₙ → βₙ is
  p(t) = ∏_{i=1..n} q(αᵢ → βᵢ), where q(α → β) is the probability for rule α → β
Example of a PCFG

Probability of a Parse
(Uses the PCFG above.)
• t₁ = the parse of "The man sleeps":
  [S [NP [DT The] [NN man]] [VP [Vi sleeps]]]
  p(t₁) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0 = 0.084

• t₂ = a parse of "The man saw the woman with the telescope" (PP attached to the VP):
  [S [NP [DT The] [NN man]] [VP [VP [Vt saw] [NP [DT the] [NN woman]]] [PP [IN with] [NP [DT the] [NN telescope]]]]]
  p(t₂) = 1.0 · 0.3 · 1.0 · 0.7 · 0.2 · 0.4 · 1.0 · 0.3 · 1.0 · 0.2 · 1.0 · 0.5 · 0.3 · 1.0 · 0.1 ≈ 1.512 × 10⁻⁵
PCFGs: Learning and Inference
• Model: the probability of a tree t with n rules αᵢ → βᵢ, i = 1..n, is p(t) = ∏ᵢ q(αᵢ → βᵢ)
• Learning: read the rules off of labeled sentences, use ML estimates for the probabilities, and use all of our standard smoothing tricks (a counting sketch follows)
• Inference: for an input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s)
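The learning step is just relative-frequency counting: q(α → β) = count(α → β) / count(α). A toy, unsmoothed Python sketch over trees encoded as nested tuples (an assumed encoding, not a treebank reader):

  from collections import Counter

  def count_rules(tree, rule_counts, lhs_counts):
      label, children = tree[0], tree[1:]
      if len(children) == 1 and isinstance(children[0], str):
          rhs = (children[0],)                  # preterminal -> word
      else:
          rhs = tuple(child[0] for child in children)
          for child in children:
              count_rules(child, rule_counts, lhs_counts)
      rule_counts[(label, rhs)] += 1
      lhs_counts[label] += 1

  def mle_pcfg(treebank):
      """treebank: iterable of trees like ("S", ("NP", ...), ("VP", ...))."""
      rule_counts, lhs_counts = Counter(), Counter()
      for tree in treebank:
          count_rules(tree, rule_counts, lhs_counts)
      return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}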
Grammar Transforms

Chomsky Normal Form
• All rules are of the form X → Y Z or X → w, with X, Y, Z ∈ N and w ∈ T
• A transformation to this form doesn't change the weak generative capacity of a CFG: it recognizes the same language, but maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules (n > 2) are divided by introducing new nonterminals (a small sketch of this step follows)
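The n-ary division step can be written directly. This sketch follows the slides' VP_V naming convention for the introduced nonterminals; empties and unaries are handled separately.

  def divide_nary(lhs, rhs):
      """Split an n-ary rule (n > 2) into a chain of binary rules."""
      rules = []
      while len(rhs) > 2:
          new_sym = "%s_%s" % (lhs, rhs[0])     # e.g. VP_V
          rules.append((lhs, (rhs[0], new_sym)))
          lhs, rhs = new_sym, rhs[1:]
      rules.append((lhs, tuple(rhs)))
      return rules

  print(divide_nary("VP", ("V", "NP", "PP")))
  # [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]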
A phrase structure grammar

S → NP VP          N → people
VP → V NP          N → fish
VP → V NP PP       N → tanks
NP → NP NP         N → rods
NP → NP PP         V → people
NP → N             V → fish
NP → ε             V → tanks
PP → P NP          P → with
Chomsky Normal Form steps (1): remove the empty NP → ε, adding variants without that NP:
S → NP VP, S → VP; VP → V NP, VP → V, VP → V NP PP, VP → V PP; NP → NP NP, NP → NP, NP → NP PP, NP → PP, NP → N; PP → P NP, PP → P; lexicon unchanged (N → people | fish | tanks | rods, V → people | fish | tanks, P → with)

Chomsky Normal Form steps (2): remove the unary S → VP by inlining VP's expansions into S:
S → NP VP; VP → V NP, S → V NP; VP → V, S → V; VP → V NP PP, S → V NP PP; VP → V PP, S → V PP; NP and PP rules and the lexicon as before

Chomsky Normal Form steps (3): remove the unary S → V, adding S → people | fish | tanks

Chomsky Normal Form steps (4): remove the unary VP → V, adding VP → people | fish | tanks

Chomsky Normal Form steps (5): remove the unaries NP → NP, NP → PP, NP → N, and PP → P, adding NP → P NP, PP → with, and NP → people | fish | tanks | rods (the N lexicon disappears)

Chomsky Normal Form steps (6): binarize the ternary rules:
VP → V NP PP becomes VP → V VP_V with VP_V → NP PP; S → V NP PP becomes S → V S_V with S_V → NP PP

The resulting grammar:
S → NP VP; S → V NP; S → V S_V; S_V → NP PP; S → V PP
VP → V NP; VP → V VP_V; VP_V → NP PP; VP → V PP
NP → NP NP; NP → NP PP; NP → P NP; PP → P NP
NP → people | fish | tanks | rods
V → people | fish | tanks; S → people | fish | tanks; VP → people | fish | tanks
P → with; PP → with
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain: reconstructing n-aries is easy, reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary: it just makes the algorithms cleaner and a bit quicker
An example: before binarization…
[tree: ROOT → S; S → NP VP; NP → N → people; VP → V NP PP, with V → fish, NP → N → tanks, PP → P NP, P → with, NP → N → rods]

After binarization…
[the same tree, with VP → V VP_V and VP_V → NP PP]
Treebank empties and unaries
[trees for "Atone": the PTB tree (ROOT → S-HLN → NP-SUBJ VP, with an empty -NONE- subject), then NoFuncTags, NoEmpties, and NoUnaries variants, with a High vs. Low choice for where to cut unary chains]

Parsing
Constituency Parsing
• Input: fish people fish tanks
• PCFG rules with probabilities θᵢ: S → NP VP (θ₀), NP → NP NP (θ₁), …, N → fish (θ₄₂), N → people (θ₄₃), V → fish (θ₄₄), …
[example parse tree omitted]
Cocke-Kasami-Younger (CKY) Constituency Parsing
• fish people fish tanks

Viterbi (Max) Scores
[partial chart: the cell for "people" holds NP 0.35, V 0.1, N 0.5; the cell for "fish" holds VP 0.06, NP 0.14, V 0.6, N 0.2; grammar as listed below]
Extended CKY parsing
• Unaries can be incorporated into the algorithm: messy, but doesn't increase algorithmic complexity
• Empties can be incorporated: use fenceposts; doesn't increase complexity, essentially like unaries
• Binarization is vital: without binarization you don't get parsing that is cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
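Memoizing this recursion on (X, i, j) is exactly what turns the exponential recursion into the CKY dynamic program below. A compact Python sketch with an assumed grammar encoding (binary maps X to (Y, Z, prob) triples; lexical maps (X, word) to prob):

  from functools import lru_cache

  def make_parser(binary, lexical, words):
      @lru_cache(maxsize=None)
      def best_score(X, i, j):
          # j == i: the span is the single word at position i
          if j == i:
              return lexical.get((X, words[i]), 0.0)
          best = 0.0
          for (Y, Z, q) in binary.get(X, []):
              for k in range(i, j):     # split: Y covers i..k, Z covers k+1..j
                  best = max(best, q * best_score(Y, i, k)
                                     * best_score(Z, k + 1, j))
          return best
      return best_score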
The CKY algorithm (1960/1965)… extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  // lexical step
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
  // binary rules, then unary closure, for each longer span
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
The grammar: binary, no epsilons

S → NP VP      0.9        N → people   0.5
S → VP         0.1        N → fish     0.2
VP → V NP      0.5        N → tanks    0.2
VP → V         0.1        N → rods     0.1
VP → V VP_V    0.3        V → people   0.1
VP → V PP      0.1        V → fish     0.6
VP_V → NP PP   1.0        V → tanks    0.3
NP → NP NP     0.1        P → with     1.0
NP → NP PP     0.2
NP → N         0.7
PP → P NP      1.0
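This grammar is small enough to run end to end. Below is a compact Python CKY (a sketch, not the slides' pseudocode; unaries handled by closure, backpointers omitted) that reproduces the chart values from the walkthrough earlier, e.g. score[0][4][S] = 0.9 · 0.0049 · 0.042 = 0.00018522.

  binary  = {("S", ("NP", "VP")): 0.9, ("VP", ("V", "NP")): 0.5,
             ("VP", ("V", "VP_V")): 0.3, ("VP", ("V", "PP")): 0.1,
             ("VP_V", ("NP", "PP")): 1.0, ("NP", ("NP", "NP")): 0.1,
             ("NP", ("NP", "PP")): 0.2, ("PP", ("P", "NP")): 1.0}
  unary   = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
  lexical = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
             ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
             ("V", "tanks"): 0.3, ("P", "with"): 1.0}

  def close_unaries(cell):
      # Repeatedly apply unary rules until no cell entry improves.
      changed = True
      while changed:
          changed = False
          for (A, B), q in unary.items():
              p = q * cell.get(B, 0.0)
              if p > cell.get(A, 0.0):
                  cell[A] = p
                  changed = True

  def cky(words):
      n = len(words)
      score = {}
      for i, w in enumerate(words):                  # lexical step
          cell = {A: p for (A, word), p in lexical.items() if word == w}
          close_unaries(cell)
          score[(i, i + 1)] = cell
      for span in range(2, n + 1):                   # longer spans
          for begin in range(0, n - span + 1):
              end = begin + span
              cell = {}
              for split in range(begin + 1, end):
                  left, right = score[(begin, split)], score[(split, end)]
                  for (A, (B, C)), q in binary.items():
                      p = q * left.get(B, 0.0) * right.get(C, 0.0)
                      if p > cell.get(A, 0.0):
                          cell[A] = p
              close_unaries(cell)
              score[(begin, end)] = cell
      return score

  chart = cky("fish people fish tanks".split())
  print(chart[(0, 4)]["S"])   # 0.00018522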
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Attachment ambiguities
bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc
bull Catalan numbers Cn
= (2n)[(n+1)n]
bull An exponentially growing series which arises in many tree-like contexts
bull Eg the number of possible triangulations of a polygon with n+2 sides
bull Turns up in triangulation of probabilistic graphical modelshellip
Attachments
bull I cleaned the dishes from dinner
bull I cleaned the dishes with detergent
bull I cleaned the dishes in my pajamas
bull I cleaned the dishes in the sink
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
• Label all nodes with latent variables; the same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit
Model F1
Klein & Manning '03 86.3
Matsuzaki et al. '05 86.7
Refinement of the DT tag
[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories. Start with two (per non-terminal), then keep splitting; initialize each EM run with the output of the last.
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
[Petrov et al. 06]
Adaptive Splitting
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 88.4
With 50% Merging 89.5
Number of Phrasal Subcategories
Final Results
Parser F1 (≤ 40 words) F1 (all words)
Klein & Manning '03 86.3 85.7
Matsuzaki et al. '05 86.7 86.1
Collins '99 88.6 88.2
Charniak & Johnson '05 90.1 89.6
Petrov et al. 06 90.2 89.7
Hierarchical Pruning
Parse multiple times with grammars at different levels of granularity:
coarse: … QP NP VP …
split in two: … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four: … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … (and so on)
Bracket Posteriors
[Figure: bracket posteriors at successive refinement levels; parsing time drops 1621 min → 111 min → 35 min → 15 min, the last at 91.2 F1 with no search error]
Attachments
• I cleaned the dishes from dinner
• I cleaned the dishes with detergent
• I cleaned the dishes in my pajamas
• I cleaned the dishes in the sink
Syntactic Ambiguities I
• Prepositional phrases: They cooked the beans in the pot on the stove with handles
• Particle vs. preposition: The lady dressed up the staircase
• Complement structures: The tourists objected to the guide that they couldn't hear / She knows you like the back of her hand
• Gerund vs. participial adjective: Visiting relatives can be boring / Changing schedules frequently confused passengers
Syntactic Ambiguities II
• Modifier scope within NPs: impractical design requirements / plastic cup holder
• Multiple gap constructions: The chicken is ready to eat / The contractors are rich enough to sue
• Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
• Dislocation / gapping:
  • Which book should Peter buy?
  • A debate arose which continued until the election
• Binding:
  • Reference: The IRS audits itself
• Control:
  • I want to go
  • I want you to go
A Fragment of a Noun Phrase Grammar
Extended Grammar with Prepositional Phrases
Verbs, Verb Phrases, and Sentences
PPs Modifying Verb Phrases
Complementizers and SBARs
More Verbs
Coordination
Much more remains…
Parsing: two problems to solve. 1: Repeated work…
Parsing: two problems to solve. 2: Choosing the correct parse
• How do we work out the correct attachment?
  • She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but …
• Words are good predictors of attachment, even absent full understanding:
  • Moscow sent more than 100,000 soldiers into Afghanistan …
  • Sydney Water breached an agreement with NSW Health …
• Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar

Probabilistic (or stochastic) context-free grammars (PCFGs)
• G = (T, N, S, R, P):
  • T is a set of terminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • R is a set of rules/productions of the form X → γ, where X ∈ N and γ is a sequence of terminals and nonterminals
  • P is a probability function P: R → [0,1], with ∑_{X → γ ∈ R} P(X → γ) = 1 for each X ∈ N
• A grammar G generates a language model L: ∑_{γ ∈ T*} P(γ) = 1
PCFG Example: A Probabilistic Context-Free Grammar (PCFG)
S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0
Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5
• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is
  p(t) = ∏_{i=1}^{n} q(αi → βi)
  where q(α → β) is the probability for rule α → β.
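As a quick check of this definition, here is a minimal Python sketch (the tree encoding and the q table are illustrative assumptions, not from the slides) that multiplies out the rule probabilities of the tree t1 = S(NP(DT the, NN man), VP(Vi sleeps)) analyzed below:

q = {('S', ('NP', 'VP')): 1.0, ('NP', ('DT', 'NN')): 0.3, ('VP', ('Vi',)): 0.4,
     ('DT', ('the',)): 1.0, ('NN', ('man',)): 0.7, ('Vi', ('sleeps',)): 1.0}

def p_tree(node):
    label, children = node
    if isinstance(children[0], str):                # preterminal: label -> word
        return q[label, children]
    prob = q[label, tuple(c[0] for c in children)]  # q(alpha -> beta) at this node
    for c in children:
        prob *= p_tree(c)
    return prob

t1 = ('S', (('NP', (('DT', ('the',)), ('NN', ('man',)))),
            ('VP', (('Vi', ('sleeps',)),))))
print(p_tree(t1))                                   # 1.0*0.3*1.0*0.7*0.4*1.0 ≈ 0.084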
Probability of a Parse (using the same PCFG as above)
t1 = "The man sleeps": S(NP(DT the, NN man), VP(Vi sleeps))
p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0 = 0.084

t2 = "The man saw the woman with the telescope" (PP attached to the VP):
S(NP(DT the, NN man), VP(VP(Vt saw, NP(DT the, NN woman)), PP(IN with, NP(DT the, NN telescope))))
p(t2) = 1.0 · (0.3 · 1.0 · 0.7) · 0.2 · (0.4 · 1.0) · (0.3 · 1.0 · 0.2) · (1.0 · 0.5) · (0.3 · 1.0 · 0.1) = 0.00001512
PCFGs: Learning and Inference
• Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1}^{n} q(αi → βi)
• Learning: read the rules off of labeled sentences, use ML estimates for probabilities, and use all of our standard smoothing tricks
• Inference: for an input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s)
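A hedged sketch of the learning step (the tree encoding is an assumption and smoothing is omitted): count each rule, then divide by the count of its left-hand side.

from collections import Counter

def ml_estimates(trees):
    # q(alpha -> beta) = Count(alpha -> beta) / Count(alpha)
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        label, children = node
        if isinstance(children[0], tuple):          # internal node
            rhs = tuple(child[0] for child in children)
            for child in children:
                visit(child)
        else:                                       # preterminal: label -> word
            rhs = children
        rule_counts[label, rhs] += 1
        lhs_counts[label] += 1
    for t in trees:
        visit(t)
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# One-tree example: counts S -> NP VP, NP -> people, VP -> V, V -> fish,
# each getting probability 1.0 here.
print(ml_estimates([('S', (('NP', ('people',)), ('VP', (('V', ('fish',)),))))]))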
Grammar Transforms
Chomsky Normal Form
• All rules are of the form X → Y Z or X → w, with X, Y, Z ∈ N and w ∈ T
• A transformation to this form doesn't change the weak generative capacity of a CFG:
  • That is, it recognizes the same language
  • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules (n > 2) are divided by introducing new nonterminals
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (1: remove the empty NP → e, adding rule variants with NP dropped)
S → NP VP
S → VP
VP → V NP
VP → V
VP → V NP PP
VP → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (2: remove the unary S → VP by cloning the VP rules onto S)
S → NP VP
VP → V NP
S → V NP
VP → V
S → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (3: remove the unary S → V by cloning the V lexical rules onto S)
S → NP VP
VP → V NP
S → V NP
VP → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
V → fish
S → fish
V → tanks
S → tanks
P → with
Chomsky Normal Form steps (4: remove the unary VP → V the same way)
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
Chomsky Normal Form steps (5: remove the remaining unaries NP → NP, NP → N, NP → PP, and PP → P)
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form steps (6: binarize the ternary rules with new symbols VP_V and S_V)
S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain:
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
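For the binarization step in isolation, here is a minimal right-binarization sketch in Python (the rule encoding is an assumption); it reproduces the VP → V VP_V, VP_V → NP PP split used above:

def binarize(rules):
    # Right-binarize: X -> Y1 Y2 ... Yn becomes X -> Y1 X_Y1, X_Y1 -> Y2 ...,
    # introducing one new symbol per step (book-keeping for a later detransform).
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            new_sym = lhs + '_' + rhs[0]
            out.append((lhs, (rhs[0], new_sym)))
            lhs, rhs = new_sym, rhs[1:]
        out.append((lhs, tuple(rhs)))
    return out

print(binarize([('VP', ('V', 'NP', 'PP'))]))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]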
An example: before binarization…
ROOT → S; S → NP VP; NP → N → people; VP → V NP PP, with V → fish, NP → N → tanks, PP → P NP, P → with, NP → N → rods

After binarization…
The flat VP → V NP PP becomes VP → V VP_V with VP_V → NP PP; the rest of the tree is unchanged.
Treebank empties and unaries (the same tree under successively stronger transforms)
PTB Tree: ROOT → S-HLN → (NP-SUBJ → -NONE- → e) (VP → VB → Atone)
NoFuncTags: ROOT → S → (NP → -NONE- → e) (VP → VB → Atone)
NoEmpties: ROOT → S → VP → VB → Atone
NoUnaries (high): ROOT → S → Atone
NoUnaries (low): ROOT → VB → Atone
Parsing
Constituency Parsing
fish people fish tanks
PCFG (rule probabilities as parameters θi):
Rule Prob
S → NP VP θ0
NP → NP NP θ1
…
N → fish θ42
N → people θ43
V → fish θ44
…
[Figure: a parse tree for "fish people fish tanks" with tags N N V N and constituents NP, NP, VP, S]
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks

Viterbi (Max) Scores
[Chart fragment: the cell for "people" holds N 0.5, V 0.1, NP 0.35; the cell for "fish" holds N 0.2, V 0.6, NP 0.14, VP 0.06; computed with the binary grammar listed in full before the worked example below]
Extended CKY parsing
• Unaries can be incorporated into the algorithm:
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated:
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital:
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser
bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k, X->YZ of q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
The CKY algorithm (1960/1965)… extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  // lexical step: fill the diagonal
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
  // binary step over spans of increasing length
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
The grammar: binary, no epsilons.
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
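Here is a hedged, runnable Python rendering of the pseudocode above on exactly this grammar (the LEX/UNARY/BINARY dictionaries are an assumed encoding, not from the slides); it reproduces the chart values worked through below:

from collections import defaultdict

LEX = {   # P(tag -> word)
    ('N', 'people'): 0.5, ('N', 'fish'): 0.2, ('N', 'tanks'): 0.2, ('N', 'rods'): 0.1,
    ('V', 'people'): 0.1, ('V', 'fish'): 0.6, ('V', 'tanks'): 0.3,
    ('P', 'with'): 1.0,
}
UNARY = {('S', 'VP'): 0.1, ('VP', 'V'): 0.1, ('NP', 'N'): 0.7}
BINARY = {
    ('S', 'NP', 'VP'): 0.9, ('VP', 'V', 'NP'): 0.5, ('VP', 'V', 'VP_V'): 0.3,
    ('VP', 'V', 'PP'): 0.1, ('VP_V', 'NP', 'PP'): 1.0, ('NP', 'NP', 'NP'): 0.1,
    ('NP', 'NP', 'PP'): 0.2, ('PP', 'P', 'NP'): 1.0,
}

def cky(words):
    n = len(words)
    score = defaultdict(float)                  # (i, j, A) -> best probability
    def close_unaries(i, j):
        added = True
        while added:                            # relax until no label improves
            added = False
            for (a, b), p in UNARY.items():
                q = p * score[i, j, b]
                if q > score[i, j, a]:
                    score[i, j, a] = q
                    added = True
    for i, w in enumerate(words):               # lexical step
        for (a, word), p in LEX.items():
            if word == w:
                score[i, i + 1, a] = p
        close_unaries(i, i + 1)
    for span in range(2, n + 1):                # binary step over growing spans
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (a, b, c), p in BINARY.items():
                    q = score[begin, split, b] * score[split, end, c] * p
                    if q > score[begin, end, a]:
                        score[begin, end, a] = q
            close_unaries(begin, end)
    return score

chart = cky('fish people fish tanks'.split())
print(chart[0, 4, 'S'])                         # ≈ 0.00018522, as in the chart below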
A worked example: the CKY chart for "fish people fish tanks" (decimals restored; each cell score[i][j] lists the best probability per label over words i..j).

Lexical step, score[i][i+1][A] = P(A → words[i]):
[0,1] fish: N 0.2, V 0.6
[1,2] people: N 0.5, V 0.1
[2,3] fish: N 0.2, V 0.6
[3,4] tanks: N 0.2, V 0.3

After handling unaries on the diagonal:
[0,1] fish: N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
[1,2] people: N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001
[2,3] fish: N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
[3,4] tanks: N 0.2, V 0.3, NP 0.14, VP 0.03, S 0.003

Span-2 cells from the binary step, prob = score[begin][split][B] · score[split][end][C] · P(A → B C):
[0,2]: NP → NP NP 0.0049, VP → V NP 0.105, S → NP VP 0.00126
[1,3]: NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
[2,4]: NP → NP NP 0.00196, VP → V NP 0.042, S → NP VP 0.00378

After handling unaries (S → VP can beat S → NP VP):
[0,2]: NP 0.0049, VP 0.105, S (via S → VP) 0.0105
[1,3]: NP 0.0049, VP 0.007, S (via S → NP VP) 0.0189
[2,4]: NP 0.00196, VP 0.042, S (via S → VP) 0.0042

Span-3 cells:
[0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
[1,4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

Span-4 cell:
[0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522
Call buildTree(score, back) to get the best parse.
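A sketch of what buildTree might look like (an assumption, not the slides' code): the back table stores a label for unary backpointers and a (split, B, C) triple for binary ones, and the tree is recovered top-down.

def build_tree(back, words, i, j, label):
    bp = back.get((i, j, label))
    if bp is None:
        return (label, words[i])               # preterminal over a single word
    if isinstance(bp, str):                    # unary backpointer: label -> bp
        return (label, [build_tree(back, words, i, j, bp)])
    split, b, c = bp                           # binary backpointer
    return (label, [build_tree(back, words, i, split, b),
                    build_tree(back, words, split, j, c)])

# e.g. build_tree(back, words, 0, len(words), 'S') follows backpointers from the root.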
Evaluating constituency parsing
Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
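The same computation as a small Python sketch (the bracket encoding is an assumption):

from collections import Counter

def prf(gold, guess):
    g, c = Counter(gold), Counter(guess)
    matched = sum((g & c).values())             # brackets in both sets
    p, r = matched / sum(c.values()), matched / sum(g.values())
    return p, r, 2 * p * r / (p + r)

gold  = [('S',0,11), ('NP',0,2), ('VP',2,9), ('VP',3,9),
         ('NP',4,6), ('PP',6,9), ('NP',7,9), ('NP',9,10)]
guess = [('S',0,11), ('NP',0,2), ('VP',2,10), ('VP',3,10),
         ('NP',4,6), ('PP',6,10), ('NP',7,10)]
print(prf(gold, guess))   # ≈ (0.429, 0.375, 0.400): LP 42.9%, LR 37.5%, F1 40.0%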
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust: usually admit everything, but with low probability
• Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse, but not a very good one, because the independence assumptions are too strong
• Gives a probabilistic language model, but in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs

Weaknesses
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree, given its POS)
A Case of PP Attachment Ambiguity
A Case of Coordination Ambiguity
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Figure: S → NP VP and NP → DT NN applying at an NP node under S]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):
Expansion: All NPs / NPs under S / NPs under VP
NP PP: 11% / 9% / 23%
DT NN: 9% / 9% / 7%
PRP: 6% / 21% / 4%
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
[Figure: an overused NP rewrite; in the PTB this construction is for possessives]
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states
[Figures: F1 (roughly 70 to 74) and grammar size in symbols (0 to 12,000) as a function of horizontal Markov order 0, 1, 2v, 2, inf]
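A hedged sketch of how horizontal Markovization can be folded into binarization (the rule encoding is an assumption): intermediate symbols remember only the previous h sisters, so states merge as h shrinks.

def hmarkov_binarize(lhs, rhs, h=1):
    # Right-binarize lhs -> rhs, keeping only the last h sisters in the names
    # of intermediate symbols (large h reproduces plain binarization).
    rules, current, seen = [], lhs, []
    while len(rhs) > 2:
        seen = (seen + [rhs[0]])[-h:] if h > 0 else []
        new_sym = lhs + '|' + '.'.join(seen)
        rules.append((current, (rhs[0], new_sym)))
        current, rhs = new_sym, rhs[1:]
    rules.append((current, tuple(rhs)))
    return rules

# hmarkov_binarize('VP', ('V', 'NP', 'NP', 'PP'), h=1) yields
# VP -> V VP|V, VP|V -> NP VP|NP, VP|NP -> NP PP: intermediate symbols after
# an NP merge regardless of what came before it.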
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
[Figures: Order 1 vs. Order 2 example trees; F1 (roughly 72 to 79) and grammar size in symbols (0 to 25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]
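Order-2 vertical Markovization is just parent annotation; a minimal sketch (the tree encoding is an assumption; preterminals are left unsplit here):

def parent_annotate(node, parent='ROOT'):
    label, children = node
    if isinstance(children[0], str):       # preterminal: leave the tag unsplit
        return (label, children)
    return (label + '^' + parent,
            tuple(parent_annotate(c, label) for c in children))

# parent_annotate(('S', (('NP', ('he',)), ('VP', (('VB', ('ran',)),)))))
# gives ('S^ROOT', (('NP', ('he',)), ('VP^S', (('VB', ('ran',)),))))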
Model F1 Size
v=h=2v 77.8 7.5K
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U
Annotation F1 Size
Base 77.8 7.5K
UNARY 78.3 8.0K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag
Annotation F1 Size
Previous 78.3 8.0K
SPLIT-IN 80.3 8.1K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Syntactic Ambiguities I
bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles
bull Particle vs prepositionThe lady dressed up the staircase
bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand
bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Syntactic Ambiguities II
bull Modifier scope within NPsimpractical design requirementsplastic cup holder
bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue
bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall
Non-Local Phenomena
bull Dislocation gappingbull Which book should Peter buy
bull A debate arose which continued until the election
bull Bindingbull Reference
bull The IRS audits itself
bull Controlbull I want to go
bull I want you to go
32
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (max) scores, e.g. for one-word spans:
people: NP 0.35, V 0.1, N 0.5
fish: VP 0.06, NP 0.14, V 0.6, N 0.2
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
• Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
• Use fenceposts
• Doesn't increase complexity; essentially like unaries
• Binarization is vital
• Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
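A direct Python transcription of bestScore, memoized so each (X, i, j) is computed once and the exponential recursion collapses into the cubic dynamic program. The rule encoding (a dict q with (X, Y, Z) keys for binary rules and (X, word) keys for lexical ones) is an assumption of this sketch, and unaries are not handled, matching the pseudocode above.

from functools import lru_cache

def make_best_score(q, sent):
    binary_rules = [r for r in q if len(r) == 3]

    @lru_cache(maxsize=None)
    def best_score(X, i, j):
        if j == i:                                # one-word span
            return q.get((X, sent[i]), 0.0)
        best = 0.0
        for (A, Y, Z) in binary_rules:
            if A != X:
                continue
            for k in range(i, j):                 # split point
                s = (q[(A, Y, Z)] * best_score(Y, i, k)
                     * best_score(Z, k + 1, j))
                best = max(best, s)
        return best

    return best_score

Calling best_score("S", 0, len(sent) - 1) returns the Viterbi probability; backpointers would be added exactly as the CKY code below adds back.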
The CKY algorithm (1960/1965)… extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
The grammar: binary, no epsilons
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
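For concreteness, here is a runnable Python sketch of the unary-extended CKY above on exactly this toy grammar. The dict-based encodings (tuple keys, a defaultdict chart) are illustrative choices of this sketch, not part of the algorithm.

from collections import defaultdict

lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}
unary = {("NP", "N"): 0.7, ("VP", "V"): 0.1, ("S", "VP"): 0.1}
binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
          ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
          ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}

def cky(words):
    n = len(words)
    score = defaultdict(float)   # (i, j, A) -> best probability for A over words i..j
    back = {}                    # backpointers: a symbol for unary, (split, B, C) for binary

    def close_unaries(i, j):
        added = True
        while added:             # keep applying unary rules while something improves
            added = False
            for (A, B), p in unary.items():
                prob = p * score[i, j, B]
                if prob > score[i, j, A]:
                    score[i, j, A] = prob
                    back[i, j, A] = B
                    added = True

    for i, w in enumerate(words):                 # lexical step
        for (A, word), p in lexicon.items():
            if word == w:
                score[i, i + 1, A] = p
        close_unaries(i, i + 1)

    for span in range(2, n + 1):                  # binary steps, then unary closure
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
            close_unaries(begin, end)
    return score, back

score, back = cky("fish people fish tanks".split())
print(score[0, 4, "S"])   # ≈ 1.8522e-05, matching the completed chart below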
Filling the chart for "fish people fish tanks" (word boundaries at positions 0–4; cell [i,j] holds, for each nonterminal, the best score over words i..j):

Lexical step (score[i][i+1][A] = P(A → words[i])):
[0,1] fish: N 0.2, V 0.6
[1,2] people: N 0.5, V 0.1
[2,3] fish: N 0.2, V 0.6
[3,4] tanks: N 0.2, V 0.3

After the unary closure on the one-word spans:
[0,1]: N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
[1,2]: N 0.5, V 0.1, NP → N 0.35, VP → V 0.01, S → VP 0.001
[2,3]: N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
[3,4]: N 0.2, V 0.3, NP → N 0.14, VP → V 0.03, S → VP 0.003

Binary step on the two-word spans (prob = score[begin][split][B] * score[split][end][C] * P(A → B C)):
[0,2]: NP → NP NP 0.0049, VP → V NP 0.105, S → NP VP 0.00126
[1,3]: NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
[2,4]: NP → NP NP 0.00196, VP → V NP 0.042, S → NP VP 0.00378

The unary closure then improves S in two of those cells:
[0,2]: S → VP 0.0105 (replaces 0.00126)
[2,4]: S → VP 0.0042 (replaces 0.00378)

Three-word spans (maximizing over the split point):
[0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
[1,4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

And the full span:
[0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.000018522
Call buildTree(score, back) to get the best parse
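The slides invoke buildTree(score, back) without showing it; one plausible reconstruction over the back table produced by the CKY sketch above (the tuple tree representation is an assumption):

def build_tree(back, words, i, j, A):
    """Read the Viterbi tree out of the backpointer table."""
    bp = back.get((i, j, A))
    if bp is None:                      # lexical cell: A -> words[i]
        return (A, words[i])
    if isinstance(bp, str):             # unary step: A -> B
        return (A, build_tree(back, words, i, j, bp))
    split, B, C = bp                    # binary step: A -> B C
    return (A, build_tree(back, words, i, split, B),
               build_tree(back, words, split, j, C))

words = "fish people fish tanks".split()
print(build_tree(back, words, 0, 4, "S"))

On this grammar it recovers (S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks)))), i.e. "[fish people] fish tanks", consistent with the chart above.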
Evaluating constituency parsing
Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
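These measures are just set operations over labeled spans; a small check of the numbers above (brackets encoded as (label, start, end) tuples is this sketch's assumption):

gold = {("S",0,11), ("NP",0,2), ("VP",2,9), ("VP",3,9),
        ("NP",4,6), ("PP",6,9), ("NP",7,9), ("NP",9,10)}
cand = {("S",0,11), ("NP",0,2), ("VP",2,10), ("VP",3,10),
        ("NP",4,6), ("PP",6,10), ("NP",7,10)}

match = len(gold & cand)                 # 3 brackets are exactly right
lp = match / len(cand)                   # 3/7 = 42.9%
lr = match / len(gold)                   # 3/8 = 37.5%
f1 = 2 * lp * lr / (lp + lr)             # 40.0%
print(f"LP={lp:.1%} LR={lr:.1%} F1={f1:.1%}")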
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
• Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
• A PCFG gives some idea of the plausibility of a parse
• But not so good, because the independence assumptions are too strong
• Gives a probabilistic language model
• But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
• (A word is independent of the rest of the tree, given its POS)
A Case of PP Attachment Ambiguity
A Case of Coordination Ambiguity
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
• So the two analyses receive the same probability
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
(figure: an S → NP VP expansion, with the subject NP expanded by NP → DT NN)
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):
All NPs: NP PP 11%, DT NN 9%, PRP 6%
NPs under S: NP PP 9%, DT NN 9%, PRP 21%
NPs under VP: NP PP 23%, DT NN 7%, PRP 4%
Non-Independence II
• Symptoms of overly strong assumptions:
• Rewrites get used where they don't belong
• (In the PTB, this construction is for possessives)
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states
(plots: parsing accuracy, roughly 70–74 F1, and grammar size, roughly 0–12,000 symbols, as the horizontal Markov order varies over 0, 1, 2v, 2, ∞)
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
(diagrams: order-1 vs. order-2 trees; plots: accuracy roughly 72–79 F1 and grammar size up to 25,000 symbols as the vertical Markov order varies over 1, 2v, 2, 3v, 3)
Model F1 Size
v=h=2v 77.8 75K
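A sketch of how the vertical half of this (order 2, i.e. parent annotation) is typically implemented on treebank trees before rule extraction; the (label, child, child, …) tuple representation and the ^ naming are assumptions of the sketch:

def parent_annotate(tree, parent=None):
    """Relabel each internal node label as label^parent (vertical order 2)."""
    label, *children = tree
    if isinstance(children[0], str):            # preterminal: leave the tag alone
        return tree
    new = f"{label}^{parent}" if parent else label
    return (new, *[parent_annotate(c, label) for c in children])

t = ("S", ("NP", ("PRP", "He")),
          ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(t))
# ('S', ('NP^S', ('PRP', 'He')), ('VP^S', ('VBD', 'was'), ('ADJP^VP', ('JJ', 'right'))))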
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation F1 Size
Base 77.8 75K
UNARY 78.3 80K
Tag Splits
• Problem: treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation F1 Size
Previous 78.3 80K
SPLIT-IN 80.3 81K
Other Tag Splits (cumulative F1 / grammar size)
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those") — 80.4 / 81K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very") — 80.5 / 81K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP) — 81.2 / 85K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97] — 81.6 / 90K
• SPLIT-CC: separate "but" and "&" from other conjunctions — 81.7 / 91K
• SPLIT-%: "%" gets its own tag — 81.8 / 93K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
• Possessive NPs
• Finite vs. infinite VPs
• Lexical heads
• Solution: annotate future elements into nodes

Annotation F1 Size
tag splits 82.3 97K
POSS-NP 83.1 98K
SPLIT-VP 85.7 105K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
• Contains a verb
• Is (non-)recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs

Annotation F1 Size
Previous 85.7 105K
BASE-NP 86.0 117K
DOMINATES-V 86.9 141K
RIGHT-REC-NP 87.0 152K

(figure: a PP attaching under VP vs. under NP, with the sites marked v / -v)
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
Lexicalized CKY
(diagram: X[h] → Y[h] Z[h'] over span i..j, with head word h in the left child, split point k, and head word h' in the right child)
e.g. (VP → VBD[saw] NP[her]) vs. (VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max of
      max over k, w, X -> Y Z of score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, X -> Y Z of score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n⁵) CKY
• Remember only a few hypotheses for each span ⟨i,j⟩
• If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
• Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
(diagram: X[h] → Y[h] Z[h'] over i..k..j)
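A minimal sketch of per-cell beam pruning over the dict-based chart from the CKY sketch earlier; note the real Collins beam operates on lexicalized, head-annotated hypotheses, and both the helper name and K are illustrative here:

import heapq

def prune_cell(score, back, i, j, K=10):
    """Keep only the K highest-scoring labels in cell (i, j); drop the rest."""
    items = [(A, p) for (a, b, A), p in list(score.items())
             if a == i and b == j]
    kept = {A for A, p in heapq.nlargest(K, items, key=lambda x: x[1])}
    for A, p in items:
        if A not in kept:
            del score[i, j, A]
            back.pop((i, j, A), None)

One would call this right after the unary closure for each span, so later spans only ever combine the surviving hypotheses.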
Parameter Estimation
A Model from Charniak (1997)
Other Details
Final Test Set Results

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6

Analysis/Evaluation (Method 2)
Dependency Accuracies
Strengths and Weaknesses of Modern Parsers
Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits
• Manually split categories:
• NP: subject vs. object
• DT: determiners vs. demonstratives
• IN: sentential vs. prepositional
• Advantages:
• Fairly compact grammar
• Linguistic motivations
• Disadvantages:
• Performance leveled out
• Manually annotated

Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories (X1 … X7 over "He was right")
• Can learn with EM, like Forward-Backward for HMMs (Forward/Outside, Backward/Inside)
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
• Automatically learned
• Disadvantages:
• Grammar gets too large
• Most categories are oversplit, while others are undersplit

Model F1
Klein & Manning '03 86.3
Matsuzaki et al '05 86.7
Refinement of the DT tag
• Hierarchical refinement: repeatedly learn more fine-grained subcategories, e.g. DT → DT-1, DT-2, DT-3, DT-4
• Start with two subcategories per non-terminal, then keep splitting
• Initialize each EM run with the output of the last
Adaptive Splitting [Petrov et al 06]
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split:
  (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results
Model F1
Previous 88.4
With 50% Merging 89.5
Number of Phrasal Subcategories
Final Results

Parser F1 (≤ 40 words) F1 (all words)
Klein & Manning '03 86.3 85.7
Matsuzaki et al '05 86.7 86.1
Collins '99 88.6 88.2
Charniak & Johnson '05 90.1 89.6
Petrov et al '06 90.2 89.7
Hierarchical Pruning
coarse: … QP NP VP …
split in two: … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four: … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … (and so on)
Parse multiple times with grammars at different levels of granularity (a pruning sketch follows)
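A sketch of the pruning predicate this suggests, assuming some helper has already computed coarse-grammar posteriors per cell; the names, the projection function, and the threshold are all illustrative, not from the slides:

def coarse_to_fine_filter(coarse_posterior, project, threshold=1e-4):
    """Admit a refined symbol into cell (i, j) only if its coarse
    projection had non-negligible posterior there."""
    def allowed(i, j, fine_symbol):
        return coarse_posterior.get((i, j, project(fine_symbol)), 0.0) > threshold
    return allowed

# e.g. project = lambda sym: sym.rstrip("0123456789")   # "NP3" -> "NP"

Each finer grammar is then parsed only over the cells and symbols the previous pass left alive.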
Bracket Posteriors
(coarse-to-fine parsing time drops from 1621 min exhaustive to 111 min, 35 min, and finally 15 min, at 91.2 F1 with no search error)
Non-Local Phenomena
• Dislocation / gapping
• Which book should Peter buy?
• A debate arose which continued until the election
• Binding
• Reference: The IRS audits itself
• Control
• I want to go
• I want you to go
A Fragment of a Noun Phrase Grammar
Extended Grammar with Prepositional Phrases
Verbs, Verb Phrases, and Sentences
PPs Modifying Verb Phrases
Complementizers and SBARs
More Verbs
Coordination
Much more remains…
Parsing: Two problems to solve
1. Repeated work…
2. Choosing the correct parse
• How do we work out the correct attachment?
• She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but…
• Words are good predictors of attachment
• Even absent full understanding
• Moscow sent more than 100,000 soldiers into Afghanistan…
• Sydney Water breached an agreement with NSW Health…
• Our statistical parsers will try to exploit such statistics
Probabilistic Context-Free Grammar
Probabilistic – or stochastic – context-free grammars (PCFGs)
• G = (Σ, N, S, R, P)
• Σ is a set of terminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• R is a set of rules/productions of the form X → γ
• P is a probability function
• P : R → [0,1]
• for each X ∈ N, ∑_{X → γ ∈ R} P(X → γ) = 1
• A grammar G generates a language model L: ∑_{s ∈ Σ*} P(s) = 1
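A minimal container for this definition, checking the properness condition that each nonterminal's expansions sum to one; the class name and the {(lhs, rhs): prob} encoding are assumptions of this sketch:

from collections import defaultdict

class PCFG:
    def __init__(self, rules):
        """rules: {(lhs, rhs_tuple): prob}; asserts sum-to-one per LHS."""
        self.rules = rules
        totals = defaultdict(float)
        for (lhs, _), p in rules.items():
            totals[lhs] += p
        for lhs, z in totals.items():
            assert abs(z - 1.0) < 1e-9, f"P(. | {lhs}) sums to {z}"

PCFG({("VP", ("Vi",)): 0.4, ("VP", ("Vt", "NP")): 0.4, ("VP", ("VP", "PP")): 0.2})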
PCFG Example: A Probabilistic Context-Free Grammar
S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0
Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5
• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is
  p(t) = ∏_{i=1}^{n} q(αi → βi)
  where q(α → β) is the probability for rule α → β
Example of a PCFG
Probability of a Parse (using the same grammar as above)
• t1 = "The man sleeps":
(S (NP (DT The) (NN man)) (VP (Vi sleeps)))
p(t1) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084
• t2 = "The man saw the woman with the telescope":
(S (NP (DT The) (NN man)) (VP (VP (Vt saw) (NP (DT the) (NN woman))) (PP (IN with) (NP (DT the) (NN telescope)))))
p(t2) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1 ≈ 1.5 × 10⁻⁵
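Both products written out explicitly; the factors are read off the grammar above (treating IN as the P in PP → P NP, as the example tree does):

from math import prod

p_t1 = prod([1.0, 0.3, 1.0, 0.7, 0.4, 1.0])     # "The man sleeps"
p_t2 = prod([1.0,                               # S -> NP VP
             0.3, 1.0, 0.7,                     # NP -> DT NN, the, man
             0.2, 0.4, 1.0,                     # VP -> VP PP, VP -> Vt NP, saw
             0.3, 1.0, 0.2,                     # NP -> DT NN, the, woman
             1.0, 0.5,                          # PP -> P NP, with
             0.3, 1.0, 0.1])                    # NP -> DT NN, the, telescope
print(p_t1, p_t2)                               # 0.084 and ~1.512e-05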
PCFGs: Learning and Inference
• Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1}^{n} q(αi → βi)
• Learning: read the rules off of labeled sentences, use ML estimates for probabilities, and use all of our standard smoothing tricks
• Inference: for input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s)
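The ML estimation step, without smoothing, as a short sketch: q(α → β) = count(α → β) / count(α), read off tuple-encoded trees (the representation and function name are assumptions):

from collections import Counter

def estimate(trees):
    """Read a PCFG off a treebank with maximum-likelihood estimates."""
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(t):
        label, *children = t
        if isinstance(children[0], str):        # preterminal: A -> word
            rhs = (children[0],)
        else:
            rhs = tuple(c[0] for c in children)
            for c in children:
                visit(c)
        rule_counts[label, rhs] += 1
        lhs_counts[label] += 1
    for t in trees:
        visit(t)
    return {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}

t = ("S", ("NP", ("N", "people")), ("VP", ("V", "fish")))
print(estimate([t]))   # every rule occurs once per LHS here, so all probs are 1.0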
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
A Fragment of a Noun Phrase Grammar
33
Extended Grammar with Prepositional Phrases
34
Verbs Verb Phrases and Sentences
35
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing: Two problems to solve. 1. Repeated work…
Parsing: Two problems to solve. 2. Choosing the correct parse
• How do we work out the correct attachment?
• She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but…
• Words are good predictors of attachment
• Even absent full understanding
• Moscow sent more than 100,000 soldiers into Afghanistan…
• Sydney Water breached an agreement with NSW Health…
• Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic – or stochastic – context-free grammars (PCFGs)
• G = (T, N, S, R, P)
• T is a set of terminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• R is a set of rules/productions of the form X → γ, with X ∈ N and γ a sequence of terminals and nonterminals
• P is a probability function
• P: R → [0,1]
• for each nonterminal X, the probabilities of its expansions sum to 1: ∑_{X→γ ∈ R} P(X → γ) = 1
• A grammar G generates a language model L: ∑_{γ ∈ T*} P(γ) = 1
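To make the definition concrete, here is a minimal sketch (Python; an illustrative encoding, not from the slides) of a PCFG as a rule table, checking that P gives a proper distribution over the expansions of each nonterminal:

rules = {
    # lhs: {rhs_tuple: probability}  (the example grammar from the next slide)
    "S":  {("NP", "VP"): 1.0},
    "VP": {("Vi",): 0.4, ("Vt", "NP"): 0.4, ("VP", "PP"): 0.2},
    "NP": {("DT", "NN"): 0.3, ("NP", "PP"): 0.7},
    "PP": {("P", "NP"): 1.0},
}

def is_proper(rules, tol=1e-9):
    # P : R -> [0,1] must sum to 1 over the rules for each left-hand side
    return all(abs(sum(dist.values()) - 1.0) < tol for dist in rules.values())

assert is_proper(rules)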
PCFG Example: A Probabilistic Context-Free Grammar (PCFG)
S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0
Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5
• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is p(t) = ∏_{i=1}^{n} q(αi → βi), where q(α → β) is the probability for rule α → β
44
Example of a PCFG
48
Probability of a Parse
The man sleeps
The man saw the woman with the telescope

t1 = [S [NP [DT the] [NN man]] [VP [Vi sleeps]]]
p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0 = 0.084

t2 = [S [NP [DT the] [NN man]] [VP [VP [Vt saw] [NP [DT the] [NN woman]]] [PP [P with] [NP [DT the] [NN telescope]]]]]
p(t2) = 1.0 · 0.3 · 1.0 · 0.7 · 0.2 · 0.4 · 1.0 · 0.3 · 1.0 · 0.2 · 1.0 · 0.5 · 0.3 · 1.0 · 0.1 ≈ 1.5 × 10⁻⁵
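The same arithmetic as a short sketch (Python; the rule list for t1 is transcribed from the tree above):

from math import prod  # Python 3.8+

q = {("S", ("NP", "VP")): 1.0, ("NP", ("DT", "NN")): 0.3,
     ("DT", ("the",)): 1.0, ("NN", ("man",)): 0.7,
     ("VP", ("Vi",)): 0.4, ("Vi", ("sleeps",)): 1.0}

t1_rules = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("DT", ("the",)),
            ("NN", ("man",)), ("VP", ("Vi",)), ("Vi", ("sleeps",))]

p_t1 = prod(q[r] for r in t1_rules)
print(p_t1)  # 0.084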
PCFGs: Learning and Inference
• Model: the probability of a tree t with n rules αi → βi, i = 1..n
• Learning: read the rules off of labeled sentences, use ML estimates for probabilities, and use all of our standard smoothing tricks
• Inference: for an input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s)
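A minimal sketch of the learning step, assuming trees encoded as nested (label, children...) tuples with bare strings as words (a hypothetical encoding; add smoothing in practice):

from collections import Counter

def count_rules(tree, rule_counts, lhs_counts):
    if isinstance(tree, str):          # a word: no rule to read off
        return
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for c in children:
        count_rules(c, rule_counts, lhs_counts)

def ml_estimates(treebank):
    # q(alpha -> beta) = count(alpha -> beta) / count(alpha)
    rule_counts, lhs_counts = Counter(), Counter()
    for t in treebank:
        count_rules(t, rule_counts, lhs_counts)
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

print(ml_estimates([("S", ("NP", "people"), ("VP", "fish"))]))
# {('S', ('NP', 'VP')): 1.0, ('NP', ('people',)): 1.0, ('VP', ('fish',)): 1.0}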
Grammar Transforms
51
Chomsky Normal Form
• All rules are of the form X → Y Z or X → w
• X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
• That is, it recognizes the same language
• But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules are divided by introducing new nonterminals (n > 2)
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps
S → NP VP
S → VP
VP → V NP
VP → V
VP → V NP PP
VP → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps
S → NP VP
VP → V NP
S → V NP
VP → V
S → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps
S → NP VP
VP → V NP
S → V NP
VP → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
V → fish
S → fish
V → tanks
S → tanks
P → with
Chomsky Normal Form steps
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
Chomsky Normal Form steps
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form steps
S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain
• Reconstructing n-aries is easy
• Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
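The binarization step alone is easy to sketch (Python; a hypothetical helper that follows the VP_V naming convention used in the steps above):

def binarize(lhs, rhs):
    # X -> Y1 Y2 ... Yn (n > 2) becomes X -> Y1 X_Y1, X_Y1 -> Y2 ..., via fresh symbols
    out, current = [], lhs
    while len(rhs) > 2:
        new_sym = current + "_" + rhs[0]
        out.append((current, (rhs[0], new_sym)))
        current, rhs = new_sym, rhs[1:]
    out.append((current, tuple(rhs)))
    return out

print(binarize("VP", ("V", "NP", "PP")))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]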
An example: before binarization…
[Tree: ROOT → S; S → NP VP; NP → N (people); VP → V (fish) NP (N tanks) PP; PP → P (with) NP (N rods)]
After binarization…
[Same tree, with the ternary VP → V NP PP rebuilt as VP → V VP_V and VP_V → NP PP, so every node has at most two children]
Treebank: empties and unaries
PTB Tree: [ROOT [S-HLN [NP-SUBJ [-NONE- e]] [VP [VB Atone]]]]
NoFuncTags: [ROOT [S [NP [-NONE- e]] [VP [VB Atone]]]]
NoEmpties: [ROOT [S [VP [VB Atone]]]]
NoUnaries (keep high label): [ROOT [S Atone]]
NoUnaries (keep low label): [ROOT [VB Atone]]
Parsing
66
Constituency Parsing
fish people fish tanks

PCFG
Rule          Prob
S → NP VP     θ0
NP → NP NP    θ1
…
N → fish      θ42
N → people    θ43
V → fish      θ44
…

[Target tree: [S [NP [NP [N fish]] [NP [N people]]] [VP [V fish] [NP [N tanks]]]]]
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks

Viterbi (Max) Scores
[Chart cells for "people fish": people: NP 0.35, V 0.1, N 0.5; fish: VP 0.06, NP 0.14, V 0.6, N 0.2]

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
• Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
• Use fenceposts
• Doesn't increase complexity; essentially like unaries
• Binarization is vital
• Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
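The repeated work disappears once bestScore is memoized; a small sketch under assumed encodings (binary rules as {(X, (Y, Z)): q}; the single-word cells here are already unary-closed, with values taken from the "people fish" chart above):

from functools import lru_cache

grammar = {("S", ("NP", "VP")): 0.9, ("VP", ("V", "NP")): 0.5,
           ("NP", ("NP", "NP")): 0.1}
lexicon = {("NP", "people"): 0.35, ("V", "people"): 0.1, ("N", "people"): 0.5,
           ("VP", "fish"): 0.06, ("NP", "fish"): 0.14, ("V", "fish"): 0.6}
sent = ("people", "fish")

@lru_cache(maxsize=None)
def best_score(X, i, j):
    if j == i:                              # base case: a single word
        return lexicon.get((X, sent[i]), 0.0)
    best = 0.0
    for (lhs, (Y, Z)), q in grammar.items():
        if lhs != X:
            continue
        for k in range(i, j):               # the max over split points k
            best = max(best, q * best_score(Y, i, k) * best_score(Z, k + 1, j))
    return best

print(best_score("S", 0, 1))  # 0.9 * 0.35 * 0.06 = 0.0189, as in the chart above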
The CKY algorithm (1960/1965)… extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A]  = B
            added = true
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = B
            added = true
  return buildTree(score, back)
The grammar: binary, no epsilons
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
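A runnable sketch of the whole algorithm on this grammar (Python; dictionary cells instead of 3-d arrays, backpointers omitted for brevity). It reproduces, up to floating point, the chart values worked through below, e.g. score[0][4][S] = 0.00018522:

binary = {("S", ("NP", "VP")): 0.9, ("VP", ("V", "NP")): 0.5,
          ("VP", ("V", "VP_V")): 0.3, ("VP", ("V", "PP")): 0.1,
          ("VP_V", ("NP", "PP")): 1.0, ("NP", ("NP", "NP")): 0.1,
          ("NP", ("NP", "PP")): 0.2, ("PP", ("P", "NP")): 1.0}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def unary_closure(cell):
    added = True
    while added:                      # the "handle unaries" loop from the pseudocode
        added = False
        for (A, B), q in unary.items():
            p = q * cell.get(B, 0.0)
            if p > cell.get(A, 0.0):
                cell[A] = p
                added = True

def cky(words):
    n = len(words)
    score = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):     # lexical step
        for (A, word), q in lexicon.items():
            if word == w:
                score[i][i + 1][A] = q
        unary_closure(score[i][i + 1])
    for span in range(2, n + 1):      # binary step over increasing spans
        for begin in range(n - span + 1):
            end, cell = begin + span, score[begin][begin + span]
            for split in range(begin + 1, end):
                for (A, (B, C)), q in binary.items():
                    p = q * score[begin][split].get(B, 0.0) * score[split][end].get(C, 0.0)
                    if p > cell.get(A, 0.0):
                        cell[A] = p
            unary_closure(cell)
    return score

chart = cky("fish people fish tanks".split())
print(chart[0][4]["S"])  # ~0.00018522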
Chart cells for "fish people fish tanks" (fenceposts 0 fish 1 people 2 fish 3 tanks 4):
score[0][1] score[1][2] score[2][3] score[3][4]
score[0][2] score[1][3] score[2][4]
score[0][3] score[1][4]
score[0][4]

Lexical step: score[i][i+1][A] = P(A -> words[i]):
score[0][1] (fish): N → fish 0.2, V → fish 0.6
score[1][2] (people): N → people 0.5, V → people 0.1
score[2][3] (fish): N → fish 0.2, V → fish 0.6
score[3][4] (tanks): N → tanks 0.2, V → tanks 0.3

After the unary loop on each lexical cell:
score[0][1]: N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
score[1][2]: N 0.5, V 0.1, NP → N 0.35, VP → V 0.01, S → VP 0.001
score[2][3]: N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
score[3][4]: N 0.2, V 0.3, NP → N 0.14, VP → V 0.03, S → VP 0.003

Span-2 cells, binary step (prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)):
score[0][2]: NP → NP NP 0.0049, VP → V NP 0.105, S → NP VP 0.00126
score[1][3]: NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
score[2][4]: NP → NP NP 0.00196, VP → V NP 0.042, S → NP VP 0.00378

…then the unary loop improves S wherever S → VP beats S → NP VP:
score[0][2]: NP 0.0049, VP 0.105, S → VP 0.0105
score[1][3]: NP 0.0049, VP 0.007, S → NP VP 0.0189
score[2][4]: NP 0.00196, VP 0.042, S → VP 0.0042

Span-3 cells:
score[0][3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
score[1][4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

Span-4 cell:
score[0][4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522
Call buildTree(score, back) to get the best parse
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
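The computation in this example, as a sketch (Python; brackets encoded as (label, start, end) triples):

def parseval(gold, cand):
    gold, cand = set(gold), set(cand)
    matched = len(gold & cand)
    lp = matched / len(cand)          # labeled precision
    lr = matched / len(gold)          # labeled recall
    return lp, lr, 2 * lp * lr / (lp + lr)

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(parseval(gold, cand))  # (0.4285..., 0.375, 0.4)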
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
• Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
• A PCFG gives some idea of the plausibility of a parse
• But not so good, because the independence assumptions are too strong
• Gives a probabilistic language model
• But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
• (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences: Close Attachment
93
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability
94
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Figure: S → NP VP, with the subject NP expanding (e.g. NP → DT NN) independently of its S context]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%

Non-Independence II
• Symptoms of overly strong assumptions:
• Rewrites get used where they don't belong
[In the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing
99
Horizontal Markovization
• Horizontal Markovization: merges states
[Charts: parsing accuracy (F1, roughly 70–74) and grammar size (symbols, roughly 3,000–12,000) as a function of horizontal Markov order: 0, 1, 2v, 2, ∞]
Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation)
[Example trees: Order 1 vs. Order 2]
[Charts: F1 (roughly 72–79) and grammar size (symbols, up to about 25,000) as a function of vertical Markov order: 1, 2v, 2, 3v, 3]
Model     F1     Size
v=h=2v    77.8   7.5K
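Order-2 vertical Markovization is just a tree transform run before reading off the grammar; a minimal sketch, assuming the same (label, children...) tuple encoding as earlier:

def parent_annotate(tree, parent="ROOT"):
    if isinstance(tree, str):                 # leave the words themselves alone
        return tree
    label, children = tree[0], tree[1:]
    return (label + "^" + parent,) + tuple(parent_annotate(c, label) for c in children)

t = ("S", ("NP", "he"), ("VP", ("V", "was"), ("ADJP", "right")))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', 'he'), ('VP^S', ('V^VP', 'was'), ('ADJP^VP', 'right')))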
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution:
• Subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those") (F1 80.4, Size 8.1K)
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very") (F1 80.5, Size 8.1K)
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP) (F1 81.2, Size 8.5K)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97] (F1 81.6, Size 9.0K)
• SPLIT-CC: separate "but" and "&" from other conjunctions (F1 81.7, Size 9.1K)
• SPLIT-%: "%" gets its own tag (F1 81.8, Size 9.3K)
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
• Possessive NPs
• Finite vs. infinite VPs
• Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
• Contains a verb
• Is (non-)recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs

Annotation     F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

[Figure: an NP/VP/PP attachment configuration with the verb-dominating site marked v and the other site marked -v]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs
109
Heads in Context-Free Rules
110
Heads
111
Rules to Recover Heads: An Example for NPs
112
Rules to Recover Heads: An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
[Chart item: X[h] over span (i, j), built from Y[h] over (i, k) and Z[h'] over (k, j)]
(VP -> VBD[saw] NP[her]) vs. (VP -> VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max over k, the other head word w, and X -> Y Z of
      max( score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k+1, j, w),
           score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k+1, j, h) )
Parsing with Lexicalized CFGs
119
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n⁵) CKY
• Remember only a few hypotheses for each span <i,j>
• If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
• Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
[Chart item: X[h] over (i, j), from Y[h] over (i, k) and Z[h'] over (k, j)]
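A per-cell beam is a one-line filter on the dictionary cells of the CKY sketch above (hypothetical beam width K):

import heapq

def prune_cell(cell, K=5):
    # keep only the K highest-scoring hypotheses for this span
    if len(cell) <= K:
        return cell
    return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

# e.g. in cky(): score[begin][end] = prune_cell(cell) right after filling each span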
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Analysis/Evaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits
• Manually split categories:
• NP: subject vs object
• DT: determiners vs demonstratives
• IN: sentential vs prepositional
• Advantages:
• Fairly compact grammar
• Linguistic motivations
• Disadvantages:
• Performance leveled out
• Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Figure: a tree over "He was right" with latent node labels X1 … X7]
• Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
• Automatically learned
• Disadvantages:
• Grammar gets too large
• Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7
Refinement of the DT tag
[Figure: DT split into DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last
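The split step alone, sketched (Python; a hypothetical helper, with the EM re-estimation itself, Inside-Outside, omitted). Each occurrence of the symbol is split in two, and the copied probabilities are perturbed slightly so EM can break the symmetry:

import itertools, random

def split_symbol(rules, sym, noise=0.01):
    # rules: {(lhs, rhs_tuple): prob}; sym becomes sym-1 / sym-2 everywhere
    def variants(x):
        return (x + "-1", x + "-2") if x == sym else (x,)
    new_rules = {}
    for (lhs, rhs), p in rules.items():
        rhs_variants = list(itertools.product(*(variants(y) for y in rhs)))
        for new_lhs in variants(lhs):
            share = p / len(rhs_variants)        # each new lhs stays normalized
            for new_rhs in rhs_variants:
                new_rules[(new_lhs, new_rhs)] = share * (1 + random.uniform(-noise, noise))
    return new_rules

print(split_symbol({("NP", ("DT", "NN")): 0.3}, "DT"))
# {('NP', ('DT-1', 'NN')): ~0.15, ('NP', ('DT-2', 'NN')): ~0.15}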
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al 06]

Adaptive Splitting
• Evaluate the loss in likelihood from removing each split: (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al '05      86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al 06          90.2              89.7
Hierarchical Pruning
coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … (and so on) …
Parse multiple times with grammars at different levels of granularity
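The key test in each finer pass, sketched (Python; hypothetical names: coarse_posterior maps (coarse symbol, begin, end) to a posterior from the previous pass, and project maps each refined symbol to its coarser version):

THRESHOLD = 1e-4  # illustrative pruning threshold

def allowed(fine_sym, begin, end, coarse_posterior, project):
    # skip a refined constituent whose coarse projection was already improbable
    return coarse_posterior.get((project[fine_sym], begin, end), 0.0) > THRESHOLD

# e.g. project = {"NP1": "NP", "NP2": "NP", "VP1": "VP", ...}; in the finer CKY pass,
# only fill score[begin][end][A] when allowed(A, begin, end, ...) holds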
Bracket Posteriors
[Figure: bracket posteriors as the grammar is refined; parsing time falls from 1621 min to 111 min to 35 min to 15 min, at 91.2 F1 with no search error]
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
PPs Modifying Verb Phrases
36
Complementizers and SBARs
37
More Verbs
38
Coordination
39
Much more remainshellip
40
Parsing: Two problems to solve
1. Repeated work…
2. Choosing the correct parse
• How do we work out the correct attachment?
 • She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but…
• Words are good predictors of attachment
 • Even absent full understanding
 • Moscow sent more than 100,000 soldiers into Afghanistan…
 • Sydney Water breached an agreement with NSW Health…
• Our statistical parsers will try to exploit such statistics
Probabilistic Context-Free Grammars
Probabilistic – or stochastic – context-free grammars (PCFGs)
• G = (Σ, N, S, R, P)
 • Σ is a set of terminal symbols
 • N is a set of nonterminal symbols
 • S is the start symbol (S ∈ N)
 • R is a set of rules/productions of the form X → γ, where X ∈ N and γ is a sequence of terminals and nonterminals
 • P is a probability function
  • P: R → [0,1]
  • for each X ∈ N, ∑ over rules X → γ in R of P(X → γ) = 1
• A grammar G generates a language model L: ∑ over strings γ ∈ Σ* of P(γ) = 1
PCFG Example: A Probabilistic Context-Free Grammar (PCFG)
S → NP VP 1.0
VP → Vi 0.4
VP → Vt NP 0.4
VP → VP PP 0.2
NP → DT NN 0.3
NP → NP PP 0.7
PP → P NP 1.0
Vi → sleeps 1.0
Vt → saw 1.0
NN → man 0.7
NN → woman 0.2
NN → telescope 0.1
DT → the 1.0
IN → with 0.5
IN → in 0.5
• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is
 p(t) = ∏ from i = 1 to n of q(αi → βi)
 where q(α → β) is the probability for rule α → β
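As a concrete check of this product-of-rules definition, here is a minimal sketch (mine, not the slides'): the example PCFG as a Python dict, used to score t1, the parse of "the man sleeps" shown below.

    # A minimal sketch: the example PCFG as a dict from (lhs, rhs) to q,
    # and p(t) computed as the product of the probabilities of the rules used.
    q = {
        ("S", ("NP", "VP")): 1.0,
        ("VP", ("Vi",)): 0.4,
        ("NP", ("DT", "NN")): 0.3,
        ("DT", ("the",)): 1.0,
        ("NN", ("man",)): 0.7,
        ("Vi", ("sleeps",)): 1.0,
    }

    # t1 as a nested (label, children...) tuple
    t1 = ("S",
          ("NP", ("DT", "the"), ("NN", "man")),
          ("VP", ("Vi", "sleeps")))

    def p(tree):
        """Product of rule probabilities over the tree."""
        label, *children = tree
        if not isinstance(children[0], tuple):    # preterminal rule, e.g. DT -> the
            return q[(label, (children[0],))]
        rhs = tuple(child[0] for child in children)
        prob = q[(label, rhs)]
        for child in children:
            prob *= p(child)
        return prob

    print(p(t1))   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084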
Example of a PCFG
Probability of a Parse
• t1: The man sleeps
• t2: The man saw the woman with the telescope
(tree diagrams: t1 uses S → NP VP, NP → DT NN, VP → Vi; t2 attaches the PP to the verb phrase via VP → VP PP)
p(t1) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084
p(t2) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1 ≈ 1.5 × 10⁻⁵
PCFGs: Learning and Inference
• Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏ q(αi → βi)
• Learning: read the rules off of labeled sentences, use ML estimates for probabilities, and use all of our standard smoothing tricks!
• Inference: for input sentence s, define T(s) to be the set of trees whose yield is s (whole leaves, read left to right, match the words in s)
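The maximum-likelihood estimates mentioned here are just normalized rule counts read off the treebank; a minimal rendering of what the slide leaves implicit (standard PCFG estimation):

    q_{ML}(\alpha \to \beta) = \frac{\mathrm{count}(\alpha \to \beta)}{\mathrm{count}(\alpha)}

with counts taken over all rule applications in the labeled trees, optionally smoothed as noted above.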
Grammar Transforms
Chomsky Normal Form
• All rules are of the form X → Y Z or X → w
 • X, Y, Z ∈ N, and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
 • That is, it recognizes the same language
 • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules (n > 2) are divided by introducing new nonterminals
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people | fish | tanks | rods
V → people | fish | tanks
P → with
Chomsky Normal Form steps (remove the empty NP → e; rules gain variants with an NP dropped)
S → NP VP
S → VP
VP → V NP
VP → V
VP → V NP PP
VP → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with

Chomsky Normal Form steps (eliminate the unary S → VP)
S → NP VP
VP → V NP
S → V NP
VP → V
S → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with

Chomsky Normal Form steps (eliminate the unary S → V)
S → NP VP
VP → V NP
S → V NP
VP → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
P → with

Chomsky Normal Form steps (eliminate the unary VP → V)
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with

Chomsky Normal Form steps (eliminate the remaining unaries NP → NP, NP → PP, NP → N, PP → P)
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
PP → with

Chomsky Normal Form steps (binarize the ternary rules with new symbols VP_V and S_V)
S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
PP → with
A phrase structure grammar and its Chomsky Normal Form (the original grammar and the final transformed grammar above, shown side by side for comparison)
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain
 • Reconstructing n-aries is easy
 • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
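A minimal sketch of the binarization step (mine, not the slides' code): an n-ary rule X → Y1 Y2 … Yn is split into binary rules by introducing intermediate symbols, mirroring the VP_V naming used above.

    def binarize(lhs, rhs):
        """Turn (lhs, [Y1..Yn]) into rules with at most 2 symbols on the right."""
        rules = []
        while len(rhs) > 2:
            # New intermediate symbol, e.g. VP -> V NP PP yields VP_V -> NP PP
            new_sym = f"{lhs}_{rhs[0]}"
            rules.append((lhs, [rhs[0], new_sym]))
            lhs, rhs = new_sym, rhs[1:]
        rules.append((lhs, list(rhs)))
        return rules

    print(binarize("VP", ["V", "NP", "PP"]))
    # [('VP', ['V', 'VP_V']), ('VP_V', ['NP', 'PP'])]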
An example: before binarization…
(tree: ROOT → S; S → NP VP; NP → N → people; VP → V NP PP, with V = fish, NP → N → tanks, PP → P NP, P = with, NP → N → rods)
After binarization…
(the same tree, except VP → V VP_V and VP_V → NP PP)
Treebank: empties and unaries
• PTB tree: ROOT → S-HLN → (NP-SBJ → -NONE- → e) (VP → VB → Atone)
• NoFuncTags: ROOT → S → (NP → -NONE- → e) (VP → VB → Atone)
• NoEmpties: ROOT → S → VP → VB → Atone
• NoUnaries, keeping the high label: ROOT → S → Atone
• NoUnaries, keeping the low label: ROOT → VB → Atone
Parsing
Constituency Parsing
• Sentence: fish people fish tanks
• PCFG rules with probabilities θi:
 S → NP VP θ0
 NP → NP NP θ1
 …
 N → fish θ42
 N → people θ43
 V → fish θ44
 …
(example tree: S over NP [fish people] and VP [fish tanks], with preterminals N N V N)

Cocke-Kasami-Younger (CKY) Constituency Parsing
• Sentence: fish people fish tanks
• Viterbi (max) scores, e.g. for the one-word spans 'people' and 'fish':
 people: N 0.5, V 0.1, NP 0.35
 fish: N 0.2, V 0.6, NP 0.14, VP 0.06
• The grammar:
 S → NP VP 0.9
 S → VP 0.1
 VP → V NP 0.5
 VP → V 0.1
 VP → V VP_V 0.3
 VP → V PP 0.1
 VP_V → NP PP 1.0
 NP → NP NP 0.1
 NP → NP PP 0.2
 NP → N 0.7
 PP → P NP 1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
 • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
 • Use fenceposts
 • Doesn't increase complexity; essentially like unaries
• Binarization is vital
 • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser
bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
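As written, this recursion recomputes the same (X, i, j) subproblems exponentially often; memoizing them is exactly what turns it into the CKY dynamic program that follows. A hedged Python sketch of the memoized version (the dict-based grammar representation is my assumption, not the slides'):

    from functools import lru_cache

    # Assumed representation: binary_rules maps X to a list of (Y, Z, prob);
    # lexicon maps (X, word) to prob. Indices follow the slide: j == i is one word.
    def best_score(words, binary_rules, lexicon):
        @lru_cache(maxsize=None)
        def best(X, i, j):
            if j == i:
                return lexicon.get((X, words[i]), 0.0)
            return max(
                (q * best(Y, i, k) * best(Z, k + 1, j)
                 for Y, Z, q in binary_rules.get(X, [])
                 for k in range(i, j)),
                default=0.0,
            )
        return best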
function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
The CKY algorithm (1960/1965) … extended to unaries
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
The CKY algorithm (1960/1965) … extended to unaries
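A runnable Python transcription of the two pseudocode slides above; the dict-based grammar representation (lexicon, unary_rules, binary_rules) and the start-symbol name "S" are my assumptions, not the slides'.

    from collections import defaultdict

    def cky(words, lexicon, unary_rules, binary_rules):
        """Probabilistic CKY with unary closure, following the slides' pseudocode.
        lexicon[(A, w)] = P(A -> w); unary_rules[(A, B)] = P(A -> B);
        binary_rules[(A, B, C)] = P(A -> B C)."""
        n = len(words)
        score = defaultdict(float)      # (begin, end, A) -> best probability
        back = {}                       # (begin, end, A) -> backpointer

        def apply_unaries(begin, end):
            added = True
            while added:
                added = False
                for (A, B), q in unary_rules.items():
                    prob = q * score[(begin, end, B)]
                    if prob > score[(begin, end, A)]:
                        score[(begin, end, A)] = prob
                        back[(begin, end, A)] = B
                        added = True

        for i, w in enumerate(words):
            for (A, word), q in lexicon.items():
                if word == w:
                    score[(i, i + 1, A)] = q
            apply_unaries(i, i + 1)

        for span in range(2, n + 1):
            for begin in range(0, n - span + 1):
                end = begin + span
                for split in range(begin + 1, end):
                    for (A, B, C), q in binary_rules.items():
                        prob = score[(begin, split, B)] * score[(split, end, C)] * q
                        if prob > score[(begin, end, A)]:
                            score[(begin, end, A)] = prob
                            back[(begin, end, A)] = (split, B, C)
                apply_unaries(begin, end)

        def build(begin, end, A):
            bp = back.get((begin, end, A))
            if bp is None:
                return (A, words[begin])              # lexical rule
            if isinstance(bp, str):
                return (A, build(begin, end, bp))     # unary
            split, B, C = bp
            return (A, build(begin, split, B), build(split, end, C))

        return build(0, n, "S"), score[(0, n, "S")]

With the grammar on the next slide, cky("fish people fish tanks".split(), …) should recover S → NP VP over the whole sentence with probability 0.00018522, matching the chart walkthrough below.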
The grammar: binary, no epsilons
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
The chart: cells score[begin][end] for fenceposts 0 ≤ begin < end ≤ 4 over "fish people fish tanks":
 score[0][1] score[0][2] score[0][3] score[0][4]
        score[1][2] score[1][3] score[1][4]
               score[2][3] score[2][4]
                      score[3][4]

Lexicon step, for i = 0..3: score[i][i+1][A] = P(A → words[i]):
 [0,1] fish: N 0.2, V 0.6
 [1,2] people: N 0.5, V 0.1
 [2,3] fish: N 0.2, V 0.6
 [3,4] tanks: N 0.2, V 0.3
(the deck's cell diagrams read "V tanks 0.1" here, but 0.3, as in the grammar, is what the VP and S scores below assume)

Unary closure on the diagonal (handle unaries):
 [0,1] fish: NP → N 0.14, VP → V 0.06, S → VP 0.006
 [1,2] people: NP → N 0.35, VP → V 0.01, S → VP 0.001
 [2,3] fish: NP → N 0.14, VP → V 0.06, S → VP 0.006
 [3,4] tanks: NP → N 0.14, VP → V 0.03, S → VP 0.003

Binary rules over span 2, with prob = score[begin][split][B] · score[split][end][C] · P(A → B C):
 [0,2] fish people: NP → NP NP 0.0049, VP → V NP 0.105, S → NP VP 0.00126
 [1,3] people fish: NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
 [2,4] fish tanks: NP → NP NP 0.00196, VP → V NP 0.042, S → NP VP 0.00378

Unary closure over span 2: S → VP overtakes S → NP VP in two cells:
 [0,2]: S → VP 0.0105
 [1,3]: unchanged, S → NP VP 0.0189
 [2,4]: S → VP 0.0042

Span 3:
 [0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
 [1,4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

Span 4:
 [0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse
Evaluating constituency parsing
Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
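A small sketch of this PARSEVAL computation on the bracket sets above (bracket representation is mine; the numbers match the slide):

    # Brackets are (label, start, end) triples.
    gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
            ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
    cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
            ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}

    matched = len(gold & cand)                         # 3 correct brackets
    precision = matched / len(cand)                    # 3/7 = 42.9%
    recall = matched / len(gold)                       # 3/8 = 37.5%
    f1 = 2 * precision * recall / (precision + recall)  # 40.0%
    print(f"LP={precision:.1%} LR={recall:.1%} F1={f1:.1%}")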
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
 • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
 • A PCFG gives some idea of the plausibility of a parse
 • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
 • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
Weaknesses
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
 • (a word is independent of the rest of the tree given its POS!)

A Case of PP Attachment Ambiguity
A Case of Coordination Ambiguity
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
 • Thus both analyses receive the same probability
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
 • At any node, the material inside that node is independent of the material outside that node, given the label of that node
 • Any information that statistically connects behavior inside and outside a node must flow through that node's label
(tree schema: an NP node under S → NP VP, expanding by NP → DT NN)
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

 Expansion   All NPs   NPs under S   NPs under VP
 NP PP       11%       9%            23%
 DT NN       9%        9%            7%
 PRP         6%        21%           4%
Non-Independence II
• Symptoms of overly strong assumptions:
 • Rewrites get used where they don't belong
 • (in the PTB, this construction is for possessives)
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states
(charts: parsing accuracy, roughly 70–74% F1, and grammar size, roughly 3,000–12,000 symbols, as a function of horizontal Markov order 0, 1, 2v, 2, ∞)
Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation)
• Order 1 vs. Order 2
(charts: parsing accuracy, roughly 72–79% F1, and grammar size, roughly 5,000–25,000 symbols, as a function of vertical Markov order 1, 2v, 2, 3v, 3)

Model F1 Size
v=h=2v 77.8 7.5K
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation F1 Size
Base 77.8 7.5K
UNARY 78.3 8.0K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution:
 • Subdivide the IN tag

Annotation F1 Size
Previous 78.3 8.0K
SPLIT-IN 80.3 8.1K
Other Tag Splits (cumulative; F1 / grammar size after each)
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those"): 80.4 / 8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very"): 80.5 / 8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP): 81.2 / 8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]: 81.6 / 9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions: 81.7 / 9.1K
• SPLIT-%: "%" gets its own tag: 81.8 / 9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
 • Possessive NPs
 • Finite vs. infinite VPs
 • Lexical heads!
• Solution: annotate future elements into nodes

Annotation F1 Size
tag splits 82.3 9.7K
POSS-NP 83.1 9.8K
SPLIT-VP 85.7 10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
 • Contains a verb
 • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
(diagram: a PP attaching under a VP, marked v for "dominates a verb", vs. under a lower NP, marked -v)

Annotation F1 Size
Previous 85.7 10.5K
BASE-NP 86.0 11.7K
DOMINATES-V 86.9 14.1K
RIGHT-REC-NP 87.0 15.2K
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
Lexicalized CKY
(VP → VBD[saw] NP[her]) is scored as (VP → VBD NP)[saw]
(diagram: X[h] spanning i..j, built from Y[h] over i..k and Z[h'] over k..j)

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max over k, h', X -> Y Z of
      max( score(X[h] -> Y[h] Z[h']) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h'),
           score(X[h] -> Y[h'] Z[h]) * bestScore(Y, i, k, h') * bestScore(Z, k, j, h) )
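A quick accounting of the cost (my summary, consistent with the O(n⁵) figure quoted on the beam-pruning slide below): bestScore ranges over O(n²) spans, within each span it maximizes over the split point k and the co-head word h', and the head h adds one more factor of n:

    O(n^2 \cdot n \cdot n \cdot n) = O(n^5)

times a grammar constant, versus O(n³) for the unlexicalized CKY above.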
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
 • Essentially, run the O(n⁵) CKY
 • Remember only a few hypotheses for each span ⟨i,j⟩
 • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
 • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed!)
(diagram: X[h] spanning i..j, built from Y[h] and Z[h'])
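A hedged sketch of per-cell beam pruning (function name and representation are illustrative, not the Collins parser's internals):

    import heapq

    def prune_cell(cell, K):
        """Keep the K highest-scoring (label, score) entries of a chart cell."""
        best = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
        return dict(best)

    # With K hypotheses kept per child cell, combining one span costs at most
    # O(n * K^2): n split points, K hypotheses on each side of the split.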
Parameter Estimation
A Model from Charniak (1997)
Other Details
Final Test Set Results
Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Analysis/Evaluation (Method 2)
Dependency Accuracies
Strengths and Weaknesses of Modern Parsers
Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar
 • Parent annotation [Johnson '98]
 • Head lexicalization [Collins '99, Charniak '00]
 • Automatic clustering?
Manual Splits
• Manually split categories:
 • NP: subject vs. object
 • DT: determiners vs. demonstratives
 • IN: sentential vs. prepositional
• Advantages:
 • Fairly compact grammar
 • Linguistic motivations
• Disadvantages:
 • Performance leveled out
 • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
(tree: X1 at the root, latent subcategory variables X2…X7 over the words "He was right")
• Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside scores)
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
 • Automatically learned
• Disadvantages:
 • Grammar gets too large
 • Most categories are oversplit while others are undersplit

Model F1
Klein & Manning '03 86.3
Matsuzaki et al '05 86.7
Refinement of the DT tag
• DT splits into DT-1, DT-2, DT-3, DT-4
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
 • start with two (per non-terminal), then keep splitting
 • initialize each EM run with the output of the last
Adaptive Splitting [Petrov et al 06]
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split: the ratio of the data likelihood with the split reversed to the data likelihood with the split (see the formula below)
• No loss in accuracy when 50% of the splits are reversed
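As a formula (my rendering of the criterion stated above; splits whose ratio is closest to 1 cost the least likelihood and are merged back first):

    \Delta_s = \frac{P(\text{data} \mid \text{grammar with split } s \text{ reversed})}{P(\text{data} \mid \text{grammar with split } s)}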
Adaptive Splitting Results
Model F1
Previous 88.4
With 50% Merging 89.5
Number of Phrasal Subcategories
Final Results
Parser / F1 (≤ 40 words) / F1 (all words)
Klein & Manning '03 86.3 85.7
Matsuzaki et al '05 86.7 86.1
Collins '99 88.6 88.2
Charniak & Johnson '05 90.1 89.6
Petrov et al 06 90.2 89.7
Hierarchical Pruning
• Parse multiple times, with grammars at different levels of granularity:
 coarse: … QP NP VP …
 split in two: … QP1 QP2 NP1 NP2 VP1 VP2 …
 split in four: … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
 split in eight: …
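A minimal coarse-to-fine sketch of this idea; parse, posterior, and the grammars list are assumed helpers for illustration, not the actual Berkeley parser API:

    def coarse_to_fine(words, grammars, posterior, parse, threshold=1e-4):
        """grammars: list from coarsest to finest.
        posterior(words, grammar, allowed) -> span-to-probability dict;
        parse(words, grammar, allowed) -> best tree over the allowed spans.
        Both callables are assumptions for illustration."""
        allowed = None                    # no pruning at the coarsest level
        for grammar in grammars[:-1]:
            post = posterior(words, grammar, allowed)
            # keep only constituents the coarser grammar finds plausible
            allowed = {span for span, p in post.items() if p >= threshold}
        return parse(words, grammars[-1], allowed)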
Bracket Posteriors
(bracket-posterior plots at each refinement level; reported parsing times: 1621 min, 111 min, 35 min, 15 min [91.2 F1], with no search error)
PPs Modifying Verb Phrases
Complementizers and SBARs
More Verbs
Coordination
Much more remains…
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing

Horizontal Markovization
• Horizontal Markovization merges states
(Charts: parsing F1 (roughly 70–74) and grammar size in symbols (0–12,000) as a function of horizontal Markov order 0, 1, 2v, 2, ∞.)

Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes
  (i.e., parent annotation)
(Charts: Order 1 vs. Order 2 example trees; parsing F1 (roughly 72–79) and grammar size in symbols (0–25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3.)

Model    F1    Size
v=h=2v   77.8  7.5K
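Both Markovizations can be realized while binarizing the treebank rules. A runnable Python sketch follows (the @-symbol naming scheme is an illustrative assumption, not the notation of any particular parser): each intermediate symbol remembers only the last h already-generated siblings, and with v ≥ 2 the parent symbol carries its own parent as an annotation.

def markovize(parent, children, h=2, grand=None, v=1):
    # Binarize parent -> children left to right.  Intermediate symbols keep
    # only the h most recent siblings (horizontal order h); with v >= 2 the
    # parent label is annotated with its grandparent (vertical order 2).
    sym = parent + ("^" + grand if v >= 2 and grand else "")
    if len(children) <= 2:
        return [(sym, tuple(children))]
    rules, left = [], sym
    for i in range(len(children) - 2):
        seen = children[max(0, i + 1 - h):i + 1]   # last h generated siblings
        inter = "@%s[...%s]" % (sym, ",".join(seen))
        rules.append((left, (children[i], inter)))
        left = inter
    rules.append((left, (children[-2], children[-1])))
    return rules

for rule in markovize("VP", ["V", "NP", "PP", "PP"], h=1, grand="S", v=2):
    print(rule)
# ('VP^S', ('V', '@VP^S[...V]'))
# ('@VP^S[...V]', ('NP', '@VP^S[...NP]'))
# ('@VP^S[...NP]', ('PP', 'PP'))

With a low h, intermediate states produced in different rules but ending in the same recent siblings get the same name, which is exactly the state merging the charts above describe.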
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K
Tag Splits
• Problem: Treebank tags are too coarse
  • Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution:
  • Subdivide the IN tag

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

Annotation  F1    Size
UNARY-DT    80.4  8.1K
UNARY-RB    80.5  8.1K
TAG-PA      81.2  8.5K
SPLIT-AUX   81.6  9.0K
SPLIT-CC    81.7  9.1K
SPLIT-%     81.8  9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
(Figure: an NP/VP/PP attachment configuration with sites marked v and -v for whether the intervening material dominates a verb.)

Annotation    F1    Size
Previous      85.7  10.5K
BASE-NP       86.0  11.7K
DOMINATES-V   86.9  14.1K
RIGHT-REC-NP  87.0  15.2K
A Fully Annotated Tree

Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs

Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
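As a flavour of what such head-recovery rules look like, here is a small Python sketch approximating the Collins (1999) rule for NPs. It is a simplified assumption of ours — the real table has more priority classes and per-class search directions — but it shows the shape: an ordered cascade of searches over the children's categories.

def np_head(children):
    # Pick the index of the head child of an NP (simplified Collins-style).
    if children[-1] == "POS":                 # possessive 's heads the NP
        return len(children) - 1
    nominal = ("NN", "NNP", "NNPS", "NNS", "NX", "JJR")
    for i in reversed(range(len(children))):  # search right-to-left for a noun
        if children[i] in nominal:
            return i
    for i in range(len(children)):            # then left-to-right for an NP child
        if children[i] == "NP":
            return i
    return len(children) - 1                  # default: the rightmost child

print(np_head(["DT", "JJ", "NN"]))            # 2: the NN heads "the red ball"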
Lexicalized CKY
(Diagram: X[h] over span (i, j) is built from Y[h] over (i, k) and Z[h'] over (k, j); the parent's head h comes from one of the two children.)
Compare: (VP → VBD[saw] NP[her]) vs. (VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return max of:
      max over k, X→YZ, w of score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, X→YZ, w of score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n⁵) lexicalized CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why? there are O(n) split points, and at each split only K × K child hypothesis pairs to combine)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
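The beam itself is tiny in code. A runnable Python sketch, assuming each chart cell maps (label, head) hypotheses to scores as in the lexicalized chart above:

import heapq

def prune_cell(cell, K=10):
    # Per-cell beam: keep only the K best (label, head) hypotheses for a span.
    if len(cell) <= K:
        return cell
    return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

cell = {("NP", 2): 0.0049, ("VP", 1): 0.105, ("S", 1): 0.0105}
print(prune_cell(cell, K=2))   # keeps ("VP", 1) and ("S", 1)

Combining two beamed cells at each of the O(n) split points then touches at most K × K hypothesis pairs, which is where the O(nK²)-per-span bound comes from.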
Parameter Estimation

A Model from Charniak (1997)

Other Details
Final Test Set Results

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve the statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
(Figure: a parse of "He was right" whose nodes carry latent symbols X1 … X7.)
• Can learn with EM, like Forward-Backward for HMMs: Forward ↔ Outside, Backward ↔ Inside
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit, while others are undersplit

Model                F1
Klein & Manning '03  86.3
Matsuzaki et al '05  86.7
Refinement of the DT tag
(Figure: DT refined into subcategories DT-1, DT-2, DT-3, DT-4.)
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • Start with two (per non-terminal), then keep splitting
  • Initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al 06]
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
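For concreteness, a runnable Python sketch of just the split step of this split-merge loop (the X-0/X-1 naming and the 1% jitter are illustrative assumptions of ours; the EM refit and the likelihood-based merge test work as described above):

import itertools, random

def split(rules, rng=random.Random(0)):
    # rules: {(lhs, rhs): prob}, rhs a tuple of symbols; nonterminals are
    # uppercase, words lowercase.  Each nonterminal X becomes X-0 and X-1,
    # and each rule's mass is spread over the split variants with a little
    # random jitter so EM can break the symmetry (EM then renormalizes).
    out = {}
    for (lhs, rhs), p in rules.items():
        variants = [("%s-0" % s, "%s-1" % s) if s[0].isupper() else (s,)
                    for s in rhs]
        combos = list(itertools.product(*variants))
        for i in (0, 1):
            for combo in combos:
                jitter = 1 + 0.01 * (rng.random() - 0.5)
                out[("%s-%d" % (lhs, i), combo)] = p / len(combos) * jitter
    return out

toy = {("NP", ("DT", "NN")): 0.3, ("DT", ("the",)): 1.0}
print(len(split(toy)))   # 10 rules: 2x4 for NP -> DT NN, 2x1 for DT -> the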
Adaptive Splitting Results

Model             F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories
(Chart: how many subcategories each phrasal category receives after split-merge; categories like NP, VP, and PP receive the most splits.)
Final Results

Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al '05     86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al 06         90.2             89.7
Hierarchical Pruning
coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … … … … … … … … …
• Parse multiple times with grammars at different levels of granularity
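The key coarse-to-fine operation is simple: after each pass, a refined symbol is allowed in a span only if its coarse projection had non-negligible posterior there. A minimal runnable sketch, assuming posteriors are kept in a dict keyed by (i, j, symbol) and refined symbols are named like "NP-3":

def prune_mask(coarse_posteriors, threshold=1e-4):
    # A span/symbol survives only if its coarse posterior clears the threshold.
    return {key: p >= threshold for key, p in coarse_posteriors.items()}

def allowed(mask, i, j, refined):
    base = refined.split("-")[0]      # project e.g. "NP-3" back to "NP"
    return mask.get((i, j, base), False)

mask = prune_mask({(0, 2, "NP"): 0.9, (0, 2, "VP"): 1e-7})
print(allowed(mask, 0, 2, "NP-1"), allowed(mask, 0, 2, "VP-0"))   # True False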
Bracket Posteriors
(Heatmaps of bracket posteriors at each refinement level; parse time drops from 1621 min to 111 min to 35 min to 15 min, at 91.2 F1 with no search error.)
Much more remainshellip
40
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form

• You should think of this as a transformation for efficient parsing.
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a de-transform.
• In practice, full Chomsky Normal Form is a pain:
  • Reconstructing n-aries is easy.
  • Reconstructing unaries/empties is trickier.
• Binarization is crucial for cubic-time CFG parsing.
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker.
An example before binarization:

[Tree: ROOT → S → NP VP, with NP → N over "people" and VP → V NP PP over "fish tanks with rods", where PP → P NP]

After binarization:

[The same tree with VP → V NP PP split via the new VP_V nonterminal: VP → V VP_V, VP_V → NP PP]
Treebank: empties and unaries

[Figure: one PTB tree under increasing normalization. PTB Tree: ROOT → S-HLN → NP-SBJ VP, with an empty (-NONE- e) subject and VB "Atone". NoFuncTags: function tags stripped (ROOT → S → NP VP). NoEmpties: the -NONE- subtree removed (ROOT → S → VP → VB). NoUnaries: unary chains collapsed, either high (ROOT → "Atone") or low (ROOT → VB → "Atone").]
Parsing

Constituency Parsing

Input: "fish people fish tanks". A PCFG with rule probabilities θi:

Rule           Prob
S → NP VP      θ0
NP → NP NP     θ1
…
N → fish       θ42
N → people     θ43
V → fish       θ44
…

[Tree: S → NP VP over "fish people fish tanks", with an NP over "fish people" (N N) and a VP over "fish tanks" (V N)]
Cocke-Kasami-Younger (CKY) Constituency Parsing

Input: "fish people fish tanks". Viterbi (max) scores, e.g. for the one-word spans over "people" and "fish":
• "people": NP 0.35, V 0.1, N 0.5
• "fish": VP 0.06, NP 0.14, V 0.6, N 0.2

The grammar:

S → NP VP       0.9
S → VP          0.1
VP → V NP       0.5
VP → V          0.1
VP → V VP_V     0.3
VP → V PP       0.1
VP_V → NP PP    1.0
NP → NP NP      0.1
NP → NP PP      0.2
NP → N          0.7
PP → P NP       1.0
Extended CKY parsing

• Unaries can be incorporated into the algorithm: messy, but doesn't increase algorithmic complexity.
• Empties can be incorporated: use fenceposts; doesn't increase complexity, essentially like unaries.
• Binarization is vital: without binarization you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar.
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there.
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k and rules X → Y Z of
           q(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
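As written, the recursion re-solves the same subproblems exponentially often. A minimal memoized sketch (illustrative names; `lex` and `bin_rules` are assumed rule-probability dicts) makes it equivalent to the bottom-up CKY recurrence:

    # Illustrative sketch: the recursive parser above with memoization.
    # lex[(X, w)] and bin_rules[(X, Y, Z)] hold rule probabilities.
    from functools import lru_cache

    def make_parser(lex, bin_rules, words):
        @lru_cache(maxsize=None)
        def best_score(X, i, j):
            if i == j:                           # one-word span: lexical rule
                return lex.get((X, words[i]), 0.0)
            best = 0.0
            for k in range(i, j):                # split point
                for (A, Y, Z), p in bin_rules.items():
                    if A == X:
                        best = max(best, p * best_score(Y, i, k)
                                           * best_score(Z, k + 1, j))
            return best
        return best_score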
The CKY algorithm (1960/1965), extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  // lexicon
  for i = 0 to #(words)-1
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A]  = B
            added = true
  // binary rules over longer spans
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = B
            added = true
  return buildTree(score, back)
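A compact runnable sketch of the same algorithm (dict-based and unoptimized; the grammar encoding is an assumption), using the toy grammar from the walkthrough below:

    # Runnable sketch of probabilistic CKY with unary closure.
    def cky(words, lex, unary, binary):
        n = len(words)
        score = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

        def close_unaries(cell):
            added = True
            while added:                 # propagate A -> B chains to a fixpoint
                added = False
                for (A, B), p in unary.items():
                    if B in cell and p * cell[B] > cell.get(A, 0.0):
                        cell[A] = p * cell[B]
                        added = True

        for i, w in enumerate(words):    # lexicon + unaries
            for (A, word), p in lex.items():
                if word == w:
                    score[i][i + 1][A] = p
            close_unaries(score[i][i + 1])

        for span in range(2, n + 1):     # binary rules + unaries
            for begin in range(0, n - span + 1):
                end = begin + span
                cell = score[begin][end]
                for split in range(begin + 1, end):
                    left, right = score[begin][split], score[split][end]
                    for (A, B, C), p in binary.items():
                        if B in left and C in right:
                            prob = left[B] * right[C] * p
                            if prob > cell.get(A, 0.0):
                                cell[A] = prob
                close_unaries(cell)
        return score

    lex = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}
    unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
    binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
              ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
              ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
              ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}

    chart = cky("fish people fish tanks".split(), lex, unary, binary)
    print(chart[0][4]["S"])   # ≈ 0.00018522, matching the walkthrough below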
The grammar: binary, no epsilons

S → NP VP       0.9      N → people   0.5
S → VP          0.1      N → fish     0.2
VP → V NP       0.5      N → tanks    0.2
VP → V          0.1      N → rods     0.1
VP → V VP_V     0.3      V → people   0.1
VP → V PP       0.1      V → fish     0.6
VP_V → NP PP    1.0      V → tanks    0.3
NP → NP NP      0.1      P → with     1.0
NP → NP PP      0.2
NP → N          0.7
PP → P NP       1.0
CKY walkthrough on "fish people fish tanks" (positions 0..4; cell score[i][j] covers words i..j-1):

1. Lexicon: for each word, set score[i][i+1][A] = P(A → words[i]):

   [0,1] fish:   N 0.2, V 0.6
   [1,2] people: N 0.5, V 0.1
   [2,3] fish:   N 0.2, V 0.6
   [3,4] tanks:  N 0.2, V 0.3

2. Unary closure of the lexical cells (repeatedly apply A → B while scores improve):

   [0,1] fish:   N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
   [1,2] people: N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001
   [2,3] fish:   N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
   [3,4] tanks:  N 0.2, V 0.3, NP 0.14, VP 0.03, S 0.003

3. Span-2 cells, binary rules (prob = score[begin][split][B] × score[split][end][C] × P(A → B C)):

   [0,2] "fish people": NP (NP NP) 0.0049,  VP (V NP) 0.105, S (NP VP) 0.00126
   [1,3] "people fish": NP (NP NP) 0.0049,  VP (V NP) 0.007, S (NP VP) 0.0189
   [2,4] "fish tanks":  NP (NP NP) 0.00196, VP (V NP) 0.042, S (NP VP) 0.00378

   then unary closure (the unary S → VP wins in two cells):

   [0,2]: S (VP) 0.0105
   [1,3]: S (NP VP) 0.0189, unchanged
   [2,4]: S (VP) 0.0042

4. Span-3 cells, maximizing over split points:

   [0,3] "fish people fish":  NP 0.0000686, VP 0.00147,  S 0.000882
   [1,4] "people fish tanks": NP 0.0000686, VP 0.000098, S 0.01323

5. Span-4 cell:

   [0,4] "fish people fish tanks": NP 0.0000009604, VP 0.00002058, S 0.00018522

6. Call buildTree(score, back) to get the best parse.
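A sketch of what buildTree does, reading the parse back from back pointers (the back-table format here is an assumption: a (split, B, C) triple for binary steps, a bare category for unary steps, absent for lexical entries):

    # Illustrative sketch: recover the best tree from back pointers.
    def build_tree(back, words, A, i, j):
        entry = back[i][j].get(A)
        if entry is None:                    # lexical rule A -> words[i]
            return (A, words[i])
        if isinstance(entry, str):           # unary rule A -> B
            return (A, build_tree(back, words, entry, i, j))
        split, B, C = entry                  # binary rule A -> B C
        return (A, build_tree(back, words, B, i, split),
                   build_tree(back, words, C, split, j))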
Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets:     S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall:    3/8 = 37.5%
LP/LR F1:          40.0%
Tagging Accuracy:  11/11 = 100.0%
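A minimal sketch of the metric computation on exactly this example, treating brackets as (label, start, end) triples:

    # Labeled-bracket precision/recall/F1; set intersection counts the
    # matching brackets (3 here).
    gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
            ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
    cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
            ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}

    matched = len(gold & cand)                         # 3
    precision = matched / len(cand)                    # 3/7 ≈ 42.9%
    recall = matched / len(gold)                       # 3/8 = 37.5%
    f1 = 2 * precision * recall / (precision + recall) # = 40.0%
    print(f"P={precision:.1%} R={recall:.1%} F1={f1:.1%}")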
How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1.
• Robust: they usually admit everything, but with low probability.
• A partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse, but not a very good one, because the independence assumptions are too strong.
• They give a probabilistic language model, but in this simple form it performs worse than a trigram model.
• The problem seems to be that PCFGs lack the lexicalization of a trigram model.
Weaknesses of PCFGs

• Lack of sensitivity to structural frequencies.
• Lack of sensitivity to lexical information (a word is independent of the rest of the tree given its POS).

A Case of PP Attachment Ambiguity

[Figure: two parses of a PP-attachment ambiguity, one attaching the PP to the verb and one to the noun.]

A Case of Coordination Ambiguity

[Figure: two parses of a coordination ambiguity built from the same rule types, and hence given the same PCFG probability.]

Structural Preferences: Close Attachment

[Figure: two attachment structures illustrating the close-attachment preference.]

• Example: "John was believed to have been shot by Bill."
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label.

[Figure: a tree with an NP node under S, illustrating that S → NP VP and NP → DT NN are expanded independently.]
Non-Independence I

• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

  Expansion    All NPs    NPs under S    NPs under VP
  NP PP        11%        9%             23%
  DT NN        9%         9%             7%
  PRP          6%         21%            4%
Non-Independence II

• Symptoms of overly strong assumptions: rewrites get used where they don't belong.

[Figure: an example rewrite; in the PTB this construction is reserved for possessives.]
Advanced Unlexicalized Parsing

Horizontal Markovization

• Horizontal Markovization merges states: a binarized rewrite depends on only a bounded amount of the sibling context generated so far.

[Figure: two plots over horizontal Markov order (0, 1, 2v, 2, ∞): parsing F1 (roughly 70-74) and grammar size in symbols (roughly 3,000-12,000), showing that low orders shrink the grammar with little accuracy loss.]
Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation).

[Figure: Order 1 vs. Order 2 trees, and two plots over vertical Markov order (1, 2v, 2, 3v, 3): parsing F1 (roughly 72-79) and grammar size in symbols (up to about 25,000), showing accuracy gains at the cost of many more symbols.]

Model     F1     Size
v=h=2v    77.8   7.5K
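Parent annotation (vertical order 2) is a simple tree transform. A sketch on the (label, child, ...) tree format used earlier (the ^-separator naming is an illustrative convention):

    # Illustrative sketch: parent annotation (vertical Markov order 2).
    # Each nonterminal is re-labeled with its parent's label, e.g. the NP
    # in (S (NP ...) (VP ...)) becomes NP^S. String leaves are unchanged.
    def parent_annotate(tree, parent=None):
        label, children = tree[0], tree[1:]
        new_label = f"{label}^{parent}" if parent else label
        new_children = tuple(c if isinstance(c, str) else parent_annotate(c, label)
                             for c in children)
        return (new_label,) + new_children

    t = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
    print(parent_annotate(t))
    # ('S', ('NP^S', ('DT^NP', 'the'), ('NN^NP', 'man')), ('VP^S', ('Vi^VP', 'sleeps')))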
Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation    F1     Size
Base          77.8   7.5K
UNARY         78.3   8.0K
Tag Splits

• Problem: treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation    F1     Size
Previous      78.3   8.0K
SPLIT-IN      80.3   8.1K
Other Tag Splits

Annotation                                                          F1     Size
UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")         80.4   8.1K
UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")       80.5   8.1K
TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)    81.2   8.5K
SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]         81.6   9.0K
SPLIT-CC: separate "but" and "&" from other conjunctions            81.7   9.1K
SPLIT-%: "%" gets its own tag                                       81.8   9.3K
Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples: possessive NPs; finite vs. infinite VPs; lexical heads.
• Solution: annotate future elements into nodes.

Annotation    F1     Size
tag splits    82.3   9.7K
POSS-NP       83.1   9.8K
SPLIT-VP      85.7   10.5K
Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights.
• Solution: mark a property of higher or lower attachment sites: contains a verb; is (non-)recursive; base NPs [cf. Collins 99]; right-recursive NPs.

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

[Figure: an NP-VP-PP attachment configuration marked v / -v for whether the intervening material dominates a verb.]

A Fully Annotated Tree

[Figure: a treebank tree after all of the above annotations.]
Final Test Set Results

• Beats "first generation" lexicalized parsers.

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example

[These slides consist of head-percolation tables and head-annotated example trees, not reproduced here.]
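Head-recovery rules pick one child of each rule as its head. A minimal sketch of a Collins-style head finder; the priority lists below are illustrative toy examples, not the full Collins (1999) head tables:

    # Illustrative sketch of head-finding: for each parent category, scan
    # the children against a priority list, left-to-right or right-to-left.
    HEAD_RULES = {
        "NP": ("right", ["NN", "NNS", "NNP", "NP", "JJ"]),
        "VP": ("left",  ["VBD", "VBZ", "VB", "VBN", "VP"]),
        "PP": ("left",  ["IN", "TO"]),
        "S":  ("left",  ["VP", "S"]),
    }

    def find_head(parent, children):
        direction, priorities = HEAD_RULES.get(parent, ("left", []))
        order = (range(len(children)) if direction == "left"
                 else range(len(children) - 1, -1, -1))
        for cat in priorities:            # first priority category wins
            for i in order:
                if children[i] == cat:
                    return i
        return 0 if direction == "left" else len(children) - 1

    print(find_head("VP", ["VBD", "NP", "PP"]))   # 0: the VBD heads the VP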
Lexicalized CKY

Cells now carry a head word as well as a category: combining Y[h] over (i, k) with Z[h'] over (k+1, j) yields X[h] over (i, j), e.g. (VP → VBD[saw] NP[her]) instantiating (VP → VBD NP)[saw].

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return max of
      max over k, X → Y Z, head word w of  score(X[h] → Y[h] Z[w])
                                           * bestScore(Y, i, k, h)
                                           * bestScore(Z, k+1, j, w)
      max over k, X → Y Z, head word w of  score(X[h] → Y[w] Z[h])
                                           * bestScore(Y, i, k, w)
                                           * bestScore(Z, k+1, j, h)
Parsing with Lexicalized CFGs

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially run the O(n⁵) CKY.
  • Remember only a few hypotheses for each span <i, j>.
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?), which keeps things more or less cubic.
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
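A minimal sketch of per-cell beam pruning (illustrative: a chart cell is assumed to map (category, head) hypotheses to scores):

    # Illustrative sketch: keep only the top-K hypotheses in a chart
    # cell, so combining two cells touches at most K*K hypothesis pairs.
    import heapq

    def prune_cell(cell, K):
        if len(cell) <= K:
            return cell
        return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

    cell = {("NP", "man"): 0.03, ("NP", "telescope"): 0.001, ("VP", "saw"): 0.02}
    print(prune_cell(cell, 2))   # keeps the two highest-scoring hypotheses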
Parameter Estimation

A Model from Charniak (1997)

Other Details

[These slides give the generative model equations and estimation details, not reproduced here.]

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar

• Annotation refines base treebank symbols to improve the statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages: fairly compact grammar; linguistic motivations.
• Disadvantages: performance leveled out; manually annotated.
Learning Latent Annotations

• Latent annotations: brackets are known, base categories are known, but there are hidden variables for subcategories.
• Label all nodes with latent variables, with the same number k of subcategories for all categories.
• Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside scores over the tree).

[Figure: a tree over "He was right" with latent symbols X1…X7 at its nodes.]

Automatic Annotation Induction

• Advantages: automatically learned.
• Disadvantages: the grammar gets too large; most categories are oversplit while others are undersplit.

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7
Refinement of the DT tag

[Figure: the DT tag hierarchically refined into subcategories DT-1 … DT-4.]

• Hierarchical refinement: repeatedly learn more fine-grained subcategories. Start with two (per non-terminal), then keep splitting, initializing each EM run with the output of the last.
Adaptive Splitting [Petrov et al 06]

• Want to split complex categories more.
• Idea: split everything, then roll back the splits which were least useful.
• Evaluate the loss in likelihood from removing each split: the ratio of the data likelihood with the split reversed to the data likelihood with the split.
• There is no loss in accuracy when 50% of the splits are reversed.

Adaptive Splitting Results

Model               F1
Previous            88.4
With 50% Merging    89.5
Number of Phrasal Subcategories

[Figure: bar chart of learned subcategory counts per phrasal category.]

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al '05      86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al 06          90.2              89.7
Hierarchical Pruning

• Parse multiple times, with grammars at different levels of granularity, each pass pruning the chart for the next, finer grammar (see the sketch after this list):
  coarse:         … QP NP VP …
  split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight: …

Bracket Posteriors

[Figure: bracket posterior charts as pruning proceeds; total parsing time drops from 1621 min to 111 min, 35 min, and finally 15 min at 91.2 F1 with no search error.]
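A sketch of the coarse-to-fine idea: a fine symbol is allowed in a cell only if its coarse projection scored well in the previous pass. The projection map and threshold here are illustrative assumptions:

    # Illustrative sketch of coarse-to-fine pruning: a fine symbol such
    # as "NP-3" projects to its coarse symbol "NP"; keep it in a cell
    # only if the coarse pass gave "NP" a posterior above a threshold.
    def allowed_fine_symbols(fine_symbols, coarse_posteriors, threshold=1e-4):
        return {s for s in fine_symbols
                if coarse_posteriors.get(s.split("-")[0], 0.0) > threshold}

    cell_posteriors = {"NP": 0.3, "QP": 1e-6}
    print(allowed_fine_symbols({"NP-1", "NP-2", "QP-1"}, cell_posteriors))
    # NP-1 and NP-2 survive; QP-1 is pruned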
Parsing: Two problems to solve

1. Repeated work…
2. Choosing the correct parse:
   • How do we work out the correct attachment for "She saw the man with a telescope"?
   • Is the problem 'AI complete'? Yes, but…
   • Words are good predictors of attachment, even absent full understanding:
     • "Moscow sent more than 100,000 soldiers into Afghanistan…"
     • "Sydney Water breached an agreement with NSW Health…"
   • Our statistical parsers will try to exploit such statistics.
Probabilistic Context Free Grammar

Probabilistic (or stochastic) context-free grammars (PCFGs)

• G = (Σ, N, S, R, P), where:
  • Σ is a set of terminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • R is a set of rules/productions of the form X → γ
  • P is a probability function P: R → [0, 1]
  • for each X ∈ N, the probabilities of its rules sum to one: Σ_{X → γ ∈ R} P(X → γ) = 1
• A grammar G generates a language model L: Σ_{γ ∈ Σ*} P(γ) = 1.
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Parsing Two problems to solve1 Repeated workhellip
Parsing Two problems to solve2 Choosing the correct parse
bull How do we work out the correct attachment
bull She saw the man with a telescope
bull Is the problem lsquoAI completersquo Yes but hellip
bull Words are good predictors of attachmentbull Even absent full understanding
bull Moscow sent more than 100000 soldiers into Afghanistan hellip
bull Sydney Water breached an agreement with NSW Health hellip
bull Our statistical parsers will try to exploit such statistics
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation); Order 1 vs. Order 2 — a tree-transform sketch follows
(Charts: F1 roughly 72–79 and grammar size 5,000–25,000 symbols as a function of vertical Markov order 1, 2v, 2, 3v, 3)

Model    F1    Size
v=h=2v   77.8  7.5K
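A sketch of both markovizations as a single tree transform, on the nested-tuple trees used above; the @-symbol label format is our own convention, loosely in the spirit of Klein & Manning '03:

def annotate(node, ancestors, v=2, h=2):
    label, children = node[0], node[1:]
    if isinstance(children[0], str):                # preterminal (tag, word)
        return node
    vlabel = "^".join([label] + ancestors[:v - 1])  # v=2 appends the parent
    kids = [annotate(c, [label] + ancestors, v, h) for c in children]
    return binarize(vlabel, kids, h)

def binarize(label, kids, h):
    # Right-branching binarization; intermediate symbols remember at most
    # the last h completed sibling labels (horizontal markovization).
    if len(kids) <= 2:
        return tuple([label] + kids)
    seen = [kids[0][0]]
    inter = f"@{label}->{'_'.join(seen[-h:])}"
    return (label, kids[0], binarize_rest(label, inter, kids[1:], seen, h))

def binarize_rest(base, inter, kids, seen, h):
    if len(kids) == 2:
        return tuple([inter] + kids)
    seen = seen + [kids[0][0]]
    nxt = f"@{base}->{'_'.join(seen[-h:])}"
    return (inter, kids[0], binarize_rest(base, nxt, kids[1:], seen, h))

tree = ("S",
        ("NP", ("DT", "the"), ("NN", "boy")),
        ("VP", ("VBD", "put"),
               ("NP", ("DT", "the"), ("NN", "tortoise")),
               ("PP", ("IN", "on"), ("NP", ("DT", "the"), ("NN", "rug")))))
print(annotate(tree, [], v=2, h=2))
# The VP becomes (VP^S, (VBD put), (@VP^S->VBD, NP^VP..., PP^VP...));
# reading the rules off this transformed tree gives the markovized grammar.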
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K
Tag Splits
• Problem: treebank tags are too coarse
  • Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

Annotation  F1    Size
UNARY-DT    80.4  8.1K
UNARY-RB    80.5  8.1K
TAG-PA      81.2  8.5K
SPLIT-AUX   81.6  9.0K
SPLIT-CC    81.7  9.1K
SPLIT-%     81.8  9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinitival VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation    F1    Size
Previous      85.7  10.5K
BASE-NP       86.0  11.7K
DOMINATES-V   86.9  14.1K
RIGHT-REC-NP  87.0  15.2K

(Illustration: NP and VP attachment sites in a tree marked v / -v according to whether they dominate a verb)
A Fully Annotated Tree

Final Test Set Results
• Beats "first generation" lexicalized parsers:

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY
(Chart item: X[h] over [i,j] is built from Y[h] over [i,k] and Z[h'] over [k,j]; e.g. (VP -> VBD[saw] NP[her]) yields (VP -> VBD NP)[saw])

bestScore(X,i,j,h)
  if (j == i)
    return score(X -> s[i])
  else
    return max over split k and rule X -> Y Z of:
      max_w  score(X[h] -> Y[h] Z[w]) * bestScore(Y,i,k,h) * bestScore(Z,k,j,w)
      max_w  score(X[h] -> Y[w] Z[h]) * bestScore(Y,i,k,w) * bestScore(Z,k,j,h)
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why? — see the sketch below)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
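A minimal sketch of the per-cell beam (plain symbol -> score dicts here; a lexicalized parser would key hypotheses on (symbol, head) pairs):

def prune_cell(cell, K=5):
    # Keep only the K highest-scoring hypotheses in a chart cell.
    if len(cell) <= K:
        return cell
    return dict(sorted(cell.items(), key=lambda kv: kv[1], reverse=True)[:K])

cell = {"NP": 0.0049, "VP": 0.105, "S": 0.0105, "VP_V": 0.0001}
print(prune_cell(cell, K=2))      # {'VP': 0.105, 'S': 0.0105}

The O(nK^2) bound falls out directly: each span pairs at most K left hypotheses with K right hypotheses at each of the O(n) split points.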
Parameter Estimation

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis/Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
(Illustration: a tree over "He was right" with latent subcategory variables X1 ... X7 at its nodes)
• Can learn with EM, just like Forward-Backward for HMMs (Forward/Outside and Backward/Inside passes)
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7
Refinement of the DT tag
(Illustration: DT split into subcategories DT-1, DT-2, DT-3, DT-4)
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • start with two (per non-terminal), then keep splitting
  • initialize each EM run with the output of the last
Adaptive Splitting [Petrov et al 06]
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split:
  loss(split) = (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model             F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories

Final Results

Parser                  F1 (≤40 words)  F1 (all words)
Klein & Manning '03     86.3            85.7
Matsuzaki et al '05     86.7            86.1
Collins '99             88.6            88.2
Charniak & Johnson '05  90.1            89.6
Petrov et al 06         90.2            89.7
Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity (a pruning sketch follows):
  coarse:         ... QP NP VP ...
  split in two:   ... QP1 QP2 NP1 NP2 VP1 VP2 ...
  split in four:  ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
  split in eight: ... (and so on)
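A hedged sketch of one coarse-to-fine step: a fine symbol is allowed over a span only if its coarse projection has non-negligible posterior there. The posteriors would come from inside-outside with the coarse grammar; the numbers below are invented for illustration:

projection = {"NP1": "NP", "NP2": "NP", "VP1": "VP", "VP2": "VP",
              "QP1": "QP", "QP2": "QP"}               # fine -> coarse
coarse_posterior = {(0, 2): {"NP": 0.91, "VP": 1e-7, "QP": 1e-9}}

def allowed_fine(span, threshold=1e-4):
    # Return the fine symbols whose coarse projection survives the threshold.
    post = coarse_posterior.get(span, {})
    return {f for f, c in projection.items() if post.get(c, 0.0) >= threshold}

print(allowed_fine((0, 2)))       # {'NP1', 'NP2'}: the VP*/QP* cells are never built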
Bracket Posteriors
• Coarse-to-fine parse times: 1621 min -> 111 min -> 35 min -> 15 min [91.2 F1] (no search error)
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Probabilistic Context Free Grammar
45
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)
bull G = (Σ N S R P)bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S isin N)
bull R is a set of rulesproductions of the form X
bull P is a probability function
bull P R [01]
bull
bull A grammar G generates a language model L
P(g) =1g IcircT
aring
PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
• Unaries can be incorporated into the algorithm:
  • Messy, but doesn't increase algorithmic complexity.
• Empties can be incorporated:
  • Use fenceposts; doesn't increase complexity, essentially like unaries.
• Binarization is vital:
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar.
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there.
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k, X → Y Z of:
      q(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
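Memoizing bestScore turns this exponential recursion into the cubic CKY computation. A sketch in Python, under the simplifying assumption of a purely binary grammar; the toy tables q_lex and q_bin are ours, with the NP → N unary of the running example folded into NP's lexical probabilities (e.g. NP → fish = 0.7 × 0.2 = 0.14):

from functools import lru_cache

q_lex = {("NP", "fish"): 0.14, ("NP", "people"): 0.35,
         ("NP", "tanks"): 0.14, ("V", "fish"): 0.6}
q_bin = {("S", ("NP", "VP")): 0.9, ("VP", ("V", "NP")): 0.5,
         ("NP", ("NP", "NP")): 0.1}

def parse(words):
    @lru_cache(maxsize=None)
    def best_score(X, i, j):
        # Probability of the best X-rooted tree over words[i..j] inclusive.
        if j == i:
            return q_lex.get((X, words[i]), 0.0)
        return max((q * best_score(Y, i, k) * best_score(Z, k + 1, j)
                    for (A, (Y, Z)), q in q_bin.items() if A == X
                    for k in range(i, j)), default=0.0)
    return best_score("S", 0, len(words) - 1)

print(parse(tuple("fish people fish tanks".split())))  # 0.00018522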
function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A → words[i] in grammar
        score[i][i+1][A] = P(A → words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A → B in grammar
          prob = P(A → B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965)… extended to unaries
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A → B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965)… extended to unaries
The grammar: binary, no epsilons
S → NP VP      0.9
S → VP         0.1
VP → V NP      0.5
VP → V         0.1
VP → V VP_V    0.3
VP → V PP      0.1
VP_V → NP PP   1.0
NP → NP NP     0.1
NP → NP PP     0.2
NP → N         0.7
PP → P NP      1.0
N → people     0.5
N → fish       0.2
N → tanks      0.2
N → rods       0.1
V → people     0.1
V → fish       0.6
V → tanks      0.3
P → with       1.0
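Before stepping through the chart, here is a runnable sketch of the full algorithm over exactly this grammar. Dictionaries stand in for the dense arrays of the pseudocode; the rule tables and the helper names apply_unaries and build_tree are our own:

from collections import defaultdict

binary = {("S", ("NP", "VP")): 0.9, ("VP", ("V", "NP")): 0.5,
          ("VP", ("V", "VP_V")): 0.3, ("VP", ("V", "PP")): 0.1,
          ("VP_V", ("NP", "PP")): 1.0, ("NP", ("NP", "NP")): 0.1,
          ("NP", ("NP", "PP")): 0.2, ("PP", ("P", "NP")): 1.0}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexical = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def cky(words):
    n = len(words)
    score = defaultdict(float)   # (begin, end, A) -> best probability
    back = {}                    # (begin, end, A) -> backpointer

    def apply_unaries(b, e):
        added = True
        while added:             # repeat to closure, as in the pseudocode
            added = False
            for (A, B), p in unary.items():
                prob = p * score[b, e, B]
                if prob > score[b, e, A]:
                    score[b, e, A] = prob
                    back[b, e, A] = B
                    added = True

    for i, w in enumerate(words):                  # lexical fill
        for (A, word), p in lexical.items():
            if word == w:
                score[i, i + 1, A] = p
        apply_unaries(i, i + 1)

    for span in range(2, n + 1):                   # binary fill
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, (B, C)), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
            apply_unaries(begin, end)
    return score, back

def build_tree(words, back, begin, end, A):
    bp = back.get((begin, end, A))
    if bp is None:                       # lexical cell
        return (A, words[begin])
    if isinstance(bp, str):              # unary backpointer
        return (A, build_tree(words, back, begin, end, bp))
    split, B, C = bp                     # binary backpointer
    return (A, build_tree(words, back, begin, split, B),
               build_tree(words, back, split, end, C))

words = "fish people fish tanks".split()
score, back = cky(words)
print(score[0, len(words), "S"])                  # ~0.00018522
print(build_tree(words, back, 0, len(words), "S"))

Its chart values match the walkthrough that follows.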
The CKY chart for "fish people fish tanks" (fenceposts 0–4); cell score[i][j] covers words i..j:
span 1: score[0][1]  score[1][2]  score[2][3]  score[3][4]
span 2: score[0][2]  score[1][3]  score[2][4]
span 3: score[0][3]  score[1][4]
span 4: score[0][4]
Filling in the lexical scores:
for i = 0; i < #(words); i++
  for A in nonterms
    if A → words[i] in grammar
      score[i][i+1][A] = P(A → words[i])

score[0][1] (fish):   N → fish 0.2,   V → fish 0.6
score[1][2] (people): N → people 0.5, V → people 0.1
score[2][3] (fish):   N → fish 0.2,   V → fish 0.6
score[3][4] (tanks):  N → tanks 0.2,  V → tanks 0.3
Handling unaries within each width-1 cell:
// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A → B in grammar
      prob = P(A → B) * score[i][i+1][B]
      if prob > score[i][i+1][A]
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true

score[0][1] (fish):   N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
score[1][2] (people): N 0.5, V 0.1, NP → N 0.35, VP → V 0.01, S → VP 0.001
score[2][3] (fish):   N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
score[3][4] (tanks):  N 0.2, V 0.3, NP → N 0.14, VP → V 0.03, S → VP 0.003
Binary rules over the span-2 cells:
prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
if prob > score[begin][end][A]
  score[begin][end][A] = prob
  back[begin][end][A] = new Triple(split, B, C)

score[0][2]: NP → NP NP 0.0049,   VP → V NP 0.105,  S → NP VP 0.00126
score[1][3]: NP → NP NP 0.0049,   VP → V NP 0.007,  S → NP VP 0.0189
score[2][4]: NP → NP NP 0.00196,  VP → V NP 0.042,  S → NP VP 0.00378
Unaries over the span-2 cells (S → VP overtakes S → NP VP where it scores higher):
score[0][2]: NP 0.0049,   VP 0.105,  S → VP 0.0105
score[1][3]: NP 0.0049,   VP 0.007,  S → NP VP 0.0189
score[2][4]: NP 0.00196,  VP 0.042,  S → VP 0.0042
Span-3 cells (binaries, then unaries):
score[0][3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
score[1][4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323
The full span:
score[0][4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522
Call buildTree(score, back) to get the best parse.
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
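These metrics are easy to compute from bracket sets. A small sketch that reproduces the numbers above; the (label, start, end) triple encoding and the name parseval are ours:

def parseval(gold, cand):
    # Labeled-bracket PARSEVAL over (label, start, end) triples.
    matched = len(set(gold) & set(cand))
    lp = matched / len(cand)            # labeled precision
    lr = matched / len(gold)            # labeled recall
    f1 = 2 * lp * lr / (lp + lr)
    return lp, lr, f1

gold = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)]
cand = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)]
lp, lr, f1 = parseval(gold, cand)
print(f"LP {lp:.1%}  LR {lr:.1%}  F1 {f1:.1%}")   # LP 42.9%  LR 37.5%  F1 40.0%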
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1.
• Robust:
  • Usually admit everything, but with low probability.
• Partial solution for grammar ambiguity:
  • A PCFG gives some idea of the plausibility of a parse.
  • But not so good, because the independence assumptions are too strong.
• Give a probabilistic language model:
  • But in the simple case it performs worse than a trigram model.
• The problem seems to be that PCFGs lack the lexicalization of a trigram model.
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies.
• Lack of sensitivity to lexical information.
  • (A word is independent of the rest of the tree given its POS.)
A Case of PP Attachment Ambiguity
[figure: example parse trees]

A Case of Coordination Ambiguity
[figure: example parse trees]

Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill.
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label.
[figure: an S node expanding as S → NP VP, with NP → DT NN below]
Non-Independence I
• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
[chart: NP expansion frequencies]
               NP PP   DT NN   PRP
All NPs:        11%      9%     6%
NPs under S:     9%      9%    21%
NPs under VP:   23%      7%     4%
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong.
[figure: in the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states.
[charts: parsing accuracy (F1 roughly 70–74) and grammar size (symbols, 0–12,000) as a function of horizontal Markov order 0, 1, 2v, 2, ∞]
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation). Order 1 vs. Order 2.
[charts: parsing accuracy (F1 roughly 72–79) and grammar size (symbols, 0–25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model     F1     Size
v=h=2v    77.8   7.5K
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
Tag Splits
• Problem: Treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
Other Tag Splits                                                        F1     Size
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")           80.4   8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")         80.5   8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)      81.2   8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]           81.6   9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions              81.7   9.1K
• SPLIT-%: "%" gets its own tag                                         81.8   9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes.

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights.
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
[figure: NP vs. VP attachment sites marked v / -v]

Annotation     F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers.

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY
[figure: Y[h] over (i, k) and Z[h'] over (k, j) combine into X[h] over (i, j); e.g. the rule (VP → VBD NP)[saw] covers VBD[saw] and NP[her]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return max over X → Y Z of:
      max over k, h':  score(X[h] → Y[h] Z[h']) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h')
      max over k, h':  score(X[h] → Y[h'] Z[h]) * bestScore(Y, i, k, h') * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n⁵) CKY.
  • Remember only a few hypotheses for each span ⟨i, j⟩.
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?).
  • Keeps things more or less cubic.
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
Parameter Estimation

A Model from Charniak (1997)

Other Details

Final Test Set Results
Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Analysis/Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[figure: tree with latent symbols X1…X7 over "He was right"]
• Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories.
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7
Refinement of the DT tag
[figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]
• Hierarchical refinement: repeatedly learn more fine-grained subcategories:
  • start with two (per non-terminal), then keep splitting
  • initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more.
• Idea: split everything, roll back the splits which were least useful. [Petrov et al. 06]
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed.

Adaptive Splitting Results
Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results
Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. 06          90.2              89.7
Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity:
  coarse:         … QP NP VP …
  split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight: … (and so on)
Bracket Posteriors
[figure: bracket posteriors at each pruning stage; cumulative parsing time drops 1621 min → 111 min → 35 min → 15 min at 91.2 F1 (no search error)]
PCFG Example: A Probabilistic Context-Free Grammar (PCFG)
S → NP VP        1.0
VP → Vi          0.4
VP → Vt NP       0.4
VP → VP PP       0.2
NP → DT NN       0.3
NP → NP PP       0.7
PP → P NP        1.0
Vi → sleeps      1.0
Vt → saw         1.0
NN → man         0.7
NN → woman       0.2
NN → telescope   0.1
DT → the         1.0
IN → with        0.5
IN → in          0.5

• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is
  p(t) = ∏_{i=1..n} q(αi → βi)
  where q(α → β) is the probability for rule α → β.
Probability of a Parse (using the PCFG above)
t1 = parse of "The man sleeps":
(S (NP (DT The) (NN man)) (VP (Vi sleeps)))
p(t1) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084

t2 = parse of "The man saw the woman with the telescope":
(S (NP (DT The) (NN man))
   (VP (VP (Vt saw) (NP (DT the) (NN woman)))
       (PP (IN with) (NP (DT the) (NN telescope)))))
p(t2) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1 = 0.00001512
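Computing such products mechanically from a tree is a few lines. A sketch; the tuple tree encoding and the name tree_prob are our own:

def tree_prob(t, q):
    # p(t) = product of q(rule) over the rules used in tuple-encoded tree t.
    label, *children = t
    if len(children) == 1 and isinstance(children[0], str):
        return q[(label, children[0])]            # lexical rule
    p = q[(label, tuple(c[0] for c in children))] # internal rule
    for c in children:
        p *= tree_prob(c, q)
    return p

q = {("S", ("NP", "VP")): 1.0, ("VP", ("Vi",)): 0.4, ("NP", ("DT", "NN")): 0.3,
     ("DT", "the"): 1.0, ("NN", "man"): 0.7, ("Vi", "sleeps"): 1.0}
t1 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
print(tree_prob(t1, q))   # 1.0*0.3*1.0*0.7*0.4*1.0 = 0.084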
PCFGs: Learning and Inference
• Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1..n} q(αi → βi).
• Learning: read the rules off of labeled sentences, use ML estimates for probabilities, and use all of our standard smoothing tricks.
• Inference: for input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s).
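The learning step is just relative-frequency counting over treebank trees. A minimal sketch of the ML estimates q(α → β) = count(α → β) / count(α), without the smoothing the slide calls for; estimate_pcfg and the tuple tree encoding are ours:

from collections import Counter

def estimate_pcfg(trees):
    rule_count, lhs_count = Counter(), Counter()
    def collect(t):
        label, *children = t
        if len(children) == 1 and isinstance(children[0], str):
            rule = (label, children[0])           # lexical rule
        else:
            rule = (label, tuple(c[0] for c in children))
            for c in children:
                collect(c)
        rule_count[rule] += 1
        lhs_count[label] += 1
    for t in trees:
        collect(t)
    return {rule: n / lhs_count[rule[0]] for rule, n in rule_count.items()}

t1 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
print(estimate_pcfg([t1])[("NP", ("DT", "NN"))])  # 1.0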
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Example of a PCFG
48
Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
A Probabilistic Context-Free Grammar (PCFG)
S rArr NP VP 10
VP rArr Vi 04
VP rArr Vt NP 04
VP rArr VP PP 02
NP rArr DT NN 03
NP rArr NP PP 07
PP rArr P NP 10
Vi rArr sleeps 10
Vt rArr saw 10
NN rArr man 07
NN rArr woman 02
NN rArr telescope 01
DT rArr the 10
IN rArr with 05
IN rArr in 05
bull Probability of a tree t with rules
α1 rarr β1α2 rarr β2 αn rarr βn
is
p(t) =n
i = 1
q(α i rarr βi )
where q(α rarr β) is the probability for rule α rarr β
44
The man sleeps
The man saw the woman with the telescope
NNDT Vi
VPNP
NNDT
NP
NNDT
NP
NNDT
NPVt
VP
IN
PP
VP
S
S
t1=
p(t1)=100310070410
10
0403
10 07 10
t2=
p(ts)=180310070204100310020405031001
10
03 03 03
02
04 04
0510
10 10 1007 02 01
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
Probability of a Parse

A Probabilistic Context-Free Grammar (PCFG)

S → NP VP 1.0
VP → Vi 0.4
VP → Vt NP 0.4
VP → VP PP 0.2
NP → DT NN 0.3
NP → NP PP 0.7
PP → P NP 1.0
Vi → sleeps 1.0
Vt → saw 1.0
NN → man 0.7
NN → woman 0.2
NN → telescope 0.1
DT → the 1.0
IN → with 0.5
IN → in 0.5

• Probability of a tree t built with rules α1 → β1, α2 → β2, …, αn → βn is

  p(t) = ∏_{i=1}^{n} q(αi → βi)

where q(α → β) is the probability for rule α → β.
t1 = the parse of "The man sleeps":
(S (NP (DT the) (NN man)) (VP (Vi sleeps)))

p(t1) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084

t2 = the parse of "The man saw the woman with the telescope" (PP attached to the VP):
(S (NP (DT the) (NN man)) (VP (VP (Vt saw) (NP (DT the) (NN woman))) (PP (IN with) (NP (DT the) (NN telescope)))))

p(t2) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1 ≈ 1.5 × 10⁻⁵
PCFGs: Learning and Inference

• Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1}^{n} q(αi → βi)
• Learning: read the rules off of labeled sentences, use maximum-likelihood estimates for the rule probabilities, and use all of our standard smoothing tricks
• Inference: for an input sentence s, define T(s) to be the set of trees whose yield is s (the leaves, read left to right, match the words in s)
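For concreteness, here is a minimal sketch of the learning step in Python, with trees represented as nested tuples (the representation and function names are my own illustration, not from the slides). Maximum likelihood sets q(α → β) = Count(α → β) / Count(α):

from collections import defaultdict

def count_rules(tree, rule_counts, lhs_counts):
    # tree is (label, children...) for internal nodes, a plain string for words
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c, rule_counts, lhs_counts)

def ml_estimates(treebank):
    rule_counts, lhs_counts = defaultdict(int), defaultdict(int)
    for tree in treebank:
        count_rules(tree, rule_counts, lhs_counts)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

t1 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
print(ml_estimates([t1]))   # each rule seen once per LHS here, so all 1.0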
Grammar Transforms
Chomsky Normal Form

• All rules are of the form X → Y Z or X → w, with X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
  • That is, it recognizes the same language
  • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules (n > 2) are divided by introducing new nonterminals
A phrase structure grammar

S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → ε
PP → P NP
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (remove the empty NP → ε, adding the rules it licenses)

S → NP VP
S → VP
VP → V NP
VP → V
VP → V NP PP
VP → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (eliminate the unary S → VP by expanding it through VP's rules)

S → NP VP
VP → V NP
S → V NP
VP → V
S → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps (eliminate the unary S → V)

S → NP VP
VP → V NP
S → V NP
VP → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
V → fish
S → fish
V → tanks
S → tanks
P → with
Chomsky Normal Form steps (eliminate the unary VP → V)

S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people
N → fish
N → tanks
N → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
Chomsky Normal Form steps (eliminate the remaining unaries: NP → NP, NP → PP, NP → N, PP → P)

S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form steps (binarize the ternary rules with new symbols VP_V and S_V — the grammar is now in CNF)

S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
A phrase structure grammar

S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → ε
PP → P NP
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps

S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form

• You should think of this as a transformation for efficient parsing
• With some extra bookkeeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
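A sketch of the n-ary-to-binary step in Python (illustrative, not from the slides; it uses the same VP_V-style naming as the example that follows, and ignores the extra bookkeeping needed if two rules with the same left corner have different remainders):

def binarize(rules):
    # rules: list of (lhs, rhs_tuple); returns an equivalent binary grammar,
    # introducing new symbols like VP_V for the remainder of an n-ary rule
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            rest = lhs + "_" + rhs[0]      # e.g. VP -> V NP PP introduces VP_V
            out.append((lhs, (rhs[0], rest)))
            lhs, rhs = rest, rhs[1:]
        out.append((lhs, rhs))
    return out

print(binarize([("VP", ("V", "NP", "PP"))]))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]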
An example, before binarization:

(ROOT (S (NP (N people)) (VP (V fish) (NP (N tanks)) (PP (P with) (NP (N rods))))))

After binarization:

(ROOT (S (NP (N people)) (VP (V fish) (VP_V (NP (N tanks)) (PP (P with) (NP (N rods)))))))
Treebank empties and unaries

PTB Tree:          (ROOT (S-HLN (NP-SUBJ (-NONE- ε)) (VP (VB Atone))))
NoFuncTags:        (ROOT (S (NP (-NONE- ε)) (VP (VB Atone))))
NoEmpties:         (ROOT (S (VP (VB Atone))))
NoUnaries (High):  (ROOT (S Atone))
NoUnaries (Low):   (ROOT (VB Atone))
Parsing
Constituency Parsing

fish people fish tanks

Rule          Prob
S → NP VP     θ0
NP → NP NP    θ1
…
N → fish      θ42
N → people    θ43
V → fish      θ44
…

One candidate parse under this PCFG (the one CKY will find below):
(S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks))))
Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (max) scores, e.g. for the width-1 cells over "people" and "fish":
people: NP 0.35, V 0.1, N 0.5
fish:   VP 0.06, NP 0.14, V 0.6, N 0.2

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
Extended CKY parsing

• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
  • Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k and rules X → Y Z of
      q(X → Y Z) · bestScore(Y, i, k, s) · bestScore(Z, k+1, j, s)
function CKY(words, grammar) returns [most_probable_parse, prob]

  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A → words[i] in grammar
        score[i][i+1][A] = P(A → words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A → B in grammar
          prob = P(A → B) · score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries
  for span = 2 to #(words)
    for begin = 0 to #(words) − span
      end = begin + span
      for split = begin+1 to end−1
        for A, B, C in nonterms
          prob = score[begin][split][B] · score[split][end][C] · P(A → B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A → B) · score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries
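As a cross-check of the pseudocode, here is a compact runnable Python version of probabilistic CKY with the unary-closure loop, written against the fish/people grammar listed just below (the dictionary layout and function names are illustrative, not from the slides):

from collections import defaultdict

binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
          ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
          ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def cky(words):
    score = defaultdict(float)        # (i, j, X) -> best probability
    back = {}                         # (i, j, X) -> backpointer
    def close_unaries(i, j):
        added = True                  # apply A -> B rules until nothing improves
        while added:
            added = False
            for (a, b), q in unary.items():
                p = q * score[i, j, b]
                if p > score[i, j, a]:
                    score[i, j, a], back[i, j, a] = p, b
                    added = True
    for i, w in enumerate(words):     # lexical step
        for (a, word), q in lexicon.items():
            if word == w:
                score[i, i + 1, a] = q
        close_unaries(i, i + 1)
    for span in range(2, len(words) + 1):
        for i in range(len(words) - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # binary combinations
                for (a, b, c), q in binary.items():
                    p = q * score[i, k, b] * score[k, j, c]
                    if p > score[i, j, a]:
                        score[i, j, a], back[i, j, a] = p, (k, b, c)
            close_unaries(i, j)
    return score, back

score, back = cky("fish people fish tanks".split())
print(round(score[0, 4, "S"], 8))     # 0.00018522, matching the chart walk-through below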
The grammar (binary, no epsilons)

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
Filling the chart for "fish people fish tanks" (fenceposts 0–4; the grammar sidebar repeated on each of the original slides is shown once above; decimal points restored):

Lexical step — score[i][i+1][A] = P(A → words[i]):
  [0,1] fish:   N 0.2, V 0.6
  [1,2] people: N 0.5, V 0.1
  [2,3] fish:   N 0.2, V 0.6
  [3,4] tanks:  N 0.2, V 0.3

Unary closure on the width-1 cells:
  [0,1]: NP → N 0.14, VP → V 0.06, S → VP 0.006
  [1,2]: NP → N 0.35, VP → V 0.01, S → VP 0.001
  [2,3]: NP → N 0.14, VP → V 0.06, S → VP 0.006
  [3,4]: NP → N 0.14, VP → V 0.03, S → VP 0.003

Width-2 cells (binary rules, then unary closure):
  [0,2] fish people: NP → NP NP 0.0049; VP → V NP 0.105; S → NP VP 0.00126, improved by the unary S → VP to 0.0105
  [1,3] people fish: NP → NP NP 0.0049; VP → V NP 0.007; S → NP VP 0.0189 (the unary S → VP does not beat it)
  [2,4] fish tanks:  NP → NP NP 0.00196; VP → V NP 0.042; S → NP VP 0.00378, improved by the unary S → VP to 0.0042

Width-3 cells:
  [0,3] fish people fish: NP → NP NP 0.0000686; VP → V NP 0.00147; S → NP VP 0.000882
  [1,4] people fish tanks: NP → NP NP 0.0000686; VP → V NP 0.000098; S → NP VP 0.01323

Width-4 cell:
  [0,4] whole sentence: NP → NP NP 0.0000009604; VP → V NP 0.00002058; S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse.
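The buildTree step itself is not spelled out on the slides; a minimal backtrace over the back pointers from the CKY sketch above could look like this (nested-tuple trees, illustrative names):

def build_tree(words, back, i, j, label):
    bp = back.get((i, j, label))
    if bp is None:                    # lexical cell: no backpointer was stored
        return (label, words[i])
    if isinstance(bp, str):           # unary backpointer: the child label
        return (label, build_tree(words, back, i, j, bp))
    k, b, c = bp                      # binary backpointer: split point and child labels
    return (label, build_tree(words, back, i, k, b),
                   build_tree(words, back, k, j, c))

words = "fish people fish tanks".split()
print(build_tree(words, back, 0, 4, "S"))
# ('S', ('NP', ('NP', ('N', 'fish')), ('NP', ('N', 'people'))),
#       ('VP', ('V', 'fish'), ('NP', ('N', 'tanks'))))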
Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets:     S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall:    3/8 = 37.5%
LP/LR F1:          40.0%
Tagging Accuracy:  11/11 = 100.0%
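The bracket metrics are simple set arithmetic; a small sketch reproducing the numbers above (brackets as (label, start, end) triples; crossing-bracket and multiset subtleties are ignored here):

def prf(gold, guess):
    correct = len(gold & guess)           # brackets appearing in both sets
    p, r = correct / len(guess), correct / len(gold)
    return p, r, 2 * p * r / (p + r)

gold  = {("S",0,11), ("NP",0,2), ("VP",2,9), ("VP",3,9),
         ("NP",4,6), ("PP",6,9), ("NP",7,9), ("NP",9,10)}
guess = {("S",0,11), ("NP",0,2), ("VP",2,10), ("VP",3,10),
         ("NP",4,6), ("PP",6,10), ("NP",7,10)}
print(prf(gold, guess))   # (0.4285..., 0.375, 0.3999...): LP 42.9, LR 37.5, F1 40.0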
How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Gives a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs

• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree, given its POS)

A Case of PP Attachment Ambiguity

A Case of Coordination Ambiguity

Structural Preferences: Close Attachment

• Example: "John was believed to have been shot by Bill"
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
  • So the two analyses receive the same probability
PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label

[Figure: a tree with an NP node under S, illustrating that the NP expansion (e.g. NP → DT NN) is conditioned only on the label NP]
Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

[Chart: relative frequencies of the expansions NP PP, DT NN, and PRP for all NPs, NPs under S, and NPs under VP — pronoun (PRP) expansions dominate under S, while NP PP expansions dominate under VP]
Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong

[Figure: in the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing
Horizontal Markovization

• Horizontal Markovization merges states

[Charts: parsing F1 (roughly 70–74) and grammar size in symbols (roughly 0–12,000) as a function of horizontal Markov order 0, 1, 2v, 2, ∞]
Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)

[Charts: example trees for order 1 vs. order 2; parsing F1 (roughly 72–79) and grammar size in symbols (roughly 0–25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]
Model    F1    Size
v=h=2v   77.8  7.5K
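Parent annotation (vertical order 2) is a one-pass tree transform; a minimal sketch on nested-tuple trees, leaving POS tags unsplit (annotating the tags as well would be the TAG-PA move from the later tag-split slides):

def is_preterminal(t):
    return len(t) == 2 and isinstance(t[1], str)

def parent_annotate(tree, parent="ROOT"):
    # vertical Markov order 2: each phrasal nonterminal also records its parent
    if isinstance(tree, str) or is_preterminal(tree):
        return tree
    label, children = tree[0], tree[1:]
    return (label + "^" + parent,) + tuple(parent_annotate(c, label) for c in children)

t = ("S", ("NP", ("PRP", "he")), ("VP", ("VBD", "slept")))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('PRP', 'he')), ('VP^S', ('VBD', 'slept')))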
Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K
Tag Splits

• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those") — F1 80.4, Size 8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very") — F1 80.5, Size 8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP) — F1 81.2, Size 8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97] — F1 81.6, Size 9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions — F1 81.7, Size 9.1K
• SPLIT-%: "%" gets its own tag — F1 81.8, Size 9.3K
Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation    F1    Size
Previous      85.7  10.5K
BASE-NP       86.0  11.7K
DOMINATES-V   86.9  14.1K
RIGHT-REC-NP  87.0  15.2K

[Figure: a VP region dominating a verb (+v) vs. an NP/PP region that does not (−v)]
A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs
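The head-rule tables themselves did not survive extraction here; the sketch below gives the flavor of Collins-style head finding with a deliberately simplified priority table (the real rules cover every category, with more priority levels and special cases):

# Simplified head-finding in the spirit of Collins (1999); illustrative only.
HEAD_RULES = {
    # label -> (direction to scan children, priority list of child labels)
    "NP": ("right-to-left", ["NN", "NNS", "NNP", "NNPS", "NX", "JJR", "NP"]),
    "VP": ("left-to-right", ["VBD", "VBN", "MD", "VBZ", "VB", "VBG", "VBP", "VP"]),
}

def find_head(label, children):
    direction, priorities = HEAD_RULES[label]
    kids = children if direction == "left-to-right" else list(reversed(children))
    for want in priorities:          # highest-priority label wins
        for kid in kids:
            if kid == want:
                return kid
    return kids[0]                   # default: first child in scan order

print(find_head("VP", ["VBD", "NP", "PP"]))   # VBD heads the VP
print(find_head("NP", ["DT", "JJ", "NN"]))    # NN heads the NP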
Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY

(VP → VBD NP)[saw] is built from VBD[saw] and NP[her]: a rule X[h] → Y[h] Z[h′] spans (i, j) with split point k; the head word h is inherited from one child and h′ is the head of the other.

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return max of
      max over k, w, and rules X → Y Z of score(X[h] → Y[h] Z[w]) · bestScore(Y, i, k, h) · bestScore(Z, k, j, w)
      max over k, w, and rules X → Y Z of score(X[h] → Y[w] Z[h]) · bestScore(Y, i, k, w) · bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs
Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n⁵) CKY
  • Remember only a few hypotheses for each span ⟨i, j⟩
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
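A per-cell beam is just a top-K filter on each span's hypotheses; a minimal sketch (the cell layout, mapping (category, head) pairs to probabilities, is illustrative):

import heapq

def prune_to_beam(cell, k=10):
    # cell: dict mapping (category, head) -> probability for one span <i, j>;
    # keep only the K highest-scoring hypotheses, as in a per-cell beam
    best = heapq.nlargest(k, cell.items(), key=lambda kv: kv[1])
    return dict(best)

cell = {("NP", "fish"): 0.1, ("VP", "fish"): 0.3, ("S", "fish"): 0.05}
print(prune_to_beam(cell, k=2))   # keeps the VP and NP hypotheses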
Parameter Estimation

A Model from Charniak (1997)

Other Details
Final Test Set Results

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis/Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar

• Annotation refines base treebank symbols to improve the statistical fit of the grammar
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits

• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: a tree with latent node labels X1 … X7 over "He was right"]

• Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside scores)
Automatic Annotation Induction

• Label all nodes with latent variables; the same number k of subcategories for all categories
• Advantages
  • Automatically learned
• Disadvantages
  • Grammar gets too large
  • Most categories are oversplit, while others are undersplit

Model                F1
Klein & Manning '03  86.3
Matsuzaki et al '05  86.7
Refinement of the DT tag

[Figure: DT split into DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • start with two subcategories per non-terminal, then keep splitting
  • initialize each EM run with the output of the last
Adaptive Splitting [Petrov et al 06]

• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split:

  loss = (data likelihood with the split reversed) / (data likelihood with the split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model             F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories
Final Results

Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al '05     86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al 06         90.2             89.7
Hierarchical Pruning

• Parse multiple times with grammars at different levels of granularity:

  coarse:         … QP NP VP …
  split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight: … and so on …

Bracket Posteriors

[Figure: coarse-to-fine parsing times drop from 1621 min to 111 min, 35 min, and finally 15 min at 91.2 F1 (no search error)]
PCFGs Learning and Inference
Model The probability of a tree t with n rules αi βi i = 1n
Learning Read the rules off of labeled sentences use ML estimates for
probabilities
and use all of our standard smoothing tricks
Inference For input sentence s define T(s) to be the set of trees whose yield is s
(whole leaves read left to right match the words in s)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Grammar Transforms
51
Chomsky Normal Form
bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ
bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language
bull But maybe with different trees
bull Empties and unaries are removed recursively
bull n-ary rules are divided by introducing new nonterminals (n gt 2)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
S VP
VP V NP
VP V
VP V NP PP
VP V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
S V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
V fish
S fish
V tanks
S tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP
NP NP PP
NP PP
NP N
PP P NP
PP P
N people
N fish
N tanks
N rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing

Horizontal Markovization
• Horizontal Markovization merges states
(plots: parsing F1 of roughly 70–74, and grammar size from 0 to 12000 symbols, as a function of horizontal Markov order 0, 1, 2v, 2, ∞)
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
• Order 1 vs. Order 2
(plots: parsing F1 of roughly 72–79, and grammar size from 0 to 25000 symbols, as a function of vertical Markov order 1, 2v, 2, 3v, 3)

Model     F1    Size
v=h=2v    77.8  7.5K
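A minimal sketch of how the two Markov orders become symbol annotations (Python; the ^ and @ naming conventions here are illustrative, not from the slides):

def vertical(symbol, ancestors, v):
    """Vertical order v: annotate with the last v-1 ancestor labels
    (v=1 is the bare symbol; v=2 is parent annotation).
    ancestors runs from the root down, so the last element is the parent."""
    tail = ancestors[-(v - 1):] if v > 1 else []
    return "^".join([symbol] + tail)

def horizontal(parent, emitted, h):
    """Horizontal order h: an intermediate binarization symbol remembers
    only the last h siblings generated so far."""
    window = emitted[-h:] if h > 0 else []
    return "@" + parent + "->..." + "_".join(window)

print(vertical("NP", ["S", "VP"], 2))     # NP^VP  (parent annotation)
print(horizontal("VP", ["V", "NP"], 1))   # @VP->...NP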
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1    Size
Base         77.8  7.5K
UNARY        78.3  8.0K
Tag Splits
• Problem: treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1    Size
Previous     78.3  8.0K
SPLIT-IN     80.3  8.1K
Other Tag Splits

Annotation                                                          F1    Size
UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")         80.4  8.1K
UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")       80.5  8.1K
TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)    81.2  8.5K
SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]         81.6  9.0K
SPLIT-CC: separate "but" and "&" from other conjunctions            81.7  9.1K
SPLIT-%: "%" gets its own tag                                       81.8  9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1    Size
tag splits   82.3  9.7K
POSS-NP      83.1  9.8K
SPLIT-VP     85.7  10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non)-recursive
    • Base NPs [cf. Collins 99]
    • Right-recursive NPs

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K

(figure: NP, VP, PP attachment sites annotated v / -v)
A Fully Annotated Tree

Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY
• Chart items are now head-annotated: Y[h] over [i,k] combines with Z[h'] over [k,j] into X[h]
• Example: (VP → VBD[saw] NP[her]), an instance of the rule (VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max over split k and rules X → Y Z of:
      max  score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max  score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
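Where the O(n^5) mentioned on the next slide comes from: a lexicalized item X[h] ranges over O(n^2) spans and O(n) head positions, i.e. O(n^3) chart items, and filling one item maximizes over O(n) split points and O(n) choices of the other head w, so the naive algorithm costs O(n^3) · O(n^2) = O(n^5).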
Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
(diagram: Y[h] over [i,k] and Z[h'] over [k,j] combine into X[h])
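One way to unpack the "(why?)" above: a span <i,j> has at most n−1 split points, and at each split we pair at most K surviving hypotheses on the left with at most K on the right, so a span costs O(nK^2); with O(n^2) spans, the total is O(n^3 K^2) — cubic in sentence length for a fixed beam size K.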
Parameter Estimation

A Model from Charniak (1997)

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6
Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
• Can learn with EM, like Forward-Backward for HMMs (Forward ≈ Outside, Backward ≈ Inside)
(figure: a tree over "He was right" with latent subcategory labels X1 … X7 at its nodes)
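For concreteness, the EM statistics have the standard inside–outside shape (a generic textbook sketch, not the exact notation of any one paper). For annotated symbols \(A_x\) over span \((i,j)\):

\[ I(A_x,i,j) \;=\; \sum_{A_x \to B_y C_z} \; \sum_{k=i+1}^{j-1} P(A_x \to B_y C_z)\, I(B_y,i,k)\, I(C_z,k,j) \]

and the expected count of that rule use over \((i,k,j)\), which drives the M-step, is \( O(A_x,i,j)\, P(A_x \to B_y C_z)\, I(B_y,i,k)\, I(C_z,k,j) \,/\, I(\mathrm{ROOT},0,n) \) — the grammar analogue of the forward–backward posteriors for HMMs.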
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit, while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7
Refinement of the DT tag
• Example: DT splits into DT-1, DT-2, DT-3, DT-4
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • Start with two subcategories (per non-terminal), then keep splitting
  • Initialize each EM run with the output of the last
Adaptive Splitting [Petrov et al. 06]
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split:
  (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
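Stated as a formula (notation mine, not the slides'): for a candidate split \(s\),

\[ \mathrm{loss}(s) \;=\; \frac{P(\text{data} \mid \theta_{\,s\ \text{merged back}})}{P(\text{data} \mid \theta)} \]

so splits whose loss is closest to 1 barely affect the likelihood and are the first to be rolled back.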
Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. 06          90.2              89.7
Hierarchical Pruning
• Parse multiple times, with grammars at different levels of granularity:
  coarse:         … QP  NP  VP …
  split in two:   … QP1 QP2  NP1 NP2  VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4  NP1 NP2 NP3 NP4  VP1 VP2 VP3 VP4 …
  split in eight: … … …

Bracket Posteriors
(successive pruning levels cut parsing time from 1621 min to 111 min, 35 min, and finally 15 min, at 91.2 F1 with no search error)
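The pruning criterion behind these timings is the bracket posterior, in the same inside–outside notation as above:

\[ P\big((i,j,X) \mid \text{sentence}\big) \;=\; \frac{O(X,i,j)\, I(X,i,j)}{I(\mathrm{ROOT},0,n)} \]

items whose posterior under the coarser grammar falls below a threshold are never built at the next, more split level.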
Chomsky Normal Form
• All rules are of the form X → Y Z or X → w, with X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
  • That is, it recognizes the same language
  • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules are divided by introducing new nonterminals (n > 2)
A phrase structure grammar

S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people | fish | tanks | rods
V → people | fish | tanks
P → with
Chomsky Normal Form steps (1: remove the empty NP → e, adding the rule variants it induces)

S → NP VP | VP
VP → V NP | V | V NP PP | V PP
NP → NP NP | NP | NP PP | PP | N
PP → P NP | P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with
Chomsky Normal Form steps (2: remove the unary S → VP, copying VP's expansions onto S)

S → NP VP | V NP | V | V NP PP | V PP
VP → V NP | V | V NP PP | V PP
NP → NP NP | NP | NP PP | PP | N
PP → P NP | P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with
Chomsky Normal Form steps (3: remove the unary S → V, pushing it into the lexicon)

S → NP VP | V NP | V NP PP | V PP
VP → V NP | V | V NP PP | V PP
NP → NP NP | NP | NP PP | PP | N
PP → P NP | P
N → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
P → with
Chomsky Normal Form steps (4: remove the unary VP → V the same way)

S → NP VP | V NP | V NP PP | V PP
VP → V NP | V NP PP | V PP
NP → NP NP | NP | NP PP | PP | N
PP → P NP | P
N → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
Chomsky Normal Form steps (5: remove the remaining unaries NP → NP, NP → PP, NP → N, PP → P)

S → NP VP | V NP | V NP PP | V PP
VP → V NP | V NP PP | V PP
NP → NP NP | NP PP | P NP
PP → P NP
NP → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
PP → with
Chomsky Normal Form steps (6: binarize the ternary rules with new symbols VP_V and S_V)

S → NP VP | V NP | V S_V | V PP
S_V → NP PP
VP → V NP | V VP_V | V PP
VP_V → NP PP
NP → NP NP | NP PP | P NP
PP → P NP
NP → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
PP → with
A phrase structure grammar (the original, repeated for comparison)

S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people | fish | tanks | rods
V → people | fish | tanks
P → with
Chomsky Normal Form steps (the resulting CNF grammar, repeated for comparison)

S → NP VP | V NP | V S_V | V PP
S_V → NP PP
VP → V NP | V VP_V | V PP
VP_V → NP PP
NP → NP NP | NP PP | P NP
PP → P NP
NP → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
PP → with
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
  • With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
  • In practice, full Chomsky Normal Form is a pain
    • Reconstructing n-aries is easy
    • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
  • The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
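To make the book-keeping concrete, here is a minimal right-binarization sketch (Python; the function and data layout are illustrative). New symbols get names like VP_V, mirroring the intermediate symbol used below, so the transform can be undone:

def binarize(rules):
    """Right-binarize n-ary rules; rules are (lhs, rhs_tuple, prob)."""
    out = []
    for lhs, rhs, p in rules:
        rhs = list(rhs)
        while len(rhs) > 2:
            new_sym = lhs + "_" + rhs[0]              # book-keeping name
            out.append((lhs, (rhs[0], new_sym), p))
            lhs, rhs, p = new_sym, rhs[1:], 1.0       # remainder gets prob 1.0
        out.append((lhs, tuple(rhs), p))
    return out

print(binarize([("VP", ("V", "NP", "PP"), 0.1)]))
# [('VP', ('V', 'VP_V'), 0.1), ('VP_V', ('NP', 'PP'), 1.0)]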
An example: before binarization…
(ROOT (S (NP (N people))
         (VP (V fish) (NP (N tanks)) (PP (P with) (NP (N rods))))))

After binarization…
(ROOT (S (NP (N people))
         (VP (V fish) (VP_V (NP (N tanks)) (PP (P with) (NP (N rods)))))))
Treebank empties and unaries

PTB tree:     (ROOT (S-HLN (NP-SUBJ (-NONE- e)) (VP (VB Atone))))
NoFuncTags:   (ROOT (S (NP (-NONE- e)) (VP (VB Atone))))
NoEmpties:    (ROOT (S (VP (VB Atone))))
NoUnaries:    high: (ROOT (S Atone))    low: (ROOT (VB Atone))
Parsing

Constituency Parsing
• Input: fish people fish tanks
• PCFG:
  Rule           Prob
  S → NP VP      θ0
  NP → NP NP     θ1
  …
  N → fish       θ42
  N → people     θ43
  V → fish       θ44
  …
• Output tree: (S (NP (N fish) (N people)) (VP (V fish) (NP (N tanks))))
Cocke-Kasami-Younger (CKY) Constituency Parsing
• Input: fish people fish tanks
• Viterbi (max) scores, e.g. for the single-word cells "people" and "fish":
  people: N 0.5, V 0.1, NP 0.35
  fish:   N 0.2, V 0.6, NP 0.14, VP 0.06

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
  • Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k and rules X → Y Z of
      q(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
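A minimal runnable sketch of bestScore with memoization, which is what turns the exponential recursion into the polynomial CKY below (Python; the grammar encoding is illustrative):

from functools import lru_cache

def best_score(sent, binary, lex):
    sent = tuple(sent)

    @lru_cache(maxsize=None)
    def best(X, i, j):
        if i == j:                              # single word: use the lexicon
            return lex.get((X, sent[i]), 0.0)
        return max((p * best(Y, i, k) * best(Z, k + 1, j)
                    for Y, Z, p in binary.get(X, ())
                    for k in range(i, j)),
                   default=0.0)

    return best("S", 0, len(sent) - 1)

# With unaries collapsed into S -> V NP (0.1 * 0.5), this reproduces the
# S score for the span "fish people" in the worked chart:
print(best_score(["fish", "people"],
                 {"S": [("V", "NP", 0.05)]},
                 {("V", "fish"): 0.6, ("NP", "people"): 0.35}))   # 0.0105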
function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A → words[i] in grammar
        score[i][i+1][A] = P(A → words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A → B in grammar
          prob = P(A → B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A → B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries
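A runnable Python rendering of this pseudocode (a sketch: the dictionary-based grammar encoding is my own, not from the slides — binary rules as (A, B, C, prob), unaries as (A, B, prob), lexical rules as (A, word, prob)):

from collections import defaultdict

def cky(words, binary, unary, lex):
    n = len(words)
    score = defaultdict(float)      # (begin, end, A) -> best probability
    back = {}                       # (begin, end, A) -> backpointer

    def handle_unaries(b, e):
        added = True
        while added:                # unary closure, as in the while loop above
            added = False
            for A, B, p in unary:
                prob = p * score[b, e, B]
                if prob > score[b, e, A]:
                    score[b, e, A] = prob
                    back[b, e, A] = B
                    added = True

    for i, w in enumerate(words):   # lexical rules
        for A, word, p in lex:
            if word == w and p > score[i, i + 1, A]:
                score[i, i + 1, A] = p
        handle_unaries(i, i + 1)

    for span in range(2, n + 1):    # binary combination
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for A, B, C, p in binary:
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
            handle_unaries(begin, end)
    return score, back

binary = [("S","NP","VP",0.9), ("VP","V","NP",0.5), ("VP","V","VP_V",0.3),
          ("VP","V","PP",0.1), ("VP_V","NP","PP",1.0), ("NP","NP","NP",0.1),
          ("NP","NP","PP",0.2), ("PP","P","NP",1.0)]
unary = [("S","VP",0.1), ("VP","V",0.1), ("NP","N",0.7)]
lex = [("N","people",0.5), ("N","fish",0.2), ("N","tanks",0.2), ("N","rods",0.1),
       ("V","people",0.1), ("V","fish",0.6), ("V","tanks",0.3), ("P","with",1.0)]

score, back = cky("fish people fish tanks".split(), binary, unary, lex)
print(score[0, 4, "S"])    # ~0.00018522, matching the worked chart below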
The grammar: binary, no epsilons

S → NP VP 0.9       N → people 0.5
S → VP 0.1          N → fish 0.2
VP → V NP 0.5       N → tanks 0.2
VP → V 0.1          N → rods 0.1
VP → V VP_V 0.3     V → people 0.1
VP → V PP 0.1       V → fish 0.6
VP_V → NP PP 1.0    V → tanks 0.3
NP → NP NP 0.1      P → with 1.0
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
(chart layout: cells score[0][1], score[1][2], score[2][3], score[3][4]; score[0][2], score[1][3], score[2][4]; score[0][3], score[1][4]; score[0][4])
The CKY chart for "fish people fish tanks" (each slide frame repeats the grammar sidebar above; the successive chart states are collected here):

Step 1 — lexical rules (the score[i][i+1][A] = P(A → words[i]) loop):
  [0,1] fish:   N 0.2, V 0.6
  [1,2] people: N 0.5, V 0.1
  [2,3] fish:   N 0.2, V 0.6
  [3,4] tanks:  N 0.2, V 0.3

Step 2 — unary closure on the lexical cells (the "handle unaries" loop):
  [0,1] fish:   + NP 0.14, VP 0.06, S 0.006
  [1,2] people: + NP 0.35, VP 0.01, S 0.001
  [2,3] fish:   + NP 0.14, VP 0.06, S 0.006
  [3,4] tanks:  + NP 0.14, VP 0.03, S 0.003

Step 3 — binary rules over span 2 (prob = score[begin][split][B] * score[split][end][C] * P(A → B C)):
  [0,2]: NP → NP NP 0.0049,  VP → V NP 0.105,  S → NP VP 0.00126
  [1,3]: NP → NP NP 0.0049,  VP → V NP 0.007,  S → NP VP 0.0189
  [2,4]: NP → NP NP 0.00196, VP → V NP 0.042,  S → NP VP 0.00378

Step 4 — unary closure over span 2 (S → VP beats S → NP VP in two cells):
  [0,2]: NP 0.0049,  VP 0.105, S 0.0105
  [1,3]: NP 0.0049,  VP 0.007, S 0.0189
  [2,4]: NP 0.00196, VP 0.042, S 0.0042

Step 5 — span 3:
  [0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147,  S → NP VP 0.000882
  [1,4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

Step 6 — span 4 (the root cell):
  [0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse.
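A companion to the cky() sketch above: reconstructing the best tree from the backpointers (tuple-tree output; same assumed data layout):

def build_tree(words, back, begin, end, A):
    bp = back.get((begin, end, A))
    if bp is None:                      # lexical cell: A covers words[begin]
        return (A, words[begin])
    if isinstance(bp, tuple):           # binary backpointer (split, B, C)
        split, B, C = bp
        return (A, build_tree(words, back, begin, split, B),
                   build_tree(words, back, split, end, C))
    return (A, build_tree(words, back, begin, end, bp))   # unary: bp is B

words = "fish people fish tanks".split()
print(build_tree(words, back, 0, 4, "S"))
# ('S', ('NP', ('NP', ('N', 'fish')), ('NP', ('N', 'people'))),
#       ('VP', ('V', 'fish'), ('NP', ('N', 'tanks'))))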
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people | fish | tanks | rods
V → people | fish | tanks
P → with

Chomsky Normal Form steps (1): remove the empty (NP → e), inlining the unary S → VP this creates
S → NP VP
VP → V NP
S → V NP
VP → V
S → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with
Chomsky Normal Form steps (2): remove the unary S → V by inlining it into the lexicon
S → NP VP
VP → V NP
S → V NP
VP → V
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
P → with
Chomsky Normal Form steps (3): remove the unary VP → V the same way
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
N → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
Chomsky Normal Form steps (4): remove the remaining unaries (NP → NP, NP → PP, NP → N, PP → P)
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
PP → with
Chomsky Normal Form steps (5): binarize the ternary rules with new symbols VP_V and S_V
S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people | fish | tanks | rods
V → people | fish | tanks
S → people | fish | tanks
VP → people | fish | tanks
P → with
PP → with
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing.
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform.
• In practice, full Chomsky Normal Form is a pain:
  • Reconstructing n-aries is easy.
  • Reconstructing unaries/empties is trickier.
• Binarization is crucial for cubic-time CFG parsing.
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker.
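To make the binarization step concrete, here is a minimal Python sketch (my own illustration, not from the slides) that right-binarizes trees given as nested lists, introducing intermediate symbols like VP_V exactly as in the example that follows:

# Minimal sketch, assuming trees as nested lists [label, child1, child2, ...]
# with words as plain strings.
def binarize(tree):
    if isinstance(tree, str):                      # a word: nothing to do
        return tree
    label = tree[0]
    kids = [binarize(k) for k in tree[1:]]
    if len(kids) <= 2:                             # already unary or binary
        return [label] + kids
    # e.g. VP -> V NP PP  becomes  VP -> V VP_V  with  VP_V -> NP PP
    first = kids[0]
    first_label = first if isinstance(first, str) else first[0]
    rest = binarize([label + "_" + first_label] + kids[1:])
    return [label, first, rest]

vp = ["VP", ["V", "fish"], ["NP", ["N", "tanks"]],
      ["PP", ["P", "with"], ["NP", ["N", "rods"]]]]
print(binarize(vp))
# ['VP', ['V', 'fish'], ['VP_V', ['NP', ['N', 'tanks']], ['PP', ['P', 'with'], ['NP', ['N', 'rods']]]]]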
An example, before binarization…
(ROOT
  (S (NP (N people))
     (VP (V fish)
         (NP (N tanks))
         (PP (P with) (NP (N rods))))))

After binarization…
(ROOT
  (S (NP (N people))
     (VP (V fish)
         (VP_V (NP (N tanks))
               (PP (P with) (NP (N rods)))))))
Treebank empties and unaries
PTB tree:        (ROOT (S-HLN (NP-SUBJ (-NONE- *e*)) (VP (VB Atone))))
NoFuncTags:      (ROOT (S (NP (-NONE- *e*)) (VP (VB Atone))))
NoEmpties:       (ROOT (S (VP (VB Atone))))
NoUnaries, high: (ROOT (S Atone))
NoUnaries, low:  (ROOT (VB Atone))
Parsing
Constituency Parsing
Input sentence: fish people fish tanks

PCFG rules with probabilities θi:
Rule          Prob
S → NP VP     θ0
NP → NP NP    θ1
…
N → fish      θ42
N → people    θ43
V → fish      θ44
…

One parse: (S (NP (N fish) (N people)) (VP (V fish) (NP (N tanks))))
Cocke-Kasami-Younger (CKY) Constituency Parsing
Sentence: fish people fish tanks
Viterbi (max) scores for two one-word cells (grammar as below):
  people: N 0.5, V 0.1, NP 0.35
  fish:   N 0.2, V 0.6, NP 0.14, VP 0.06
Extended CKY parsing
• Unaries can be incorporated into the algorithm:
  • Messy, but doesn't increase algorithmic complexity.
• Empties can be incorporated:
  • Use fenceposts; doesn't increase complexity — essentially like unaries.
• Binarization is vital:
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar.
  • Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there.
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k, X → Y Z of
      q(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
function CKY(words, grammar) returns [most_probable_parse, prob]

  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]

  // lexical rules on the diagonal
  for i = 0; i < #(words); i++
    for A in nonterms
      if A → words[i] in grammar
        score[i][i+1][A] = P(A → words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A → B in grammar
          prob = P(A → B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

  // binary rules over larger spans
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A → B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true

  return buildTree(score, back)

The CKY algorithm (1960/1965)… extended to unaries
The grammar: binary, no epsilons

S → NP VP     0.9
S → VP        0.1
VP → V NP     0.5
VP → V        0.1
VP → V VP_V   0.3
VP → V PP     0.1
VP_V → NP PP  1.0
NP → NP NP    0.1
NP → NP PP    0.2
NP → N        0.7
PP → P NP     1.0
N → people    0.5
N → fish      0.2
N → tanks     0.2
N → rods      0.1
V → people    0.1
V → fish      0.6
V → tanks     0.3
P → with      1.0
The chart: one cell score[i][j] for each span 0 ≤ i < j ≤ 4 over
fish(0,1) people(1,2) fish(2,3) tanks(3,4):

score[0][1] score[0][2] score[0][3] score[0][4]
            score[1][2] score[1][3] score[1][4]
                        score[2][3] score[2][4]
                                    score[3][4]
Lexical step:
for i = 0; i < #(words); i++
  for A in nonterms
    if A → words[i] in grammar
      score[i][i+1][A] = P(A → words[i])

After this step the diagonal cells hold:
[0,1] fish:   N → fish 0.2,   V → fish 0.6
[1,2] people: N → people 0.5, V → people 0.1
[2,3] fish:   N → fish 0.2,   V → fish 0.6
[3,4] tanks:  N → tanks 0.2,  V → tanks 0.3
Unary step:
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A → B in grammar
      prob = P(A → B) * score[i][i+1][B]
      if prob > score[i][i+1][A]
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true

Diagonal cells after unary closure:
[0,1] fish:   N → fish 0.2,   V → fish 0.6,   NP → N 0.14, VP → V 0.06, S → VP 0.006
[1,2] people: N → people 0.5, V → people 0.1, NP → N 0.35, VP → V 0.01, S → VP 0.001
[2,3] fish:   N → fish 0.2,   V → fish 0.6,   NP → N 0.14, VP → V 0.06, S → VP 0.006
[3,4] tanks:  N → tanks 0.2,  V → tanks 0.3,  NP → N 0.14, VP → V 0.03, S → VP 0.003
Binary step (span 2), maximizing over split points:
for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

Span-2 cells before unary closure:
[0,2]: NP → NP NP 0.0049,   VP → V NP 0.105, S → NP VP 0.00126
[1,3]: NP → NP NP 0.0049,   VP → V NP 0.007, S → NP VP 0.0189
[2,4]: NP → NP NP 0.00196,  VP → V NP 0.042, S → NP VP 0.00378
Unary closure (same loop as above) improves S wherever S → VP scores higher:
[0,2]: NP → NP NP 0.0049,   VP → V NP 0.105, S → VP 0.0105
[1,3]: NP → NP NP 0.0049,   VP → V NP 0.007, S → NP VP 0.0189
[2,4]: NP → NP NP 0.00196,  VP → V NP 0.042, S → VP 0.0042
The same split loop fills the span-3 cells, first:
[0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
…and the other span-3 cell:
[1,4]: NP → NP NP 0.0000686, VP → V NP 0.0000098, S → NP VP 0.01323
Finally span 4, the whole sentence:
[0,4]: NP → NP NP 0.00000009604, VP → V NP 0.00002058, S → NP VP 0.00018522
Call buildTree(score, back) to get the best parse.
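A runnable Python sketch of the whole algorithm (my own transcription of the slides' pseudocode; the table names unary, binary, lexicon are mine). It reproduces the final chart value for S over [0,4]:

from collections import defaultdict

unary  = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}       # A -> B
binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
          ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
          ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}            # A -> B C
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}                   # A -> w

def cky(words):
    n = len(words)
    score = defaultdict(float)            # (begin, end, A) -> best probability

    def handle_unaries(i, j):             # the "while added" loop of the slides
        added = True
        while added:
            added = False
            for (A, B), p in unary.items():
                prob = p * score[i, j, B]
                if prob > score[i, j, A]:
                    score[i, j, A] = prob
                    added = True

    for i, w in enumerate(words):         # lexical rules on the diagonal
        for (A, word), p in lexicon.items():
            if word == w:
                score[i, i + 1, A] = p
        handle_unaries(i, i + 1)

    for span in range(2, n + 1):          # binary rules over larger spans
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
            handle_unaries(begin, end)
    return score

chart = cky("fish people fish tanks".split())
print(chart[0, 4, "S"])                   # ~0.00018522, as in the chart above

Backpointers are omitted for brevity; buildTree would follow them exactly as in the pseudocode.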
Evaluating constituency parsing
Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets:     S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall:    3/8 = 37.5%
LP/LR F1:          40.0%
Tagging Accuracy:  11/11 = 100.0%
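The computation on this slide is easy to replicate; a minimal sketch (my own, with brackets represented as (label, start, end) triples):

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}

correct   = len(gold & cand)             # 3 matching labeled brackets
precision = correct / len(cand)          # 3/7 = 42.9%
recall    = correct / len(gold)          # 3/8 = 37.5%
f1 = 2 * precision * recall / (precision + recall)   # 40.0%
print(f"LP={precision:.1%} LR={recall:.1%} F1={f1:.1%}")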
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1.
• Robust:
  • Usually admit everything, but with low probability.
• Partial solution for grammar ambiguity:
  • A PCFG gives some idea of the plausibility of a parse.
  • But not so good, because the independence assumptions are too strong.
• Gives a probabilistic language model:
  • But in the simple case it performs worse than a trigram model.
• The problem seems to be that PCFGs lack the lexicalization of a trigram model.
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies.
• Lack of sensitivity to lexical information.
  • (A word is independent of the rest of the tree, given its POS.)
A Case of PP Attachment Ambiguity
[example trees omitted]

A Case of Coordination Ambiguity
[example trees omitted]

Structural Preferences: Close Attachment
• Example: "John was believed to have been shot by Bill."
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label.
[figure omitted: an S expanding as NP VP, with the NP expanding as DT NN]
Non-Independence I
• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
[chart omitted: relative frequencies of the expansions NP → PP, NP → DT NN, and NP → PRP, compared across all NPs, NPs under S, and NPs under VP]
Non-Independence II
• Symptoms of overly strong assumptions: rewrites get used where they don't belong.
[figure omitted] In the PTB, this construction is for possessives.
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states.
[charts omitted: F1 (roughly 70–74) and number of symbols (up to ~12,000) as a function of horizontal Markov order 0, 1, 2v, 2, ∞]
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation).
[charts omitted: Order 1 vs. Order 2 example trees; F1 (roughly 72–79) and number of symbols (up to ~25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model    F1    Size
v=h=2v   77.8  7.5K
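For concreteness, a small sketch of order-2 vertical Markovization (parent annotation) over nested-list trees — my own illustration, not the slides' code:

# Each nonterminal label is augmented with its parent's base label,
# e.g. a subject NP becomes NP^S while an object NP becomes NP^VP.
def parent_annotate(tree, parent=None):
    if isinstance(tree, str):                 # a word: leave unchanged
        return tree
    label = tree[0]
    new_label = label if parent is None else f"{label}^{parent}"
    return [new_label] + [parent_annotate(k, label) for k in tree[1:]]

t = ["S", ["NP", ["N", "people"]], ["VP", ["V", "fish"], ["NP", ["N", "tanks"]]]]
print(parent_annotate(t))
# ['S', ['NP^S', ['N^NP', 'people']], ['VP^S', ['V^VP', 'fish'], ['NP^VP', ['N^NP', 'tanks']]]]

Note how the annotation directly encodes the subject-vs-object distinction from the Non-Independence slide.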
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K
Tag Splits
• Problem: treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Other Tag Splits

Split                                                              F1    Size
UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")        80.4  8.1K
UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")      80.5  8.1K
TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)   81.2  8.5K
SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]        81.6  9.0K
SPLIT-CC: separate "but" and "&" from other conjunctions           81.7  9.1K
SPLIT-%: "%" gets its own tag                                      81.8  9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes.

Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights.
• Solution: mark a property of higher or lower sites:
  • Contains a verb.
  • Is (non)-recursive.
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
[figure omitted: an NP–VP–PP attachment configuration annotated with v / -v]

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K
A Fully Annotated Tree [figure omitted]
Final Test Set Results
• Beats "first generation" lexicalized parsers.

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6
Lexicalised PCFGs
[slides omitted: heads in context-free rules; rules to recover heads, with examples for NPs and VPs; adding headwords to trees; lexicalized CFGs in Chomsky Normal Form; a worked example]
Lexicalized CKY
A binary rule now carries heads: X[h] → Y[h] Z[h'] over span (i, j), with head word h inherited from one child, split point k, and dependent head h'.
Example: (VP → VBD[saw] NP[her]) instantiates the rule (VP → VBD NP)[saw].

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max of
      max over k, h', X → Y Z of score(X[h] → Y[h] Z[h'])  * bestScore(Y, i, k, h)  * bestScore(Z, k, j, h')
      max over k, h', X → Y Z of score(X[h] → Y[h'] Z[h])  * bestScore(Y, i, k, h') * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n^5) lexicalized CKY.
  • Remember only a few hypotheses for each span <i, j>.
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span: each of the O(n) split points combines K hypotheses from the left child span with K from the right.
  • Keeps things more or less cubic.
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
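A per-cell beam is easy to sketch (my own illustration, not the Collins code; assume cell maps the (label, head) hypotheses for one span to scores):

import heapq

def prune_cell(cell, K=10, ratio=1e-4):
    # cell: dict mapping (label, head) -> score for one span <i, j>
    if not cell:
        return cell
    best = max(cell.values())
    top = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    # keep at most K hypotheses, dropping anything far below the cell's best
    return {k: v for k, v in top if v >= best * ratio}

With K hypotheses surviving per span, the O(n^5) head-indexed chart stays roughly cubic in practice, as the bullet above notes.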
Parameter Estimation
A Model from Charniak (1997) — [slides omitted]
Other Details — [slides omitted]
Final Test Set Results

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6
[slides omitted: Analysis/Evaluation (Method 2); Dependency Accuracies; Strengths and Weaknesses of Modern Parsers; Modern Parsers]
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve the statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
• Brackets are known.
• Base categories are known.
• Hidden variables for subcategories.
[figure omitted: a tree over "He was right" with latent symbols X1 … X7 at its nodes]
• Can learn with EM, like Forward-Backward for HMMs: Backward = Inside, Forward = Outside.
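A sketch of the Inside half of that computation — the CKY loop from earlier with max replaced by sum (my own illustration; unary rules are omitted for brevity, and binary/lexicon are the toy tables used in the CKY sketch):

from collections import defaultdict

def inside(words, binary, lexicon):
    # alpha[i, j, A] = P(A derives words[i:j]); unary rules omitted for brevity
    n = len(words)
    alpha = defaultdict(float)
    for i, w in enumerate(words):
        for (A, word), p in lexicon.items():
            if word == w:
                alpha[i, i + 1, A] += p
    for span in range(2, n + 1):
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    alpha[begin, end, A] += p * alpha[begin, split, B] * alpha[split, end, C]
    return alpha

The Outside scores are computed by an analogous top-down pass; together they give the expected rule counts that EM needs.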
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories.
• Advantages:
  • Automatically learned.
• Disadvantages:
  • Grammar gets too large.
  • Most categories are oversplit, while others are undersplit.

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7
Refinement of the DT tag: DT → DT-1, DT-2, DT-3, DT-4
Hierarchical refinement: repeatedly learn more fine-grained subcategories.
• Start with two (per non-terminal), then keep splitting.
• Initialize each EM run with the output of the last.
Adaptive Splitting [Petrov et al. 06]
• Want to split complex categories more.
• Idea: split everything, then roll back the splits which were least useful.
• Evaluate the loss in likelihood from removing each split: data likelihood with the split reversed vs. data likelihood with the split.
• No loss in accuracy when 50% of the splits are reversed.
Adaptive Splitting Results

Model             F1
Previous          88.4
With 50% Merging  89.5

Number of Phrasal Subcategories [chart omitted]
Final Results

Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al. '05    86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al. 06        90.2             89.7
Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity:
  coarse:         … QP NP VP …
  split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight: … and so on.
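A sketch of the pruning test this implies (my own illustration; assume posterior(i, j, X) returns the bracket posterior under the coarser grammar, and project maps a fine symbol to its coarse ancestor):

def allowed(begin, end, fine_symbol, posterior, project, threshold=1e-4):
    # Build a fine-grammar constituent only if its coarse projection is
    # not vanishingly unlikely over this span under the coarser grammar.
    return posterior(begin, end, project(fine_symbol)) >= threshold

Each pass prunes the chart for the next, finer pass, which is what makes the large split grammars practical.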
Bracket Posteriors
[charts omitted: bracket posteriors at successive refinement levels]
Parsing time drops from 1621 min to 111 min to 35 min to 15 min [91.2 F1] (no search error).
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V NP PP
S V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP
N → people
N → fish
N → tanks
N → rods
V → people
V → fish
V → tanks
P → with
Chomsky Normal Form steps
After removing empties and collapsing unaries:
S → NP VP
VP → V NP
S → V NP
VP → V NP PP
S → V NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
After binarizing the ternary rules:
S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
(lexical rules as above)
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
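To make the transformation concrete, here is a minimal sketch in Python; the tuple tree encoding and the helper name are illustrative, not from the slides.

    # Minimal right-binarization sketch. Trees are (label, children...) tuples;
    # intermediate symbols are named after the parent and its first child,
    # so VP -> V NP PP becomes VP -> V VP_V, VP_V -> NP PP, as in the rules above.
    def binarize(tree):
        if isinstance(tree, str):                # a bare word
            return tree
        if len(tree) <= 3:                       # unary or already binary: recurse only
            return tuple([tree[0]] + [binarize(c) for c in tree[1:]])
        label, first = tree[0], binarize(tree[1])
        rest_label = f"{label}_{first[0]}"       # e.g. "VP_V"
        rest = binarize(tuple([rest_label] + list(tree[2:])))
        return (label, first, rest)

    vp = ("VP", ("V", "fish"), ("NP", ("N", "tanks")),
          ("PP", ("P", "with"), ("NP", ("N", "rods"))))
    print(binarize(vp))
    # ('VP', ('V', 'fish'), ('VP_V', ('NP', ('N', 'tanks')), ('PP', ('P', 'with'), ('NP', ('N', 'rods')))))

With the symbol names recorded this way, the detransform mentioned above is just a matter of splicing each intermediate node's children back into its parent.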
An example before binarization…
[Tree: (ROOT (S (NP (N people)) (VP (V fish) (NP (N tanks)) (PP (P with) (NP (N rods))))))]
After binarization…
[Tree: (ROOT (S (NP (N people)) (VP (V fish) (VP_V (NP (N tanks)) (PP (P with) (NP (N rods)))))))]
Treebank: empties and unaries
[Trees for the sentence "Atone", from most to least annotated:
 PTB Tree:    (ROOT (S-HLN (NP-SBJ (-NONE- e)) (VP (VB Atone))))
 NoFuncTags:  (ROOT (S (NP (-NONE- e)) (VP (VB Atone))))
 NoEmpties:   (ROOT (S (VP (VB Atone))))
 NoUnaries:   high attachment (ROOT (S Atone)) vs. low attachment (ROOT (VB Atone))]
Parsing
Constituency Parsing
fish people fish tanks
Rule          Prob θi
S → NP VP     θ0
NP → NP NP    θ1
…
N → fish      θ42
N → people    θ43
V → fish      θ44
…
PCFG
[Parse tree: (S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks))))]
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people: NP 0.35, V 0.1, N 0.5    fish: VP 0.06, NP 0.14, V 0.6, N 0.2
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
  • Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser
bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k, X->YZ of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
The CKY algorithm (1960/1965)… extended to unaries
function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A->B in grammar
          prob = P(A->B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A->B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A->B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
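A runnable Python sketch of the pseudocode above; the grammar encoding and the restriction to the rules this sentence needs are my own choices, not the slides'.

    from collections import defaultdict

    unary  = {("NP", "N"): 0.7, ("VP", "V"): 0.1, ("S", "VP"): 0.1}
    binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
              ("NP", "NP", "NP"): 0.1, ("PP", "P", "NP"): 1.0}
    lexicon = {("N", "fish"): 0.2, ("V", "fish"): 0.6,
               ("N", "people"): 0.5, ("V", "people"): 0.1,
               ("N", "tanks"): 0.2, ("V", "tanks"): 0.3}

    def cky(words):
        n = len(words)
        score = defaultdict(float)          # (begin, end, A) -> best probability
        back = {}                           # (begin, end, A) -> backpointer
        def apply_unaries(b, e):
            added = True
            while added:                    # unary closure, as in the pseudocode
                added = False
                for (A, B), p in unary.items():
                    prob = p * score[b, e, B]
                    if prob > score[b, e, A]:
                        score[b, e, A], back[b, e, A] = prob, ("U", B)
                        added = True
        for i, w in enumerate(words):       # lexical cells
            for (A, word), p in lexicon.items():
                if word == w:
                    score[i, i + 1, A] = p
                    back[i, i + 1, A] = ("T", w)
            apply_unaries(i, i + 1)
        for span in range(2, n + 1):        # binary combinations, then unaries
            for begin in range(0, n - span + 1):
                end = begin + span
                for split in range(begin + 1, end):
                    for (A, B, C), p in binary.items():
                        prob = score[begin, split, B] * score[split, end, C] * p
                        if prob > score[begin, end, A]:
                            score[begin, end, A] = prob
                            back[begin, end, A] = ("B", split, B, C)
                apply_unaries(begin, end)
        return score, back

    score, back = cky("fish people fish tanks".split())
    print(score[0, 4, "S"])   # 0.00018522, matching the worked chart below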
The grammar: binary, no epsilons
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
The DP chart for "fish people fish tanks" (fenceposts 0–4) has one cell per span:
score[0][1]  score[0][2]  score[0][3]  score[0][4]
             score[1][2]  score[1][3]  score[1][4]
                          score[2][3]  score[2][4]
                                       score[3][4]
First, fill the lexical cells:
for i = 0; i < #(words); i++
  for A in nonterms
    if A -> words[i] in grammar
      score[i][i+1][A] = P(A -> words[i])
score[0][1] (fish):   N→fish 0.2, V→fish 0.6
score[1][2] (people): N→people 0.5, V→people 0.1
score[2][3] (fish):   N→fish 0.2, V→fish 0.6
score[3][4] (tanks):  N→tanks 0.2, V→tanks 0.3
Then apply unaries in each lexical cell:
// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A->B in grammar
      prob = P(A->B) * score[i][i+1][B]
      if prob > score[i][i+1][A]
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true
score[0][1]: N→fish 0.2, V→fish 0.6, NP→N 0.14, VP→V 0.06, S→VP 0.006
score[1][2]: N→people 0.5, V→people 0.1, NP→N 0.35, VP→V 0.01, S→VP 0.001
score[2][3]: N→fish 0.2, V→fish 0.6, NP→N 0.14, VP→V 0.06, S→VP 0.006
score[3][4]: N→tanks 0.2, V→tanks 0.3, NP→N 0.14, VP→V 0.03, S→VP 0.003
Next, combine adjacent cells with the binary rules (span 2):
prob = score[begin][split][B] * score[split][end][C] * P(A->B C)
if prob > score[begin][end][A]
  score[begin][end][A] = prob
  back[begin][end][A] = new Triple(split, B, C)
score[0][2]: NP→NP NP 0.0049, VP→V NP 0.105, S→NP VP 0.00126
score[1][3]: NP→NP NP 0.0049, VP→V NP 0.007, S→NP VP 0.0189
score[2][4]: NP→NP NP 0.00196, VP→V NP 0.042, S→NP VP 0.00378
(length-1 cells as above)
Then apply unaries to the span-2 cells; S→VP now wins in two of them:
// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    prob = P(A->B) * score[begin][end][B]
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = B
      added = true
score[0][2]: NP→NP NP 0.0049, VP→V NP 0.105, S→VP 0.0105
score[1][3]: NP→NP NP 0.0049, VP→V NP 0.007, S→NP VP 0.0189
score[2][4]: NP→NP NP 0.00196, VP→V NP 0.042, S→VP 0.0042
For spans of length 3 and up, also iterate over split points:
for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A->B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)
score[0][3]: NP→NP NP 0.0000686, VP→V NP 0.00147, S→NP VP 0.000882
The second length-3 span:
score[1][4]: NP→NP NP 0.0000686, VP→V NP 0.000098, S→NP VP 0.01323
And finally the full span:
score[0][4]: NP→NP NP 0.0000009604, VP→V NP 0.00002058, S→NP VP 0.00018522
Call buildTree(score, back) to get the best parse
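A sketch of buildTree: follow the back-pointers down from the best root, assuming the (score, back) encoding of the Python CKY sketch above.

    def build_tree(back, begin, end, label):
        bp = back.get((begin, end, label))
        if bp[0] == "T":                          # terminal: (label, word)
            return (label, bp[1])
        if bp[0] == "U":                          # unary: rebuild the single child
            return (label, build_tree(back, begin, end, bp[1]))
        _, split, B, C = bp                       # binary: recurse on both halves
        return (label,
                build_tree(back, begin, split, B),
                build_tree(back, split, end, C))

    print(build_tree(back, 0, 4, "S"))
    # ('S', ('NP', ('NP', ('N', 'fish')), ('NP', ('N', 'people'))),
    #       ('VP', ('V', 'fish'), ('NP', ('N', 'tanks'))))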
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
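The same computation as a small Python sketch; the bracket sets are (label, start, end) triples taken from the example above.

    gold = {("S",0,11), ("NP",0,2), ("VP",2,9), ("VP",3,9),
            ("NP",4,6), ("PP",6,9), ("NP",7,9), ("NP",9,10)}
    cand = {("S",0,11), ("NP",0,2), ("VP",2,10), ("VP",3,10),
            ("NP",4,6), ("PP",6,10), ("NP",7,10)}

    matched = len(gold & cand)                   # 3 matching labeled brackets
    precision = matched / len(cand)              # 3/7 = 42.9%
    recall = matched / len(gold)                 # 3/8 = 37.5%
    f1 = 2 * precision * recall / (precision + recall)   # 40.0%
    print(f"LP {precision:.1%}  LR {recall:.1%}  F1 {f1:.1%}")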
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Gives a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)
A Case of PP Attachment Ambiguity
A Case of Coordination Ambiguity
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Figure: S → NP VP, NP → DT NN; the NP's expansion is independent of its context given the label NP]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)
[Bar charts of NP expansions — All NPs: NP PP 11%, DT NN 9%, PRP 6%; NPs under S: NP PP 9%, DT NN 9%, PRP 21%; NPs under VP: NP PP 23%, DT NN 7%, PRP 4%]
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
[Figure caption: in the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states
[Charts: F1 (roughly 70–74) and number of symbols (0–12,000) vs. horizontal Markov order {0, 1, 2v, 2, inf}]
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
[Figure: Order 1 vs. Order 2 trees]
[Charts: F1 (roughly 72–79) and number of symbols (0–25,000) vs. vertical Markov order {1, 2v, 2, 3v, 3}]
Model    F1    Size
v=h=2v   77.8  7.5K
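A minimal sketch of order-2 vertical Markovization (parent annotation) as a tree transform; the tuple encoding matches the earlier sketches and the names are illustrative.

    def parent_annotate(tree, parent="ROOT"):
        # leave words and preterminal (tag, word) pairs alone
        if isinstance(tree, str) or (len(tree) == 2 and isinstance(tree[1], str)):
            return tree
        label = tree[0]
        new_label = f"{label}^{parent}"           # e.g. NP^S vs. NP^VP
        return tuple([new_label] + [parent_annotate(c, label) for c in tree[1:]])

    t = ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"), ("NP", ("N", "tanks"))))
    print(parent_annotate(t))
    # ('S^ROOT', ('NP^S', ('N', 'people')), ('VP^S', ('V', 'fish'), ('NP^VP', ('N', 'tanks'))))

Reading a PCFG off the annotated trees gives exactly the subject/object NP distinction that the Non-Independence slide above calls for.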
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U
Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K
Tag Splits
• Problem: treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag
Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag
Annotation  F1    Size
UNARY-DT    80.4  8.1K
UNARY-RB    80.5  8.1K
TAG-PA      81.2  8.5K
SPLIT-AUX   81.6  9.0K
SPLIT-CC    81.7  9.1K
SPLIT-%     81.8  9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes
Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
Annotation    F1    Size
Previous      85.7  10.5K
BASE-NP       86.0  11.7K
DOMINATES-V   86.9  14.1K
RIGHT-REC-NP  87.0  15.2K
[Figure: NP attachment sites into VP vs. PP, marked v / -v]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers
Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
Lexicalized CKY
[Diagram: X[h] over fenceposts (i, j), built from Y[h] over (i, k) and Z[h'] over (k, j)]
(VP → VBD[saw] NP[her])
(VP → VBD NP)[saw]
bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max of
      max over k, w, X->YZ: score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, X->YZ: score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
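An iterative Python sketch of the same recurrence, in fencepost convention; rule_score and tag_score are assumed scoring functions, not the slides' model.

    from collections import defaultdict

    def lex_cky(words, rules, rule_score, tag_score):
        """best[X, i, j, h]: best score for X over fenceposts [i, j), headed by word h."""
        n = len(words)
        best = defaultdict(float)
        for i in range(n):                           # width-1 spans: tag each word
            for X, p in tag_score(words[i]).items():
                best[X, i, i + 1, i] = p
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):            # split point
                    for (X, Y, Z) in rules:          # binary rule skeletons
                        for hy in range(i, k):       # head word of the left child
                            ly = best[Y, i, k, hy]
                            if ly == 0.0: continue
                            for hz in range(k, j):   # head word of the right child
                                lz = best[Z, k, j, hz]
                                if lz == 0.0: continue
                                for h in (hy, hz):   # parent inherits one child's head
                                    s = rule_score(X, h, Y, hy, Z, hz) * ly * lz
                                    if s > best[X, i, j, h]:
                                        best[X, i, j, h] = s
        return best

The five nested position loops (span start, width, split, left head, right head) are where the O(n^5) in the next slide comes from.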
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
[Diagram: X[h] over (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j)]
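A tiny sketch of the per-cell beam itself; the hypothesis tuple layout and the beam size are illustrative.

    import heapq

    K = 30                                     # beam size; value is illustrative

    def prune_cell(hypotheses, k=K):
        """Keep the k best (score, label, head, backpointer) items for one span <i,j>."""
        return heapq.nlargest(k, hypotheses, key=lambda h: h[0])

With at most K hypotheses in each child cell, each of the n split points contributes O(K^2) combinations, giving the O(nK^2) per-span cost quoted above.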
Parameter Estimation
A Model from Charniak (1997)
Other Details
Final Test Set Results
Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis/Evaluation (Method 2)
Dependency Accuracies
Strengths and Weaknesses of Modern Parsers
Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
Latent Annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Figure: tree with latent subcategory variables X1…X7 over "He was right"]
Can learn with EM, like Forward-Backward for HMMs (Forward/Outside, Backward/Inside)
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit
Model                F1
Klein & Manning '03  86.3
Matsuzaki et al '05  86.7
Refinement of the DT tag
[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories (sketched below)
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last
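A high-level sketch of that split/EM loop; split_symbols and em_step are assumed placeholders for the grammar-splitting and inside-outside steps, not code from the slides.

    def hierarchical_refinement(grammar, treebank, rounds=3, em_iters=50):
        for _ in range(rounds):
            # double each symbol's subcategories (DT -> DT-1, DT-2), adding a
            # little noise so EM can break the symmetry between the two halves
            grammar = split_symbols(grammar)
            for _ in range(em_iters):
                # fit the refined grammar, initialized from the previous run's output
                grammar = em_step(grammar, treebank)
        return grammar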
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al 06]
• Evaluate the loss in likelihood from removing each split: the ratio of the data likelihood with the split reversed to the data likelihood with the split (see the formula below)
• No loss in accuracy when 50% of the splits are reversed
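Written out, the criterion from the bullet above is the likelihood ratio (LaTeX notation; the interpretation of "least useful" as the largest ratio is my reading):

    \Delta_{s} = \frac{P(\text{data} \mid \text{split } s \text{ reversed})}{P(\text{data} \mid \text{split } s \text{ kept})}

Splits with \Delta_{s} closest to 1 cost the least likelihood and are the ones rolled back.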
Adaptive Splitting Results
Model             F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories
Final Results
Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al '05     86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al 06         90.2             89.7
Hierarchical Pruning
Parse multiple times with grammars at different levels of granularity:
coarse:          … QP  NP  VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … …
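A coarse-to-fine sketch of that idea: each pass masks the chart for the next, finer grammar. inside_outside_posteriors, grammar.refinements, viterbi_parse, and the threshold are assumed placeholders.

    THRESHOLD = 1e-4                      # posterior cutoff; value illustrative

    def coarse_to_fine(words, grammars):
        allowed = None                    # no mask for the first (coarsest) pass
        for g in grammars[:-1]:           # coarse -> split in two -> four -> …
            post = inside_outside_posteriors(words, g, allowed)
            survivors = {(i, j, sym) for (i, j, sym), p in post.items()
                         if p > THRESHOLD}
            # project surviving symbols down to the next grammar's refined symbols
            allowed = {(i, j, ref) for (i, j, sym) in survivors
                       for ref in g.refinements(sym)}
        return viterbi_parse(words, grammars[-1], allowed)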
Bracket Posteriors
[Figure: bracket posteriors as the grammar is refined; parsing time drops 1621 min → 111 min → 35 min → 15 min, at 91.2 F1 with no search error]
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
A phrase structure grammar
S NP VP
VP V NP
VP V NP PP
NP NP NP
NP NP PP
NP N
NP e
PP P NP
N people
N fish
N tanks
N rods
V people
V fish
V tanks
P with
Chomsky Normal Form steps
S NP VP
VP V NP
S V NP
VP V VP_V
VP_V NP PP
S V S_V
S_V NP PP
VP V PP
S V PP
NP NP NP
NP NP PP
NP P NP
PP P NP
NP people
NP fish
NP tanks
NP rods
V people
S people
VP people
V fish
S fish
VP fish
V tanks
S tanks
VP tanks
P with
PP with
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
• Contains a verb
• Is (non-)recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs
[Figure: a PP attaching under VP (site marked v) vs. under NP (site marked -v)]

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6
Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example
Lexicalized CKY
[Figure: X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j); e.g. (VP → VBD[saw] NP[her]) vs. (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max( max over k, X->YZ of
                  score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w),
                max over k, X->YZ of
                  score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h) )
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
• Essentially, run the O(n^5) CKY
• Remember only a few hypotheses for each span <i,j> (see the sketch below)
• If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
• Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
[Figure: X[h] over span (i, j) from Y[h] over (i, k) and Z[h'] over (k, j)]
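A per-cell beam is easy to sketch: each chart cell keeps only its K best-scored hypotheses. This is an illustrative fragment with hypothetical names, not the Collins parser's actual bookkeeping:

    import heapq

    def prune_cell(cell, K=10):
        """cell: dict mapping (label, head) -> score.
        Keep only the K highest-scoring entries (a per-cell beam)."""
        if len(cell) <= K:
            return cell
        best = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
        return dict(best)

    cell = {("NP", "tortoise"): 0.03, ("NP", "rug"): 0.001, ("VP", "put"): 0.02}
    print(prune_cell(cell, K=2))  # keeps only the two best hypotheses

Since each of the O(n^2) spans now combines at most K left hypotheses with K right hypotheses, the total work is roughly O(n^3 K^2) over all split points, which is the "more or less cubic" behavior claimed above.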
Parameter Estimation

A Model from Charniak (1997)

Other Details
Final Test Set Results

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6
Analysis/Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering
Manual Splits
• Manually split categories:
• NP: subject vs. object
• DT: determiners vs. demonstratives
• IN: sentential vs. prepositional
• Advantages:
• Fairly compact grammar
• Linguistic motivations
• Disadvantages:
• Performance leveled out
• Manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Figure: a tree of latent symbols X1 ... X7 over the sentence "He was right"]
Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
Automatic Annotation Induction
• Label all nodes with latent variables
• Same number k of subcategories for all categories
• Advantages:
• Automatically learned
• Disadvantages:
• Grammar gets too large
• Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al '05    86.7
Refinement of the DT tag
[Figure: DT split into DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, roll back the splits which were least useful [Petrov et al 06]
• Evaluate the loss in likelihood from removing each split = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
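Written out as a formula (a reconstruction of the bullet above, with notation of my own choosing):

\[ \Delta_s \;=\; \frac{P(\text{data} \mid \text{split } s \text{ reversed})}{P(\text{data} \mid \text{split } s \text{ kept})} \]

Splits whose ratio Δ_s is close to 1 bought almost no likelihood and are the first to be merged back.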
Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al '05       86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al 06           90.2              89.7
Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity (a sketch of the pruning step follows)

coarse:          ... QP NP VP ...
split in two:    ... QP1 QP2 NP1 NP2 VP1 VP2 ...
split in four:   ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
split in eight:  ...
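A minimal sketch of the coarse-to-fine idea (names are hypothetical; real implementations prune on normalized posteriors from an inside–outside pass): parse once with the coarse grammar, then only allow a refined symbol in a span if its coarse projection survived there.

    def allowed_fine_symbols(span, coarse_chart, projection, threshold=1e-4):
        """span: (i, j); coarse_chart[(i, j)]: dict coarse_symbol -> posterior.
        projection: fine_symbol -> coarse_symbol (e.g. 'NP2' -> 'NP').
        Returns the fine symbols worth considering in this span."""
        posteriors = coarse_chart.get(span, {})
        return [fine for fine, coarse in projection.items()
                if posteriors.get(coarse, 0.0) > threshold]

    coarse_chart = {(0, 2): {"NP": 0.9, "VP": 1e-6}}
    projection = {"NP1": "NP", "NP2": "NP", "VP1": "VP", "VP2": "VP"}
    print(allowed_fine_symbols((0, 2), coarse_chart, projection))  # ['NP1', 'NP2']

Each refinement level only pays for the symbols its coarser parent left alive, which is what makes the multi-pass scheme fast in practice.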
Bracket Posteriors
[Figure: posterior bracket heatmaps at successive refinement stages; parse times 1621 min, 111 min, 35 min, and 15 min at 91.2 F1 (no search error)]
Chomsky Normal Form steps
S → NP VP
VP → V NP
S → V NP
VP → V VP_V
VP_V → NP PP
S → V S_V
S_V → NP PP
VP → V PP
S → V PP
NP → NP NP
NP → NP PP
NP → P NP
PP → P NP
NP → people
NP → fish
NP → tanks
NP → rods
V → people
S → people
VP → people
V → fish
S → fish
VP → fish
V → tanks
S → tanks
VP → tanks
P → with
PP → with
Chomsky Normal Form
• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain:
• Reconstructing n-aries is easy
• Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
An example, before binarization...

(ROOT (S (NP (N people))
         (VP (V fish)
             (NP (N tanks))
             (PP (P with) (NP (N rods))))))
After binarization...

(ROOT (S (NP (N people))
         (VP (V fish)
             (VP_V (NP (N tanks))
                   (PP (P with) (NP (N rods)))))))
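The transform in the figure is mechanical. Here is a small sketch (the tree-as-nested-lists encoding is my own assumption) that left-factors any node with more than two children into an intermediate node named parent_firstchild, exactly as VP → V VP_V above:

    def binarize(tree):
        """tree = [label, child1, child2, ...]; leaves are plain strings.
        The left corner stays; the remaining children move under a new
        label_firstchild node, as in VP -> V VP_V in the figure."""
        if isinstance(tree, str):
            return tree
        label = tree[0]
        children = [binarize(c) for c in tree[1:]]
        if len(children) > 2:
            head = children[0]
            head_label = head if isinstance(head, str) else head[0]
            rest = binarize([label + "_" + head_label] + children[1:])
            children = [head, rest]
        return [label] + children

    vp = ["VP", ["V", "fish"], ["NP", ["N", "tanks"]],
          ["PP", ["P", "with"], ["NP", ["N", "rods"]]]]
    print(binarize(vp))
    # ['VP', ['V', 'fish'], ['VP_V', ['NP', ['N', 'tanks']],
    #                                ['PP', ['P', 'with'], ['NP', ['N', 'rods']]]]]

Because the intermediate labels record their origin, the n-ary tree can be recovered by simply splicing any node whose label contains "_" back into its parent, which is the "detransform" mentioned above.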
Treebank: empties and unaries

PTB Tree:          (ROOT (S-HLN (NP-SUBJ (-NONE- ε)) (VP (VB Atone))))
NoFuncTags:        (ROOT (S (NP (-NONE- ε)) (VP (VB Atone))))
NoEmpties:         (ROOT (S (VP (VB Atone))))
NoUnaries (high):  (ROOT (S Atone))
NoUnaries (low):   (ROOT (VB Atone))
Parsing
Constituency Parsing
Sentence: fish people fish tanks

PCFG
Rule          Prob θi
S → NP VP     θ0
NP → NP NP    θ1
...
N → fish      θ42
N → people    θ43
V → fish      θ44
...
[Tree: (S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks))))]
Cocke-Kasami-Younger (CKY) Constituency Parsing
Sentence: fish people fish tanks

Viterbi (max) scores, e.g. for one-word spans:
  people: NP 0.35, V 0.1, N 0.5
  fish:   VP 0.06, NP 0.14, V 0.6, N 0.2

Grammar:
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm:
• Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated:
• Use fenceposts
• Doesn't increase complexity; essentially like unaries
• Binarization is vital:
• Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k, X->YZ of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
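A direct Python transcription of bestScore (a sketch: the grammar encoding as q_lex[(X, word)] and q_bin[(X, Y, Z)] dictionaries is my own, and unary rules are not handled, so NP → N is folded into the lexicon). Without memoization the recursion is exponential; the lru_cache line is what turns it into the dynamic program of the CKY slides that follow:

    from functools import lru_cache

    def make_parser(q_lex, q_bin, words):
        @lru_cache(maxsize=None)      # memoization turns the exponential
        def best_score(X, i, j):      # recursion into CKY-style DP
            if i == j:
                return q_lex.get((X, words[i]), 0.0)
            best = 0.0
            for (A, Y, Z), p in q_bin.items():
                if A != X:
                    continue
                for k in range(i, j):  # Y spans i..k, Z spans k+1..j
                    best = max(best,
                               p * best_score(Y, i, k) * best_score(Z, k + 1, j))
            return best
        return best_score

    words = ("fish", "people", "fish", "tanks")
    q_lex = {("N", "fish"): 0.2, ("V", "fish"): 0.6,
             ("N", "people"): 0.5, ("V", "people"): 0.1,
             ("N", "tanks"): 0.2, ("V", "tanks"): 0.3,
             # NP -> N (0.7) folded into the lexicon, since this sketch
             # has no unary handling: NP over "fish" = 0.7 * 0.2 = 0.14
             ("NP", "fish"): 0.14, ("NP", "people"): 0.35, ("NP", "tanks"): 0.14}
    q_bin = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
             ("NP", "NP", "NP"): 0.1}
    best = make_parser(q_lex, q_bin, words)
    print(best("S", 0, 3))   # ~0.00018522, matching the chart walkthrough below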
The CKY algorithm (1960/1965) ... extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
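A runnable Python rendering of the pseudocode above, on the toy grammar of the next slide (a sketch: the dict-based grammar encoding is my choice, and backpointers are omitted, so buildTree would need them added):

    from collections import defaultdict

    def cky(words, lex, unary, binary):
        """lex[(A, w)], unary[(A, B)], binary[(A, B, C)]: rule probabilities.
        Returns score[(i, j)]: best probability of each nonterminal over words[i:j]."""
        n = len(words)
        score = defaultdict(dict)

        def apply_unaries(cell):
            added = True
            while added:                 # closure over chains of unary rules
                added = False
                for (A, B), p in unary.items():
                    if B in cell and p * cell[B] > cell.get(A, 0.0):
                        cell[A] = p * cell[B]
                        added = True

        for i, w in enumerate(words):    # lexical fill
            cell = score[(i, i + 1)]
            for (A, word), p in lex.items():
                if word == w:
                    cell[A] = p
            apply_unaries(cell)

        for span in range(2, n + 1):     # binary fill, shortest spans first
            for begin in range(0, n - span + 1):
                end = begin + span
                cell = score[(begin, end)]
                for split in range(begin + 1, end):
                    left, right = score[(begin, split)], score[(split, end)]
                    for (A, B, C), p in binary.items():
                        if B in left and C in right:
                            prob = p * left[B] * right[C]
                            if prob > cell.get(A, 0.0):
                                cell[A] = prob
                apply_unaries(cell)
        return score

    lex = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}
    unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
    binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
              ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
              ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
              ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
    chart = cky("fish people fish tanks".split(), lex, unary, binary)
    print(chart[(0, 4)]["S"])  # ~0.00018522, as in the walkthrough below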
The grammar (binary, no epsilons):
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
The chart for "fish people fish tanks" (fenceposts 0–4) has one cell per span: score[0][1], score[1][2], score[2][3], score[3][4], score[0][2], score[1][3], score[2][4], score[0][3], score[1][4], score[0][4].

Step 1: lexical rules fill the one-word spans.

  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])

  score[0][1] (fish):   N 0.2, V 0.6
  score[1][2] (people): N 0.5, V 0.1
  score[2][3] (fish):   N 0.2, V 0.6
  score[3][4] (tanks):  N 0.2, V 0.3

Step 2: the unary closure (NP → N, VP → V, S → VP) extends each lexical cell.

  score[0][1] (fish):   N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
  score[1][2] (people): N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001
  score[2][3] (fish):   N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
  score[3][4] (tanks):  N 0.2, V 0.3, NP 0.14, VP 0.03, S 0.003

Step 3: binary rules fill the two-word spans.

  prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
  if prob > score[begin][end][A]
    score[begin][end][A] = prob
    back[begin][end][A] = new Triple(split, B, C)

  score[0][2]: NP (NP NP) 0.0049,   VP (V NP) 0.105,  S (NP VP) 0.00126
  score[1][3]: NP (NP NP) 0.0049,   VP (V NP) 0.007,  S (NP VP) 0.0189
  score[2][4]: NP (NP NP) 0.00196,  VP (V NP) 0.042,  S (NP VP) 0.00378

Step 4: the unary closure improves S via S → VP wherever that beats S → NP VP.

  score[0][2]: NP 0.0049,   VP 0.105,  S (VP) 0.0105
  score[1][3]: NP 0.0049,   VP 0.007,  S (NP VP) 0.0189
  score[2][4]: NP 0.00196,  VP 0.042,  S (VP) 0.0042

Step 5: three-word spans, maximizing over the split point with the same binary loop.

  score[0][3]: NP (NP NP) 0.0000686,  VP (V NP) 0.00147,   S (NP VP) 0.000882
  score[1][4]: NP (NP NP) 0.0000686,  VP (V NP) 0.000098,  S (NP VP) 0.01323

Step 6: the full span.

  score[0][4]: NP (NP NP) 0.0000009604,  VP (V NP) 0.00002058,  S (NP VP) 0.00018522

Call buildTree(score, back) to get the best parse.
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
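The computation above is just set intersection over brackets; a small sketch (encoding brackets as (label, start, end) triples is my assumption):

    def parseval(gold, guess):
        """gold, guess: sets of (label, start, end) brackets."""
        correct = len(gold & guess)
        lp = correct / len(guess)
        lr = correct / len(gold)
        f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
        return lp, lr, f1

    gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
            ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
    guess = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
             ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
    print(parseval(gold, guess))  # (0.4285..., 0.375, 0.4)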
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
• Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
• A PCFG gives some idea of the plausibility of a parse
• But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
• But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Chomsky Normal Form
bull You should think of this as a transformation for efficient parsing
bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform
bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy
bull Reconstructing unariesempties is trickier
bull Binarization is crucial for cubic time CFG parsing
bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
ROOT
S
NP VP
N
people
V NP PP
P NP
rodswithtanksfish
NN
An example before binarizationhellip
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
P NP
rods
N
with
NP
N
people tanksfish
N
VP
V NP PP
VP_V
ROOT
S
After binarizationhellip
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (max) scores, shown for two width-1 cells:
  people: NP 0.35, V 0.1, N 0.5
  fish:   VP 0.06, NP 0.14, V 0.6, N 0.2

S → NP VP     0.9
S → VP        0.1
VP → V NP     0.5
VP → V        0.1
VP → V VP_V   0.3
VP → V PP     0.1
VP_V → NP PP  1.0
NP → NP NP    0.1
NP → NP PP    0.2
NP → N        0.7
PP → P NP     1.0
Extended CKY parsing
• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
  • Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k, X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
The CKY algorithm (1960/1965) … extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)
The grammar (binary, no epsilons):

S → NP VP     0.9
S → VP        0.1
VP → V NP     0.5
VP → V        0.1
VP → V VP_V   0.3
VP → V PP     0.1
VP_V → NP PP  1.0
NP → NP NP    0.1
NP → NP PP    0.2
NP → N        0.7
PP → P NP     1.0
N → people    0.5
N → fish      0.2
N → tanks     0.2
N → rods      0.1
V → people    0.1
V → fish      0.6
V → tanks     0.3
P → with      1.0
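Below is a self-contained sketch of this algorithm run on exactly this grammar (class and variable names are ours; back-pointers are omitted for brevity). It reproduces the cell values traced in the walkthrough below, ending with S → NP VP at 0.00018522 over the whole sentence:

import java.util.*;

// Probabilistic CKY with a unary-closure pass per cell, on the toy
// "fish people fish tanks" grammar from the slides above.
public class MiniCKY {
    static String[][] binary = {            // A -> B C
        {"S","NP","VP"}, {"VP","V","NP"}, {"VP","V","VP_V"}, {"VP","V","PP"},
        {"VP_V","NP","PP"}, {"NP","NP","NP"}, {"NP","NP","PP"}, {"PP","P","NP"}};
    static double[] binaryP = {0.9, 0.5, 0.3, 0.1, 1.0, 0.1, 0.2, 1.0};
    static String[][] unary = {{"S","VP"}, {"VP","V"}, {"NP","N"}};  // A -> B
    static double[] unaryP = {0.1, 0.1, 0.7};

    public static void main(String[] args) {
        String[] words = {"fish", "people", "fish", "tanks"};
        int n = words.length;
        Map<String, Double> lex = new HashMap<>();   // P(tag -> word)
        lex.put("N people", 0.5); lex.put("N fish", 0.2);
        lex.put("N tanks", 0.2);  lex.put("N rods", 0.1);
        lex.put("V people", 0.1); lex.put("V fish", 0.6);
        lex.put("V tanks", 0.3);  lex.put("P with", 1.0);

        // score[i][j]: nonterminal -> best probability over words i..j-1
        Map<String, Double>[][] score = new HashMap[n + 1][n + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= n; j++) score[i][j] = new HashMap<>();

        for (int i = 0; i < n; i++) {                // lexical cells
            for (String tag : new String[]{"N", "V", "P"}) {
                Double p = lex.get(tag + " " + words[i]);
                if (p != null) score[i][i + 1].put(tag, p);
            }
            unaryClosure(score[i][i + 1]);
        }
        for (int span = 2; span <= n; span++)        // wider spans
            for (int begin = 0; begin + span <= n; begin++) {
                int end = begin + span;
                for (int split = begin + 1; split < end; split++)
                    for (int r = 0; r < binary.length; r++) {
                        Double pb = score[begin][split].get(binary[r][1]);
                        Double pc = score[split][end].get(binary[r][2]);
                        if (pb == null || pc == null) continue;
                        double prob = pb * pc * binaryP[r];
                        if (prob > score[begin][end].getOrDefault(binary[r][0], 0.0))
                            score[begin][end].put(binary[r][0], prob);
                    }
                unaryClosure(score[begin][end]);
            }
        System.out.println("best S over [0," + n + "]: " + score[0][n].get("S"));
    }

    // repeatedly apply unary rules A -> B while any cell score improves
    static void unaryClosure(Map<String, Double> cell) {
        boolean added = true;
        while (added) {
            added = false;
            for (int r = 0; r < unary.length; r++) {
                Double pb = cell.get(unary[r][1]);
                if (pb == null) continue;
                double prob = unaryP[r] * pb;
                if (prob > cell.getOrDefault(unary[r][0], 0.0)) {
                    cell.put(unary[r][0], prob);
                    added = true;
                }
            }
        }
    }
}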
[Chart layout: a triangular table of cells score[i][j], 0 ≤ i < j ≤ 4, over the words fish (0:1), people (1:2), fish (2:3), tanks (3:4)]
Step 1: fill the lexical cells.

for i = 0; i < #(words); i++
  for A in nonterms
    if A -> words[i] in grammar
      score[i][i+1][A] = P(A -> words[i])

  fish [0,1]:   N → fish 0.2,   V → fish 0.6
  people [1,2]: N → people 0.5, V → people 0.1
  fish [2,3]:   N → fish 0.2,   V → fish 0.6
  tanks [3,4]:  N → tanks 0.2,  V → tanks 0.3
Step 2: unary closure on the lexical cells.

// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A -> B in grammar
      prob = P(A -> B) * score[i][i+1][B]
      if prob > score[i][i+1][A]
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true

  fish [0,1]:   N 0.2, V 0.6; NP → N 0.14, VP → V 0.06, S → VP 0.006
  people [1,2]: N 0.5, V 0.1; NP → N 0.35, VP → V 0.01, S → VP 0.001
  fish [2,3]:   N 0.2, V 0.6; NP → N 0.14, VP → V 0.06, S → VP 0.006
  tanks [3,4]:  N 0.2, V 0.3; NP → N 0.14, VP → V 0.03, S → VP 0.003
Step 3: apply binary rules over the span-2 cells.

prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
if prob > score[begin][end][A]
  score[begin][end][A] = prob
  back[begin][end][A] = new Triple(split, B, C)

  [0,2]: NP → NP NP 0.0049,  VP → V NP 0.105, S → NP VP 0.00126
  [1,3]: NP → NP NP 0.0049,  VP → V NP 0.007, S → NP VP 0.0189
  [2,4]: NP → NP NP 0.00196, VP → V NP 0.042, S → NP VP 0.00378
Step 4: unary closure on the span-2 cells.

  [0,2]: S → VP 0.0105 now beats S → NP VP 0.00126
  [1,3]: S → NP VP 0.0189 (unchanged)
  [2,4]: S → VP 0.0042 now beats S → NP VP 0.00378
Step 5: span-3 cell [0,3], maximizing over split points.

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
    …

  [0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
Step 6: span-3 cell [1,4].

  [1,4]: NP → NP NP 0.0000686, VP → V NP 0.0000098, S → NP VP 0.01323
Step 7: the full span [0,4].

  [0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522
Call buildTree(score, back) to get the best parse.
Evaluating constituency parsing
Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets:     S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)
Labeled Precision:  3/7 = 42.9%
Labeled Recall:     3/8 = 37.5%
LP/LR F1:           40.0%
Tagging Accuracy:   11/11 = 100.0%
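For reference, F1 is the harmonic mean of labeled precision and recall, which is where the 40.0% comes from:

\[ F_1 \;=\; \frac{2PR}{P+R} \;=\; \frac{2 \times 0.429 \times 0.375}{0.429 + 0.375} \;\approx\; 0.400 \]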
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree, given its POS)
A Case of PP Attachment Ambiguity
A Case of Coordination Ambiguity
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Diagram: an S → NP VP tree in which the NP expands by NP → DT NN independently of the surrounding context]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)
[Charts: relative frequencies of the expansions NP → PP, NP → DT NN, and NP → PRP, compared across all NPs, NPs under S, and NPs under VP]
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
[Figure: in the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states
[Charts: parse accuracy (F1, roughly 70–74) and grammar size (symbols, up to roughly 12,000) against horizontal Markov order 0, 1, 2v, 2, ∞]
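For intuition, here is how the intermediate symbols created by binarizing a rule such as VP → VBZ NP PP PP collapse as the horizontal order h shrinks (the symbol names are illustrative, not the slides' exact notation):

  h = ∞: @VP->_VBZ, @VP->_VBZ_NP, @VP->_VBZ_NP_PP   (remember every sibling generated so far)
  h = 1: @VP->…_VBZ, @VP->…_NP, @VP->…_PP           (remember only the previous sibling; the states for the two PPs merge)
  h = 0: @VP                                        (all intermediate states merge)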
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
[Example trees: order 1 vs. order 2 (parent-annotated)]
[Charts: parse accuracy (F1, roughly 72–79) and grammar size (symbols, up to roughly 25,000) against vertical Markov order 1, 2v, 2, 3v, 3]
Model    F1    Size
v=h=2v   77.8  75K
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  75K
UNARY       78.3  80K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation  F1    Size
Previous    78.3  80K
SPLIT-IN    80.3  81K
Other Tag Splits

Annotation                                                          F1    Size
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")       80.4  81K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")     80.5  81K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)  81.2  85K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]       81.6  90K
• SPLIT-CC: separate "but" and "&" from other conjunctions          81.7  91K
• SPLIT-%: "%" gets its own tag                                     81.8  93K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation  F1    Size
tag splits  82.3  97K
POSS-NP     83.1  98K
SPLIT-VP    85.7  105K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation    F1    Size
Previous      85.7  105K
BASE-NP       86.0  117K
DOMINATES-V   86.9  141K
RIGHT-REC-NP  87.0  152K

[Diagram: NP and VP attachment sites in a PP-attachment configuration, marked v / -v]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs (a sketch of such a rule follows below)
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
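The head-rule tables themselves were slide figures; below is a simplified sketch of a Collins-style head rule for NPs (the priority lists are abbreviated from Collins '99, and the class and method names are ours):

import java.util.*;

// Simplified Collins-style head finding for NP nodes: scan the children
// in a specified direction for the first label on a priority list.
class NPHeadFinder {
    // first, search right-to-left for a nominal tag...
    static final List<String> RIGHT_TO_LEFT =
        List.of("NN", "NNP", "NNPS", "NNS", "NX", "POS", "JJR");
    // ...then left-to-right for an embedded NP
    static final List<String> LEFT_TO_RIGHT = List.of("NP");

    static int findHead(List<String> childLabels) {
        for (String tag : RIGHT_TO_LEFT)
            for (int i = childLabels.size() - 1; i >= 0; i--)
                if (childLabels.get(i).equals(tag)) return i;
        for (String tag : LEFT_TO_RIGHT)
            for (int i = 0; i < childLabels.size(); i++)
                if (childLabels.get(i).equals(tag)) return i;
        return childLabels.size() - 1;   // default: rightmost child
    }

    public static void main(String[] args) {
        // NP -> DT JJ NN: the head is the NN ("the red ball" is headed by "ball")
        System.out.println(findHead(List.of("DT", "JJ", "NN")));  // prints 2
    }
}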
Lexicalized CKY
[Diagram: a chart item X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j)]
e.g. (VP -> VBD[saw] NP[her]) is the rule (VP -> VBD NP) headed by [saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max of:
      max over k, X -> Y Z, head word w of
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, X -> Y Z, head word w of
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) lexicalized CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
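A minimal sketch of such a per-cell beam (the Hyp type, the beam width K = 10, and all names here are ours, not the Collins parser's actual code). Keeping only K hypotheses per cell means combining two cells touches at most K^2 hypothesis pairs, which is the O(nK^2) per-span work quoted above:

import java.util.*;

// One chart cell that retains only its K best-scoring hypotheses.
class BeamCell {
    record Hyp(String label, String headWord, double score) {}

    static final int K = 10;   // beam width (an assumed value)
    // min-heap: the worst surviving hypothesis sits on top, ready to evict
    final PriorityQueue<Hyp> beam =
        new PriorityQueue<>(Comparator.comparingDouble(Hyp::score));

    // insert a hypothesis, evicting the worst one if the cell is full
    void add(Hyp h) {
        if (beam.size() < K) { beam.add(h); return; }
        if (beam.peek().score() < h.score()) {
            beam.poll();
            beam.add(h);
        }
    }
}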
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Treebank empties and unaries
ROOT
S-HLN
NP-SUBJ VP
VB-NONE-
e Atone
PTB Tree
ROOT
S
NP VP
VB-NONE-
e Atone
NoFuncTags
ROOT
S
VP
VB
Atone
NoEmpties
ROOT
S
Atone
NoUnaries
ROOT
VB
Atone
High Low
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Parsing
66
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Constituency Parsing
fish people fish tanks
Rule Prob θi
S NP VP θ0
NP NP NP θ1
hellip
N fish θ42
N people θ43
V fish θ44
hellip
PCFG
N N V N
VP
NPNP
S
Cocke-Kasami-Younger (CKY) Constituency Parsing
fish people fish tanks
Viterbi (Max) Scores
people fish
NP 035V 01N 05
VP 006NP 014V 06N 02
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
Finally, the length-4 span covering the whole sentence:

cell[0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522
Call buildTree(score, back) to get the best parse.
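To make the walkthrough concrete, here is a minimal runnable sketch of this CKY-with-unaries recognizer in Python: our own transcription of the pseudocode above, with dictionary-based charts as an implementation choice. On "fish people fish tanks" it reproduces the chart, ending with S at 0.00018522 over the whole sentence:

from collections import defaultdict

# Toy grammar from the slides: binary rules, unary rules, and the lexicon.
binary = {
    ("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("VP", "V", "VP_V"): 0.3,
    ("VP", "V", "PP"): 0.1, ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
    ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0,
}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexicon = {
    ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
    ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
    ("V", "tanks"): 0.3, ("P", "with"): 1.0,
}

def cky(words):
    score = defaultdict(float)      # (begin, end, A) -> best probability
    back = {}                       # (begin, end, A) -> backpointer

    def handle_unaries(b, e):       # apply unary rules to a fixpoint
        added = True
        while added:
            added = False
            for (A, B), p in unary.items():
                prob = p * score[b, e, B]
                if prob > score[b, e, A]:
                    score[b, e, A], back[b, e, A] = prob, B
                    added = True

    n = len(words)
    for i, w in enumerate(words):   # the lexicon fills the diagonal
        for (A, word), p in lexicon.items():
            if word == w:
                score[i, i + 1, A] = p
        handle_unaries(i, i + 1)
    for span in range(2, n + 1):
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
                        back[begin, end, A] = (split, B, C)
            handle_unaries(begin, end)
    return score, back

def build_tree(back, words, b, e, A):
    bp = back.get((b, e, A))
    if bp is None:                  # preterminal straight from the lexicon
        return (A, words[b])
    if isinstance(bp, str):         # unary rule A -> bp
        return (A, build_tree(back, words, b, e, bp))
    split, B, C = bp
    return (A, build_tree(back, words, b, split, B),
               build_tree(back, words, split, e, C))

words = "fish people fish tanks".split()
score, back = cky(words)
print(round(score[0, 4, "S"], 10))          # 0.00018522, as in the chart
print(build_tree(back, words, 0, 4, "S"))

The recovered best parse is (S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks)))), matching the S → NP VP backpointer at split 2.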
Evaluating constituency parsing
Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
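As a quick check of the arithmetic, a small sketch that scores the candidate bracket set against the gold set (labels and spans exactly as listed above):

gold = {("S",0,11), ("NP",0,2), ("VP",2,9), ("VP",3,9),
        ("NP",4,6), ("PP",6,9), ("NP",7,9), ("NP",9,10)}
cand = {("S",0,11), ("NP",0,2), ("VP",2,10), ("VP",3,10),
        ("NP",4,6), ("PP",6,10), ("NP",7,10)}
matched = len(gold & cand)                            # 3 correct labeled brackets
precision = matched / len(cand)                       # 3/7 = 42.9%
recall = matched / len(gold)                          # 3/8 = 37.5%
f1 = 2 * precision * recall / (precision + recall)    # 40.0%
print(f"LP={precision:.1%} LR={recall:.1%} F1={f1:.1%}")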
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)
A Case of PP Attachment Ambiguity
A Case of Coordination Ambiguity
Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label
[Tree figure: S → NP VP, NP → DT NN]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)
[Chart of NP expansions. All NPs: NP PP 11%, DT NN 9%, PRP 6%. NPs under S: NP PP 9%, DT NN 9%, PRP 21%. NPs under VP: NP PP 23%, DT NN 7%, PRP 4%]
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
  • In the PTB, this construction is for possessives
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization merges states
[Charts: parsing accuracy (F1 roughly 70-74) and grammar size (symbols, 0-12,000) as a function of horizontal Markov order: 0, 1, 2v, 2, ∞]
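To see what the horizontal order does mechanically, here is a small sketch of right-binarization with order-h sibling histories; the tree, labels, and @-naming scheme are our own illustration, not from the slides. Lower h merges more intermediate states (h=0 collapses them all for a given parent):

def binarize(tree, h=1):
    """Right-binarize a nested-tuple tree; intermediate symbols remember
    at most the last h already-generated sibling labels (horizontal order h)."""
    label, *kids = tree
    if not isinstance(kids[0], tuple):      # preterminal: (POS, word)
        return tree
    kids = [binarize(k, h) for k in kids]
    if len(kids) <= 2:
        return (label, *kids)
    def rest(i):                            # binarize kids[i:] under an intermediate
        if i == len(kids) - 1:
            return kids[i]
        hist = [k[0] for k in kids[max(0, i - h):i]]    # last h sibling labels
        inter = "@%s->...%s" % (label, "_".join(hist))
        return (inter, kids[i], rest(i + 1))
    return (label, kids[0], rest(1))

np = ("NP", ("DT", "the"), ("JJ", "big"), ("JJ", "green"), ("NN", "tank"))
print(binarize(np, h=1))   # distinct intermediates @NP->...DT and @NP->...JJ
print(binarize(np, h=0))   # every intermediate for NP collapses to @NP->...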
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)
[Tree figures: Order 1 vs. Order 2]
[Charts: parsing accuracy (F1 roughly 72-79) and grammar size (symbols, 0-25,000) as a function of vertical Markov order: 1, 2v, 2, 3v, 3]
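Order-2 vertical Markovization amounts to parent-annotating the treebank before reading off rules. A minimal sketch on nested-tuple trees (our own representation):

def parent_annotate(tree, parent="ROOT"):
    """Order-2 vertical Markovization: append the parent label to each
    phrasal node, so e.g. NP^S and NP^VP become different symbols."""
    label, *children = tree
    if not isinstance(children[0], tuple):   # preterminal: leave POS tags alone
        return tree
    return (f"{label}^{parent}",
            *(parent_annotate(c, label) for c in children))

t = ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('PRP', 'He')), ('VP^S', ('VBD', 'was'), ('ADJP^VP', ('JJ', 'right'))))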
Model    F1    Size
v=h=2v   77.8  7.5K
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Other Tag Splits

Split                                                                  F1    Size
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")          80.4  8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")        80.5  8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)     81.2  8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]          81.6  9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions             81.7  9.1K
• SPLIT-%: "%" gets its own tag                                        81.8  9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation    F1    Size
Previous      85.7  10.5K
BASE-NP       86.0  11.7K
DOMINATES-V   86.9  14.1K
RIGHT-REC-NP  87.0  15.2K
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
Lexicalized CKY
[Chart diagram: X[h] over span i..j built from Y[h] over i..k and Z[h'] over k..j]
(VP → VBD[saw] NP[her])  =  (VP → VBD NP)[saw]

bestScore(X, i, j, h)
    if (j == i)
        return score(X -> s[i])
    else
        return max over split points k and rules X -> Y Z of:
            score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)   // head from left child
            score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)   // head from right child
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
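A per-cell beam is simple to state in code. The sketch below illustrates the idea with hypothetical cell contents; it is not the Collins parser's actual data structures:

import heapq

def prune_cell(hypotheses, K):
    """Keep the K best (prob, label, backpointer) entries for one span <i,j>."""
    return heapq.nlargest(K, hypotheses, key=lambda h: h[0])

# Combining two beamed cells considers at most K x K pairs per split point,
# so with O(n) split points a span costs O(n * K^2) work, keeping parsing
# roughly cubic overall.
left = [(0.105, "VP", None), (0.0049, "NP", None)]    # hypothetical cell
right = [(0.14, "NP", None)]                          # hypothetical cell
pairs = [(lp * rp, (lx, rx)) for lp, lx, _ in left for rp, rx, _ in right]
print(prune_cell([(p, lab, None) for p, lab in pairs], K=2))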
Parameter Estimation
A Model from Charniak (1997)
Other Details

Final Test Set Results
Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6
Analysis / Evaluation (Method 2)
Dependency Accuracies
Strengths and Weaknesses of Modern Parsers
Modern Parsers

The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve the statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
Latent annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Tree figure: latent symbols X1 ... X7 over "He was right"]
Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside passes).
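The quantities EM needs are inside and outside scores, and the inside score is just the CKY recursion with max replaced by sum. A memoized sketch (our own; unary rules are omitted for brevity):

from functools import lru_cache

def make_inside(words, binary, lexicon):
    """inside(begin, end, A) = P(words[begin:end] | A), summing over derivations."""
    @lru_cache(maxsize=None)
    def inside(begin, end, A):
        if end - begin == 1:
            return lexicon.get((A, words[begin]), 0.0)
        return sum(p * inside(begin, split, B) * inside(split, end, C)
                   for split in range(begin + 1, end)
                   for (A2, B, C), p in binary.items() if A2 == A)
    return inside

With the binary and lexicon tables from the CKY sketch earlier, make_inside(tuple(words), binary, lexicon)(0, 4, "S") sums over all parses of the toy sentence instead of maximizing.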
Automatic Annotation Induction
• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7
Refinement of the DT tag
[Figure: DT split into DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last
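A sketch of the split step under simplifications of our own: each binary rule's mass is spread over the refined variants with a little random noise to break symmetry, then renormalized per refined parent, before the next EM run:

import random

def split_binary_rules(binary, noise=0.01):
    """Refine each symbol X into X-1, X-2; for each refined parent, the rule's
    probability p is spread (noisily, then renormalized) over the 4 child variants."""
    refined = {}
    for (A, B, C), p in binary.items():
        for a in (1, 2):
            variants = {(f"{A}-{a}", f"{B}-{b}", f"{C}-{c}"):
                        (p / 4) * (1 + random.uniform(-noise, noise))
                        for b in (1, 2) for c in (1, 2)}
            z = sum(variants.values())
            refined.update({r: q * p / z for r, q in variants.items()})
    return refined

print(len(split_binary_rules({("S", "NP", "VP"): 0.9})))   # 8 refined rules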
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al. 06]
• Evaluate the loss in likelihood from removing each split:
    loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
Adaptive Splitting Results
Model             F1
Previous          88.4
With 50% Merging  89.5
Number of Phrasal Subcategories
Final Results
Parser                  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03     86.3             85.7
Matsuzaki et al. '05    86.7             86.1
Collins '99             88.6             88.2
Charniak & Johnson '05  90.1             89.6
Petrov et al. 06        90.2             89.7
Hierarchical Pruning
Parse multiple times with grammars at different levels of granularity:
coarse:         ... QP  NP  VP ...
split in two:   ... QP1 QP2 NP1 NP2 VP1 VP2 ...
split in four:  ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
split in eight: ... (and so on)
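A sketch of the pruning test itself; the posterior table and the refinements map are hypothetical stand-ins for whatever the coarse pass produces:

def allowed_fine_items(coarse_posterior, refinements, threshold=1e-4):
    """coarse_posterior: dict (begin, end, coarse_label) -> posterior probability.
    refinements: hypothetical map, e.g. "NP" -> ["NP-1", "NP-2"].
    Returns the set of fine chart items the next pass may build."""
    allowed = set()
    for (begin, end, X), post in coarse_posterior.items():
        if post >= threshold:                    # keep promising spans only
            for Xf in refinements[X]:
                allowed.add((begin, end, Xf))
    return allowed

The finer pass then refuses to build any (begin, end, label) outside this set, which is what drives the large speedups reported below.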
Bracket Posteriors
[Posterior heatmaps at successive refinement levels; parsing time drops from 1621 min to 111 min, 35 min, and finally 15 min at 91.2 F1 with no search error]
Extended CKY parsing
bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity
bull Empties can be incorporatedbull Use fenceposts
bull Doesnrsquot increase complexity essentially like unaries
bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there
A Recursive Parser
bestScore(Xijs)
if (j == i)
return q(X-gts[i])
else
return max q(X-gtYZ)
bestScore(Yiks)
bestScore(Zk+1js)
kX-gtYZ
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al 06]
• Evaluate the loss in likelihood from removing each split:
    loss(split) = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results

Parser                   F1 (<= 40 words)   F1 (all words)
Klein & Manning '03      86.3               85.7
Matsuzaki et al '05      86.7               86.1
Collins '99              88.6               88.2
Charniak & Johnson '05   90.1               89.6
Petrov et al 06          90.2               89.7
Hierarchical Pruning
Parse multiple times with grammars at different levels of granularity:
  coarse:         ... QP NP VP ...
  split in two:   ... QP1 QP2 NP1 NP2 VP1 VP2 ...
  split in four:  ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
  split in eight: ... (and so on)
Bracket Posteriors

Coarse-to-fine timing: 1621 min -> 111 min -> 35 min -> 15 min [91.2 F1] (no search error)
A Recursive Parser

bestScore(X,i,j,s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k, X->YZ of:
      q(X -> Y Z) * bestScore(Y,i,k,s) * bestScore(Z,k+1,j,s)
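Memoizing bestScore on (X, i, j) computes each state only once, which turns this top-down recursion into exactly the bottom-up chart-filling CKY routine below.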
The CKY algorithm (1960/1965) ... extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]

  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]

  // lexical step
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

  // binary step over increasing span lengths
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true

  return buildTree(score, back)
The grammar: binary, no epsilons

S -> NP VP     0.9        N -> people   0.5
S -> VP        0.1        N -> fish     0.2
VP -> V NP     0.5        N -> tanks    0.2
VP -> V        0.1        N -> rods     0.1
VP -> V VP_V   0.3        V -> people   0.1
VP -> V PP     0.1        V -> fish     0.6
VP_V -> NP PP  1.0        V -> tanks    0.3
NP -> NP NP    0.1        P -> with     1.0
NP -> NP PP    0.2
NP -> N        0.7
PP -> P NP     1.0
Chart cells score[begin][end] for "fish people fish tanks" (positions 0-4):

score[0][1]  score[0][2]  score[0][3]  score[0][4]
             score[1][2]  score[1][3]  score[1][4]
                          score[2][3]  score[2][4]
                                       score[3][4]

Lexical step, then unary closure (spans of length 1):
[0,1] fish:   N->fish 0.2, V->fish 0.6;    NP->N 0.14, VP->V 0.06, S->VP 0.006
[1,2] people: N->people 0.5, V->people 0.1; NP->N 0.35, VP->V 0.01, S->VP 0.001
[2,3] fish:   N->fish 0.2, V->fish 0.6;    NP->N 0.14, VP->V 0.06, S->VP 0.006
[3,4] tanks:  N->tanks 0.2, V->tanks 0.3;  NP->N 0.14, VP->V 0.03, S->VP 0.003

Binary step, then unary closure (spans of length 2):
[0,2] NP->NP NP 0.0049, VP->V NP 0.105, S->NP VP 0.00126; unary S->VP 0.0105 wins
[1,3] NP->NP NP 0.0049, VP->V NP 0.007, S->NP VP 0.0189
[2,4] NP->NP NP 0.00196, VP->V NP 0.042, S->NP VP 0.00378; unary S->VP 0.0042 wins

Spans of length 3:
[0,3] NP->NP NP 0.0000686, VP->V NP 0.00147, S->NP VP 0.000882
[1,4] NP->NP NP 0.0000686, VP->V NP 0.000098, S->NP VP 0.01323

Span of length 4:
[0,4] NP->NP NP 0.0000009604, VP->V NP 0.00002058, S->NP VP 0.00018522

Call buildTree(score, back) to get the best parse
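As a check on the arithmetic above, here is a minimal Python sketch of this CKY (with unary closure) for the toy grammar; backpointers are omitted, so it recovers only the probabilities, and the helper names are my own, not the slides':

from collections import defaultdict

unary = {("NP", "N"): 0.7, ("S", "VP"): 0.1, ("VP", "V"): 0.1}
binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
          ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
          ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def apply_unaries(cell):
    # Keep applying unary rules while some score improves (handles chains).
    added = True
    while added:
        added = False
        for (a, b), p in unary.items():
            if b in cell and p * cell[b] > cell.get(a, 0.0):
                cell[a] = p * cell[b]
                added = True

def cky(words):
    n = len(words)
    score = defaultdict(dict)            # score[(i, j)][A] = best probability
    for i, w in enumerate(words):        # lexical step, then unary closure
        for (a, word), p in lexicon.items():
            if word == w:
                score[(i, i + 1)][a] = p
        apply_unaries(score[(i, i + 1)])
    for span in range(2, n + 1):         # binary step over increasing spans
        for begin in range(0, n - span + 1):
            end = begin + span
            cell = score[(begin, end)]
            for split in range(begin + 1, end):
                left, right = score[(begin, split)], score[(split, end)]
                for (a, b, c), p in binary.items():
                    if b in left and c in right:
                        prob = p * left[b] * right[c]
                        if prob > cell.get(a, 0.0):
                            cell[a] = prob
            apply_unaries(cell)
    return score

chart = cky("fish people fish tanks".split())
print(chart[(0, 4)]["S"])   # ~0.00018522, matching the hand computation above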
Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets:     S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall:    3/8 = 37.5%
LP/LR F1:          40.0
Tagging Accuracy:  11/11 = 100.0%
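A small sketch of the bracket computation above, assuming brackets are represented as (label, start, end) triples:

def bracket_prf(gold, cand):
    # Labeled precision/recall over bracket sets, F1 as their harmonic mean.
    gold, cand = set(gold), set(cand)
    correct = len(gold & cand)
    p = correct / len(cand)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(bracket_prf(gold, cand))  # ~ (0.429, 0.375, 0.400)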
How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model
Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)

A Case of PP Attachment Ambiguity

A Case of Coordination Ambiguity

Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)
  • The two analyses receive the same probability
PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label

[Diagram: S -> NP VP and NP -> DT NN; the NP expands the same way wherever it occurs]
Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs objects)

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%        9%           23%
DT NN        9%        9%            7%
PRP          6%       21%            4%
Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong

[Example tree; caption: In the PTB, this construction is for possessives]
Advanced Unlexicalized Parsing
Horizontal Markovization
• Horizontal Markovization: merges states

[Charts: parsing F1 (about 70-74) and number of symbols (0-12,000) as a function of horizontal Markov order 0, 1, 2v, 2, inf]
Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation)

[Order 1 vs Order 2 tree examples; charts: parsing F1 (about 72-79) and number of symbols (0-25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]
Model     F1     Size
v=h=2v    77.8   7.5K
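To make the horizontal order concrete, here is a small illustrative sketch (my own, assuming the common "@parent->siblings" intermediate-symbol convention, which is not from the slides) of how binarized states merge when an intermediate symbol remembers only the last h siblings:

def binarize(parent, children, h=1):
    # Left-to-right binarization of parent -> children; each intermediate
    # symbol records at most the last h siblings already generated
    # (horizontal Markov order h), so states merge across and within rules.
    if len(children) <= 2:
        return [(parent, children)]
    rules, seen, current = [], [], parent
    for child in children[:-2]:
        seen.append(child)
        new = "@%s->%s" % (parent, "_".join(seen[-h:]) if h else "")
        rules.append((current, [child, new]))
        current = new
    rules.append((current, children[-2:]))
    return rules

print(binarize("NP", ["DT", "JJ", "JJ", "NN"], h=1))
# [('NP', ['DT', '@NP->DT']), ('@NP->DT', ['JJ', '@NP->JJ']),
#  ('@NP->JJ', ['JJ', 'NN'])]

With h = inf every distinct sibling history gets its own symbol; with h = 1 different histories ending in the same sibling collapse into one state, which is the merging behind the symbol-count charts above.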
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
Tag Splits
• Problem: Treebank tags are too coarse
• Example: sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
Other Tag Splits

Annotation   F1     Size   Description
UNARY-DT     80.4   8.1K   mark demonstratives as DT^U ("the X" vs "those")
UNARY-RB     80.5   8.1K   mark phrasal adverbs as RB^U ("quickly" vs "very")
TAG-PA       81.2   8.5K   mark tags with non-canonical parents ("not" is an RB^VP)
SPLIT-AUX    81.6   9.0K   mark auxiliary verbs with -AUX [cf. Charniak 97]
SPLIT-CC     81.7   9.1K   separate "but" and "&" from other conjunctions
SPLIT-%      81.8   9.3K   "%" gets its own tag
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3    9.7K
POSS-NP      83.1    9.8K
SPLIT-VP     85.7   10.5K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
function CKY(words grammar) returns [most_probable_parseprob]
score = new double[(words)+1][(words)+1][(nonterms)]
back = new Pair[(words)+1][(words)+1][nonterms]]
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if prob gt score[i][i+1][A]
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
The CKY algorithm (19601965)hellip extended to unaries
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
for span = 2 to (words)
for begin = 0 to (words)- span
end = begin + span
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
return buildTree(score back)
The CKY algorithm (19601965)hellip extended to unaries
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
The grammarBinary no epsilons
S NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks 02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
score[0][1]
score[1][2]
score[2][3]
score[3][4]
score[0][2]
score[1][3]
score[2][4]
score[0][3]
score[1][4]
score[0][4]
0
1
2
3
4
1 2 3 4fish people fish tanks
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower attachment sites: contains a verb; is (non-)recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs

Annotation F1 Size
Previous 85.7 105K
BASE-NP 86.0 117K
DOMINATES-V 86.9 141K
RIGHT-REC-NP 87.0 152K
[Figure: a PP-attachment configuration over NP, VP, and PP nodes, with attachment sites marked v / -v for "dominates a verb"]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads An Example for NPs
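The table itself is lost in this transcript; as a rough illustration of the kind of head-finding rule it describes, here is a simplified, Collins-style NP head rule in Python (the real tables have more categories and priorities than this sketch):

def np_head_index(child_labels):
    # child_labels: the labels of an NP's daughters, left to right
    if child_labels[-1] == "POS":                    # possessive: POS is the head
        return len(child_labels) - 1
    for i in range(len(child_labels) - 1, -1, -1):   # rightmost nominal tag
        if child_labels[i] in ("NN", "NNS", "NNP", "NNPS", "NX"):
            return i
    for i, label in enumerate(child_labels):         # else leftmost NP
        if label == "NP":
            return i
    return len(child_labels) - 1                     # default: rightmost child

# np_head_index(["DT", "JJ", "NN"]) -> 2, so "the big dog" is headed by "dog".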
Rules to Recover Heads An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
Lexicalized CKY
[Diagram: X[h] built over span (i, j) from Y[h] over (i, k) and Z[h'] over (k, j)]
Two equivalent ways to write a lexicalized rule: (VP → VBD[saw] NP[her]) or (VP → VBD NP)[saw]
bestScore(X, i, j, h)
  if (j == i)                        // single word: tag score
    return score(X, s[i])
  else
    return max over split points k and rules X → Y Z of:
      max over w:  score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over w:  score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
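Each rule contributes two cases because the head word h of X must come from exactly one child; the other child is free to pick its own head w. Since i, k, j, h, and w can each take O(n) values, the naive lexicalized dynamic program costs O(n⁵) time; the per-cell beams described below are the standard way to tame it.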
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n⁵) CKY, but remember only a few hypotheses for each span ⟨i, j⟩
• If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
• Keeps things more or less cubic (see the sketch below)
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
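A minimal sketch of the per-cell beam (illustrative Python, not the Collins parser's actual code): after a span's cell is filled, keep only its top-K entries, optionally also discarding anything far below the best score.

import heapq

def prune_cell(cell, K=50, ratio=1e-4):
    # cell: {(label, head): score} for one span <i, j>
    if not cell:
        return cell
    best = max(cell.values())
    top = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return {key: s for key, s in top if s >= best * ratio}

# With at most K survivors per child span and O(n) split points, building one
# span costs O(n * K^2) instead of enumerating every (head, rule) combination.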
Parameter Estimation
A Model from Charniak (1997)
Other Details
Final Test Set Results
Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Analysis/Evaluation (Method 2)
Dependency Accuracies
Strengths and Weaknesses of Modern Parsers
Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering
Manual Splits
• Manually split categories:
• NP: subject vs. object
• DT: determiners vs. demonstratives
• IN: sentential vs. prepositional
• Advantages: fairly compact grammar; linguistic motivations
• Disadvantages: performance leveled out; manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Figure: a tree over "He was right" with latent subcategory variables X1 ... X7 at its nodes]
• Can learn with EM, exactly like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)
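In inside-outside notation (the standard formulation, assuming inside scores β and outside scores α; this is not spelled out on the slide), the E-step gives each latent rule its posterior expected count:

$$\mathbb{E}\big[\mathrm{count}(A_x \rightarrow B_y\, C_z)\big] \;=\; \frac{1}{P(s)} \sum_{i<k<j} \alpha(A_x, i, j)\; P(A_x \rightarrow B_y C_z)\; \beta(B_y, i, k)\; \beta(C_z, k, j)$$

where x, y, z range over the hidden subcategories; the M-step renormalizes these counts into new rule probabilities.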
Automatic Annotation Induction
• Label all nodes with latent variables; use the same number k of subcategories for all categories
• Advantages: automatically learned
• Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit

Model F1
Klein & Manning '03 86.3
Matsuzaki et al '05 86.7
Refinement of the DT tag
[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]
• Hierarchical refinement: repeatedly learn more fine-grained subcategories
• Start with two subcategories per non-terminal, then keep splitting
• Initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al 06]
Adaptive Splitting
• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed
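Written out (standard split-merge notation; the symbols are ours, not the slide's):

$$\Delta(X_1, X_2) \;=\; \frac{P(\text{data} \mid \text{grammar with } X_1, X_2 \text{ merged})}{P(\text{data} \mid \text{grammar with the split kept})}$$

Splits whose Δ is closest to 1 buy the least likelihood and are the first to be rolled back.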
Adaptive Splitting Results
Model F1
Previous 88.4
With 50% Merging 89.5
Number of Phrasal Subcategories
Final Results
Parser  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03  86.3  85.7
Matsuzaki et al '05  86.7  86.1
Collins '99  88.6  88.2
Charniak & Johnson '05  90.1  89.6
Petrov et al 06  90.2  89.7
Hierarchical Pruning
coarse: ... QP NP VP ...
split in two: ... QP1 QP2 NP1 NP2 VP1 VP2 ...
split in four: ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
split in eight: ... and so on ...
• Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
[Figures: bracket posterior charts at each refinement stage; parsing time falls from 1621 min to 111 min to 35 min to 15 min, the last at 91.2 F1 with no search error]
[Diagram: the empty CKY chart for the 4-word sentence "fish people fish tanks", one cell score[i][j] for each span 0 ≤ i < j ≤ 4]
The example grammar (VP_V is the intermediate symbol from binarizing the ternary rule VP → V NP PP):
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
// lexicon step: fill the diagonal cells of the chart
for i = 0; i < #(words); i++
  for A in nonterms
    if A → words[i] in grammar
      score[i][i+1][A] = P(A → words[i])
After the lexicon step, the diagonal cells hold:
[0,1] fish: N 0.2, V 0.6
[1,2] people: N 0.5, V 0.1
[2,3] fish: N 0.2, V 0.6
[3,4] tanks: N 0.2, V 0.3

// handle unaries: close each diagonal cell under the unary rules
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A → B in grammar
      prob = P(A → B) * score[i][i+1][B]
      if prob > score[i][i+1][A]
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true

After the unary closure:
[0,1] fish: N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
[1,2] people: N 0.5, V 0.1, NP → N 0.35, VP → V 0.01, S → VP 0.001
[2,3] fish: N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
[3,4] tanks: N 0.2, V 0.3, NP → N 0.14, VP → V 0.03, S → VP 0.003
// binary rules over a span (begin, end) with a split point in between
prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
if prob > score[begin][end][A]
  score[begin][end][A] = prob
  back[begin][end][A] = new Triple(split, B, C)

The length-2 spans then fill in:
[0,2] fish people: NP → NP NP 0.0049, VP → V NP 0.105, S → NP VP 0.00126
[1,3] people fish: NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
[2,4] fish tanks: NP → NP NP 0.00196, VP → V NP 0.042, S → NP VP 0.00378
// re-run the unary closure after filling each span
boolean added = true
while added
  added = false
  for A, B in nonterms
    prob = P(A → B) * score[begin][end][B]
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = B
      added = true

Unaries improve two of the S entries (S → VP beats S → NP VP there):
[0,2]: S → VP 0.0105
[2,4]: S → VP 0.0042
([1,3] keeps S → NP VP 0.0189, since S → VP would only be 0.0007)
// general case: try every split point
for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

Length-3 spans:
[0,3] fish people fish: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
[1,4] people fish tanks: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323
And the full span:
[0,4] fish people fish tanks: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522
Call buildTree(score, back) to recover the best parse
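Putting the whole walkthrough together, here is a compact runnable Python version of the same algorithm and grammar (a sketch for reproducing the chart values above; it uses dictionaries where the slides use dense arrays):

from collections import defaultdict

binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
          ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
          ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def cky(words):
    n = len(words)
    score = defaultdict(dict)            # (i, j) -> {label: best probability}
    back = {}                            # backpointers for buildTree
    def close_unaries(i, j):             # the "handle unaries" loop
        added = True
        while added:
            added = False
            for (A, B), p in unary.items():
                prob = p * score[i, j].get(B, 0.0)
                if prob > score[i, j].get(A, 0.0):
                    score[i, j][A], back[i, j, A] = prob, B
                    added = True
    for i, w in enumerate(words):        # lexicon step on the diagonal
        for (A, word), p in lexicon.items():
            if word == w:
                score[i, i + 1][A] = p
        close_unaries(i, i + 1)
    for span in range(2, n + 1):         # larger spans, all split points
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = (score[begin, split].get(B, 0.0)
                            * score[split, end].get(C, 0.0) * p)
                    if prob > score[begin, end].get(A, 0.0):
                        score[begin, end][A] = prob
                        back[begin, end, A] = (split, B, C)
            close_unaries(begin, end)
    return score, back

score, back = cky("fish people fish tanks".split())
print(score[0, 4]["S"])                  # ~0.00018522, matching the chart above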
Evaluating constituency parsing
Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)
Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
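The same computation as a quick Python check (bracket sets copied from the example; the function name is ours):

def parseval(gold, cand):
    matched = len(gold & cand)           # a bracket must match label and span
    lp, lr = matched / len(cand), matched / len(gold)
    return lp, lr, 2 * lp * lr / (lp + lr)

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(parseval(gold, cand))              # (0.4286, 0.375, 0.4): 42.9 / 37.5 / 40.0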
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for i=0 ilt(words) i++
for A in nonterms
if A -gt words[i] in grammar
score[i][i+1][A] = P(A -gt words[i])
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
N fish 02V fish 06
N people 05V people 01
N fish 02V fish 06
N tanks 02V tanks 01
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
if score[i][i+1][B] gt 0 ampamp A-gtB in grammar
prob = P(A-gtB)score[i][i+1][B]
if(prob gt score[i][i+1][A])
score[i][i+1][A] = prob
back[i][i+1][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if (prob gt score[begin][end][A])
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S NP VP000126
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S NP VP000378
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
handle unaries
boolean added = true
while added
added = false
for A B in nonterms
prob = P(A-gtB)score[begin][end][B]
if prob gt score[begin][end][A]
score[begin][end][A] = prob
back[begin][end][A] = B
added = true
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: tree over "He was right." with latent subcategory variables X1 ... X7 at each node]

Can learn with EM, like Forward-Backward for HMMs (Forward = Outside, Backward = Inside).
Automatic Annotation Induction

• Label all nodes with latent variables; same number k of subcategories for all categories.
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7
Refinement of the DT tag

[Figure: DT refined into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories (see the sketch below).
• Start with two subcategories per non-terminal, then keep splitting.
• Initialize each EM run with the output of the last.
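A minimal Python sketch of the split step only, under an assumed rule representation ((lhs, rhs-tuple) -> probability); the EM re-estimation that runs between splits is deliberately omitted.

    import itertools
    import random

    def split_grammar(rules, nonterms, noise=0.01, rng=random.Random(0)):
        # Each nonterminal X becomes X_0 / X_1; every rule is copied to
        # all subcategory combinations with its mass divided evenly,
        # plus a little noise to break symmetry (otherwise EM would
        # keep the copies identical).
        def variants(sym):
            return (sym + "_0", sym + "_1") if sym in nonterms else (sym,)
        new_rules = {}
        for (lhs, rhs), p in rules.items():
            combos = list(itertools.product(*map(variants, rhs)))
            for new_lhs in variants(lhs):
                for new_rhs in combos:
                    jitter = 1.0 + noise * rng.uniform(-1.0, 1.0)
                    new_rules[new_lhs, new_rhs] = p / len(combos) * jitter
        return new_rules

    rules = {("NP", ("DT", "NN")): 0.5, ("NP", ("NN",)): 0.5,
             ("DT", ("the",)): 1.0, ("NN", ("fish",)): 1.0}
    refined = split_grammar(rules, nonterms={"NP", "DT", "NN"})
    # NP_0 now rewrites via all DT_i NN_j combinations; re-run EM,
    # then split again, initializing from this refined grammar.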
Adaptive Splitting [Petrov et al. 06]

• Want to split complex categories more.
• Idea: split everything, then roll back the splits which were least useful (a sketch of the rollback follows).
• Evaluate the loss in likelihood from removing each split:
    loss(split) = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed.
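A tiny sketch of that rollback computation, writing the loss equivalently as a log-likelihood difference; the split names and numbers here are made up for illustration.

    def split_loss(loglik_with_split, loglik_merged):
        # Loss from reversing one split: how much data log-likelihood
        # we give up by merging the two subcategories back together.
        return loglik_with_split - loglik_merged

    # Reverse the 50% of splits whose loss is smallest:
    losses = {"NP-1/NP-2": 3.1, "VP-1/VP-2": 1.4,
              "DT-1/DT-2": 0.02, ",-1/,-2": 0.001}
    to_merge = sorted(losses, key=losses.get)[:len(losses) // 2]
    print(to_merge)   # [',-1/,-2', 'DT-1/DT-2'] -- least useful splits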
Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5
Number of Phrasal Subcategories
Final Results

Parser                   F1 (<= 40 words)   F1 (all words)
Klein & Manning '03      86.3               85.7
Matsuzaki et al. '05     86.7               86.1
Collins '99              88.6               88.2
Charniak & Johnson '05   90.1               89.6
Petrov et al. '06        90.2               89.7
Hierarchical Pruning

Parse multiple times, with grammars at different levels of granularity, using each coarser pass to prune the next (a pruning sketch follows):

coarse:          ... QP  NP  VP ...
split in two:    ... QP1 QP2 NP1 NP2 VP1 VP2 ...
split in four:   ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
split in eight:  ... and so on ...
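A sketch of the pruning step this enables, under an assumed interface: posteriors holds bracket posteriors from the coarser pass, projection maps each refined symbol back to its coarse symbol, and the threshold is illustrative.

    THRESHOLD = 1e-4  # assumed pruning threshold, not a tuned value

    def allowed(begin, end, refined_label, posteriors, projection):
        # Explore a refined chart entry only if its coarse projection
        # had non-negligible posterior mass in the previous pass.
        coarse = projection[refined_label]
        return posteriors.get((begin, end, coarse), 0.0) > THRESHOLD

    projection = {"NP1": "NP", "NP2": "NP", "VP1": "VP", "VP2": "VP"}
    posteriors = {(0, 2, "NP"): 0.97, (0, 2, "VP"): 2e-6}
    print(allowed(0, 2, "NP1", posteriors, projection))  # True
    print(allowed(0, 2, "VP1", posteriors, projection))  # False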
Bracket Posteriors

[Figure: bracket posterior heat maps under successively refined grammars; parsing time drops from 1621 min to 111 min to 35 min to 15 min, at 91.2 F1 with no search error]
CKY in action: parsing "fish people fish tanks" (chart cells indexed by fenceposts 0-4 around the words)

Grammar (rule probabilities):
S  -> NP VP    0.9        N -> people  0.5
S  -> VP       0.1        N -> fish    0.2
VP -> V NP     0.5        N -> tanks   0.2
VP -> V        0.1        N -> rods    0.1
VP -> V VP_V   0.3        V -> people  0.1
VP -> V PP     0.1        V -> fish    0.6
VP_V -> NP PP  1.0        V -> tanks   0.3
NP -> NP NP    0.1        P -> with    1.0
NP -> NP PP    0.2
NP -> N        0.7
PP -> P NP     1.0

Lexical cells (after applying unaries):
(0,1) fish:   N -> fish 0.2,   V -> fish 0.6,   NP -> N 0.14, VP -> V 0.06, S -> VP 0.006
(1,2) people: N -> people 0.5, V -> people 0.1, NP -> N 0.35, VP -> V 0.01, S -> VP 0.001
(2,3) fish:   N -> fish 0.2,   V -> fish 0.6,   NP -> N 0.14, VP -> V 0.06, S -> VP 0.006
(3,4) tanks:  N -> tanks 0.2,  V -> tanks 0.1,  NP -> N 0.14, VP -> V 0.03, S -> VP 0.003

Handling unaries in each cell:

  // keep applying unary rules until no cell score improves
  boolean added = true
  while (added)
    added = false
    for (A, B) in nonterms
      prob = P(A -> B) * score[begin][end][B]
      if (prob > score[begin][end][A])
        score[begin][end][A] = prob
        back[begin][end][A] = B
        added = true

Binary rules over all split points:

  for split = begin+1 to end-1
    for (A, B, C) in nonterms
      prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
      if (prob > score[begin][end][A])
        score[begin][end][A] = prob
        back[begin][end][A] = new Triple(split, B, C)

Filling the chart span by span gives the best score per label (after unaries):
(0,2): NP -> NP NP 0.0049,        VP -> V NP 0.105,      S -> VP 0.0105
(1,3): NP -> NP NP 0.0049,        VP -> V NP 0.007,      S -> NP VP 0.0189
(2,4): NP -> NP NP 0.00196,       VP -> V NP 0.042,      S -> VP 0.0042
(0,3): NP -> NP NP 0.0000686,     VP -> V NP 0.00147,    S -> NP VP 0.000882
(1,4): NP -> NP NP 0.0000686,     VP -> V NP 0.000098,   S -> NP VP 0.01323
(0,4): NP -> NP NP 0.0000009604,  VP -> V NP 0.00002058, S -> NP VP 0.00018522

Call buildTree(score, back) to get the best parse.
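The loops above translate almost directly into code; here is a self-contained Python sketch over the toy grammar (backpointers and buildTree omitted for brevity), which reproduces the chart values shown.

    from collections import defaultdict

    binary = {                      # A -> B C : probability
        ("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
        ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
        ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
        ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0,
    }
    unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
    lexicon = {
        ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
        ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
        ("V", "tanks"): 0.3, ("P", "with"): 1.0,
    }

    def handle_unaries(cell):
        # Apply unary rules to closure, as in the loop above.
        added = True
        while added:
            added = False
            for (a, b), p in unary.items():
                if b in cell and p * cell[b] > cell.get(a, 0.0):
                    cell[a] = p * cell[b]
                    added = True

    def cky(words):
        n = len(words)
        score = defaultdict(dict)          # (begin, end) -> {label: prob}
        for i, w in enumerate(words):      # lexical cells, then unaries
            for (tag, word), p in lexicon.items():
                if word == w:
                    score[i, i + 1][tag] = p
            handle_unaries(score[i, i + 1])
        for span in range(2, n + 1):       # binary rules over all splits
            for begin in range(n - span + 1):
                end = begin + span
                cell = score[begin, end]
                for split in range(begin + 1, end):
                    for (a, b, c), p in binary.items():
                        if b in score[begin, split] and c in score[split, end]:
                            prob = score[begin, split][b] * score[split, end][c] * p
                            if prob > cell.get(a, 0.0):
                                cell[a] = prob
                handle_unaries(cell)
        return score

    chart = cky("fish people fish tanks".split())
    print(chart[0, 4]["S"])                # ~0.00018522, as in the (0,4) cell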
Evaluating constituency parsing

Gold standard brackets:
  S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets:
  S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision:  3/7 = 42.9%
Labeled Recall:     3/8 = 37.5%
LP/LR F1:           40.0%
Tagging Accuracy:   11/11 = 100.0%
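The computation on this slide, as a small Python check (brackets assumed to be (label, start, end) triples):

    def prf(gold, guess):
        # Labeled precision, recall, and F1 over bracket sets.
        matched = len(gold & guess)
        p, r = matched / len(guess), matched / len(gold)
        return p, r, 2 * p * r / (p + r)

    gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
            ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
    guess = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
             ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
    print(prf(gold, guess))   # (0.4286..., 0.375, 0.4), as on the slide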
How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1.
• Robust: usually admit everything, but with low probability.
• Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse, but not a very good idea, because the independence assumptions are too strong.
• Give a probabilistic language model, but in the simple case it performs worse than a trigram model.
• The problem seems to be that PCFGs lack the lexicalization of a trigram model.
Weaknesses of PCFGs

• Lack of sensitivity to structural frequencies.
• Lack of sensitivity to lexical information.
  • (A word is independent of the rest of the tree given its POS.)
A Case of PP Attachment Ambiguity

[Example trees shown on slides]

A Case of Coordination Ambiguity

[Example trees shown on slides]
Structural Preferences: Close Attachment

• Example: "John was believed to have been shot by Bill."
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label.

[Figure: S -> NP VP tree; the NP's expansion (e.g., NP -> DT NN) is independent of everything outside the NP]
Non-Independence I

• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%
Non-Independence II

• Symptoms of overly strong assumptions: rewrites get used where they don't belong.
• [Figure: an example rewrite misused by the PCFG; in the PTB, this construction is reserved for possessives]
Advanced Unlexicalized Parsing

Horizontal Markovization

• Horizontal Markovization merges states: rewrites remember only the last few sibling symbols already generated (as sketched below).

[Figure: parsing accuracy (roughly 70-74%) and number of grammar symbols (0-12,000) as a function of horizontal Markov order 0, 1, 2v, 2, inf]
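A minimal sketch of what "merging states" means operationally: n-ary rules are binarized, and the intermediate symbols remember only the last h siblings already generated, so histories that agree on those siblings collapse into one state. The symbol naming here is illustrative.

    def binarize(parent, children, h=1):
        # Left-to-right binarization with horizontal Markov order h.
        if len(children) <= 2:
            return [(parent, tuple(children))]
        rules, lhs = [], parent
        for i, child in enumerate(children[:-2]):
            kept = children[max(0, i + 1 - h):i + 1]   # last h siblings
            nxt = "@%s[...%s]" % (parent, "_".join(kept))
            rules.append((lhs, (child, nxt)))
            lhs = nxt
        rules.append((lhs, tuple(children[-2:])))
        return rules

    print(binarize("VP", ["V", "NP", "PP", "PP"], h=1))
    # [('VP', ('V', '@VP[...V]')),
    #  ('@VP[...V]', ('NP', '@VP[...NP]')),
    #  ('@VP[...NP]', ('PP', 'PP'))]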
Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation); a minimal sketch of the order-2 transform follows the figure note.

[Figure: Order 1 vs. Order 2 example trees; parsing accuracy (roughly 72-79%) and number of grammar symbols (0-25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]
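A minimal sketch of the order-2 (parent annotation) transform, on trees assumed to be (label, children) pairs with plain strings for words; whether to annotate the preterminals as well is exactly the TAG-PA choice from earlier.

    def parent_annotate(tree, parent="ROOT"):
        # Append each node's parent label, e.g. NP under S becomes NP^S.
        label, children = tree
        new_children = [child if isinstance(child, str)
                        else parent_annotate(child, label)
                        for child in children]
        return (label + "^" + parent, new_children)

    tree = ("S", [("NP", ["he"]),
                  ("VP", [("V", ["was"]), ("ADJP", ["right"])])])
    print(parent_annotate(tree))
    # ('S^ROOT', [('NP^S', ['he']),
    #             ('VP^S', [('V^VP', ['was']), ('ADJP^VP', ['right'])])])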
Model     F1     Size
v=h=2v    77.8   7.5K
Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
for split = begin+1 to end-1
for ABC in nonterms
prob=score[begin][split][B]score[split][end][C]P(A-gtBC)
if prob gt score[begin][end][A]
score[begin]end][A] = prob
back[begin][end][A] = new Triple(splitBC)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
N fish 02V fish 06NP N 014VP V 006S VP 0006
N people 05V people 01NP N 035VP V 001S VP 0001
N fish 02V fish 06NP N 014VP V 006S VP 0006
N tanks 02V tanks 01NP N 014VP V 003S VP 0003
NP NP NP00049
VP V NP0105
S VP00105
NP NP NP00049
VP V NP0007
S NP VP00189
NP NP NP000196
VP V NP0042
S VP00042
NP NP NP00000686
VP V NP000147
S NP VP0000882
NP NP NP00000686
VP V NP0000098
S NP VP001323
NP NP NP
00000009604VP V NP
000002058S NP VP
000018522
0
1
2
3
4
1 2 3 4fish people fish tanksS NP VP 09
S VP 01
VP V NP 05
VP V 01
VP V VP_V 03
VP V PP 01
VP_V NP PP 10
NP NP NP 01
NP NP PP 02
NP N 07
PP P NP 10
N people 05
N fish 02
N tanks02
N rods 01
V people 01
V fish 06
V tanks 03
P with 10
Call buildTree(score back) to get the best parse
Evaluating constituency parsing
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Evaluating constituency parsing
Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)
Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)
Labeled Precision 37 = 429
Labeled Recall 38 = 375
LPLR F1 400
Tagging Accuracy 1111 = 1000
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
How good are PCFGs
bull Penn WSJ parsing accuracy about 73 LPLR F1
bull Robust
bull Usually admit everything but with low probability
bull Partial solution for grammar ambiguity
bull A PCFG gives some idea of the plausibility of a parse
bull But not so good because the independence assumptions are
too strong
bull Give a probabilistic language model
bull But in the simple case it performs worse than a trigram model
bull The problem seems to be that PCFGs lack the
lexicalization of a trigram model
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Weaknesses of PCFGs
87
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Weaknesses
bull Lack of sensitivity to structural frequencies
bull Lack of sensitivity to lexical information
bull (A word is independent of the rest of the tree given its POS)
88
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
A Case of PP Attachment Ambiguity
89
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
90
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
A Case of Coordination Ambiguity
91
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
• Problem: treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
Other Tag Splits

Annotation                                                        F1     Size
UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")       80.4   8.1K
UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")     80.5   8.1K
TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)  81.2   8.5K
SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]       81.6   9.0K
SPLIT-CC: separate "but" and "&" from other conjunctions          81.7   9.1K
SPLIT-%: "%" gets its own tag                                     81.8   9.3K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples: possessive NPs; finite vs. infinite VPs; lexical heads.
• Solution: annotate future elements into nodes.

Annotation              F1     Size
Previous (tag splits)   82.3   9.7K
POSS-NP                 83.1   9.8K
SPLIT-VP                85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights.
• Solution: mark a property of higher or lower sites:
  • contains a verb
  • is (non-)recursive
  • base NPs [cf. Collins 99]
  • right-recursive NPs

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

[Figure: NP attachment into VP vs. PP, with sites marked v / -v]
A Fully Annotated Tree
Final Test Set Results
• Beats "first generation" lexicalized parsers.

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Lexicalised PCFGs
Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
Lexicalized CKY
[Diagram: X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j)]
Example: (VP → VBD[saw] NP[her]), i.e., (VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return max over split points k, other-child heads w, and rules X → Y Z of:
      score(X[h] → Y[h] Z[w]) · bestScore(Y, i, k, h) · bestScore(Z, k, j, w)
      score(X[h] → Y[w] Z[h]) · bestScore(Y, i, k, w) · bestScore(Z, k, j, h)
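A runnable sketch of this recurrence in Python, over log scores; the toy grammar, its format, and the sentence are invented for illustration (note the O(n^5) nesting over i, j, k, h, w):

import math
from collections import defaultdict

NEG_INF = float("-inf")

def lexicalized_cky(words, tag_scores, rules):
    # best[(X, i, j, h)] = best log score of an X over span [i, j) headed by words[h]
    n = len(words)
    best = defaultdict(lambda: NEG_INF)
    for i, w in enumerate(words):
        for tag, s in tag_scores.get(w, {}).items():
            best[(tag, i, i + 1, i)] = s
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                      # split point
                for (X, Y, Z, head_from_left), s in rules.items():
                    for h in range(i, k):                  # left child's head
                        for w2 in range(k, j):             # right child's head
                            l, r = best[(Y, i, k, h)], best[(Z, k, j, w2)]
                            if l == NEG_INF or r == NEG_INF:
                                continue
                            top = h if head_from_left else w2
                            cand = s + l + r
                            if cand > best[(X, i, j, top)]:
                                best[(X, i, j, top)] = cand
    return best

words = ["she", "saw", "her"]
tag_scores = {"she": {"NP": 0.0}, "saw": {"VBD": 0.0}, "her": {"NP": 0.0}}
rules = {("VP", "VBD", "NP", True): math.log(0.5),   # VP takes VBD's head ("saw")
         ("S", "NP", "VP", False): math.log(0.5)}    # S takes VP's head
best = lexicalized_cky(words, tag_scores, rules)
print(best[("S", 0, 3, 1)])  # best S over the whole sentence, headed by "saw"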
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n^5) CKY.
  • Remember only a few hypotheses for each span <i, j>.
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?).
  • Keeps things more or less cubic.
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
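A minimal sketch of per-cell beam pruning, assuming each chart cell holds (score, hypothesis) pairs; the names below are illustrative:

import heapq

def prune_cell(candidates, beam_size=5):
    # keep only the beam_size highest-scoring hypotheses for this span <i, j>
    return heapq.nlargest(beam_size, candidates, key=lambda pair: pair[0])

cell = [(-2.3, "NP[saw]"), (-1.1, "VP[saw]"), (-5.0, "S[saw]"), (-0.7, "VP[her]")]
print(prune_cell(cell, beam_size=2))   # only the two best survive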
Parameter Estimation

A Model from Charniak (1997)
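The bodies of the Charniak slides are figures; as a brief, hedged summary of the cited paper rather than of the slides themselves: Charniak (1997) scores a parse as a product of two conditional distributions per constituent. For a constituent of category c with head word h, parent category p, and parent head h_p, it first generates the head, P(h | c, p, h_p), and then the expansion given that head, P(c → α | c, p, h); both distributions are estimated from treebank counts with back-off smoothing.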
Other Details
Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
Analysis/Evaluation (Method 2)
Dependency Accuracies
Strengths and Weaknesses of Modern Parsers
Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
Latent annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Figure: tree over "He was right" with latent subcategory variables X1 … X7 at its nodes]
Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside).
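As a concrete sketch of the Inside half of that E-step on a single observed tree: with k = 2 latent subcategories per base category, the inside score of each node is a k-vector. The grammar tensors and the toy tree below are invented for illustration.

import numpy as np

k = 2
# rule_probs[(A, B, C)][x, y, z] = P(A_x -> B_y C_z)
rule_probs = {("S", "NP", "VP"): np.full((k, k, k), 1.0 / (k * k))}
# lex_probs[(A, word)][x] = P(A_x -> word)
lex_probs = {("NP", "He"): np.array([0.6, 0.4]),
             ("VP", "left"): np.array([0.5, 0.5])}

def inside(node):
    # a leaf looks like ("NP", "He"); an internal node like ("S", left, right)
    if isinstance(node[1], str):
        return lex_probs[node]
    label, left, right = node
    in_l, in_r = inside(left), inside(right)
    tensor = rule_probs[(label, left[0], right[0])]
    # IN(A_x) = sum over y, z of P(A_x -> B_y C_z) * IN(B_y) * IN(C_z)
    return np.einsum("xyz,y,z->x", tensor, in_l, in_r)

tree = ("S", ("NP", "He"), ("VP", "left"))
print(inside(tree))  # one inside score per latent subcategory of the root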
Automatic Annotation Induction
• Advantages:
  • Automatically learned: label all nodes with latent variables; the same number k of subcategories for all categories.
• Disadvantages:
  • Grammar gets too large.
  • Most categories are oversplit while others are undersplit.
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
92
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Structural Preferences Close Attachment
93
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Structural Preferences Close Attachment
bull Example John was believed to have been shot by Bill
bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability
94
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
PCFGs and Independence
bull The symbols in a PCFG define independence assumptions
bull At any node the material inside that node is independent of the material outside that node given the label of that node
bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label
NP
S
VP
S NP VP
NP DT NN
NP
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Non-Independence I
bull The independence assumptions of a PCFG are often too strong
bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)
119
6
NP PP DT NN PRP
9 9
21
NP PP DT NN PRP
74
23
NP PP DT NN PRP
All NPs NPs under S NPs under VP
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Non-Independence II
bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong
In the PTB this
construction is
for possessives
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Advanced Unlexicalized Parsing
99
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Horizontal Markovization
bull Horizontal Markovization Merges States
70
71
72
73
74
0 1 2v 2 inf
Horizontal Markov Order
0
3000
6000
9000
12000
0 1 2v 2 inf
Horizontal Markov Order
Sym
bo
ls
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
[Diagram: X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j)]
e.g., (VP -> VBD[saw] NP[her]) vs. (VP -> VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max of
      max over k, X -> Y Z, w:  score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, X -> Y Z, w:  score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
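The recurrence can be rendered directly as memoized Python. This is an illustrative toy (my grammar encoding, with a uniform stand-in rule score), not the parser from the slides; note the degrees of freedom (span, split k, head h, other head w) that make naive lexicalized CKY O(n^5).

    from functools import lru_cache

    # Toy lexicon and rules; a rule is (X, Y, Z, side), where `side` says
    # which child contributes the head word h (the other child's head is w).
    LEXICON = {("NP", "John"): 1.0, ("VBD", "saw"): 1.0, ("NP", "her"): 1.0}
    RULES = [("S", "NP", "VP", "right"), ("VP", "VBD", "NP", "left")]

    def rule_score(rule, head, dep):
        return 0.5  # stand-in for a learned score of X[head] -> Y Z with dependent dep

    def parse(sent, root="S"):
        n = len(sent)

        @lru_cache(maxsize=None)
        def best(X, i, j, h):
            if i == j:  # single-word span: the head must be that word
                return LEXICON.get((X, sent[i]), 0.0) if h == i else 0.0
            score = 0.0
            for rule in RULES:
                x, y, z, side = rule
                if x != X:
                    continue
                for k in range(i, j):          # split: Y over [i,k], Z over [k+1,j]
                    for w in range(i, j + 1):  # head of the non-head child
                        if side == "left":
                            s = (rule_score(rule, sent[h], sent[w]) *
                                 best(y, i, k, h) * best(z, k + 1, j, w))
                        else:
                            s = (rule_score(rule, sent[h], sent[w]) *
                                 best(y, i, k, w) * best(z, k + 1, j, h))
                        score = max(score, s)
            return score

        return max((best(root, 0, n - 1, h), sent[h]) for h in range(n))

    # parse(["John", "saw", "her"]) -> (0.25, "saw")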
Parsing with Lexicalized CFGs
Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i, j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic (a sketch follows below)
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
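A minimal sketch of the per-cell beam (my rendering, with a hypothetical chart layout): once a span's cell has been filled, keep only its top-K (category, head) entries.

    import heapq

    K = 10  # beam width per span

    def prune_cell(cell, k=K):
        # cell: dict mapping (category, head_index) -> score for one span <i, j>.
        # Keep only the k highest-scoring entries.
        return dict(heapq.nlargest(k, cell.items(), key=lambda kv: kv[1]))

    # Inside the CKY loop, after filling chart[i, j] from all rules and splits:
    #     chart[i, j] = prune_cell(chart[i, j])
    # With at most K items per child cell, each (rule, split point) combination
    # costs O(K^2), which is what keeps the whole parse roughly cubic in n.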
Parameter Estimation
A Model from Charniak (1997)
Other Details
Final Test Set Results

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6
Analysis / Evaluation (Method 2)
Dependency Accuracies
Strengths and Weaknesses of Modern Parsers
Modern Parsers
The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • fairly compact grammar
  • linguistic motivations
• Disadvantages:
  • performance leveled out
  • manually annotated
Learning Latent Annotations
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[Tree: latent subcategories X1 … X7 over the sentence "He was right"]
• Can learn with EM, like Forward-Backward for HMMs:
  Forward ↔ Outside, Backward ↔ Inside
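In inside-outside terms (standard notation, mine rather than the slides'), the EM step re-estimates each split rule from its expected count over the training trees. For a sentence w_1 … w_n:

    \begin{align*}
    \mathrm{IN}(A_x, i, j)  &= P(A_x \Rightarrow^* w_i \cdots w_j) \\
    \mathrm{OUT}(A_x, i, j) &= P(S \Rightarrow^* w_1 \cdots w_{i-1}\; A_x\; w_{j+1} \cdots w_n) \\
    \mathbb{E}[A_x \to B_y\, C_z] &= \frac{1}{P(w_1 \cdots w_n)} \sum_{1 \le i \le k < j \le n}
        P(A_x \to B_y C_z)\, \mathrm{OUT}(A_x, i, j)\,
        \mathrm{IN}(B_y, i, k)\, \mathrm{IN}(C_z, k{+}1, j)
    \end{align*}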
Automatic Annotation Induction
• Advantages:
  • automatically learned: label all nodes with latent variables, with the same number k of subcategories for all categories
• Disadvantages:
  • grammar gets too large
  • most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7
Refinement of the DT tag
[Diagram: DT splits into DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last
Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al. 06]
• Evaluate the loss in likelihood from removing each split:

    loss(split) = (data likelihood with split reversed) / (data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed
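Schematically, split-merge training alternates these steps. The helpers below (split_all, em_train, merge_least_useful) are hypothetical stand-ins for the real routines, so this is a skeleton of the procedure under my assumptions, not Petrov et al.'s implementation.

    def split_merge_training(grammar, treebank, rounds=6, merge_fraction=0.5):
        # split_all doubles every subcategory; em_train fits latent rule
        # probabilities with inside-outside EM; merge_least_useful reverses
        # the splits whose removal costs the least data likelihood
        # (the ratio above). All three are hypothetical helpers.
        for _ in range(rounds):
            grammar = split_all(grammar)
            grammar = em_train(grammar, treebank)
            grammar = merge_least_useful(grammar, treebank, merge_fraction)
            grammar = em_train(grammar, treebank)  # re-fit after merging
        return grammar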
Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories
[Chart: number of learned subcategories for each phrasal category]
Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al. '05     86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al. '06        90.2              89.7
Hierarchical Pruning
coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … (and so on) …
• Parse multiple times with grammars at different levels of granularity (sketched below)
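One way to realize this, as a hedged sketch: each coarser pass computes bracket posteriors that prune the chart for the next, finer grammar. Here parse_posteriors and viterbi_parse are hypothetical helpers, and the projection from a fine symbol to its coarser ancestor is glossed over.

    THRESHOLD = 1e-4  # prune a (span, symbol) whose posterior falls below this

    def coarse_to_fine(sentence, grammars):
        # grammars: coarsest to finest, e.g. [coarse, split-in-2, split-in-4, ...]
        allowed = None  # no constraints for the first (coarsest) pass
        for g in grammars[:-1]:
            posteriors = parse_posteriors(g, sentence, allowed)
            allowed = {span_sym for span_sym, p in posteriors.items()
                       if p >= THRESHOLD}
        return viterbi_parse(grammars[-1], sentence, allowed)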
Bracket Posteriors
[Heat-maps of bracket posteriors after successive coarse-to-fine passes]
Parsing time: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error)
Vertical Markovization
bull Vertical Markov order rewrites depend on past k ancestor nodes
(ie parent annotation)
Order 1 Order 2
7273747576777879
1 2v 2 3v 3
Vertical Markov Order
0
5000
10000
15000
20000
25000
1 2v 2 3v 3
Vertical Markov Order
Sym
bo
ls
Model F1 Size
v=h=2v 778 75K
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Unary Splits
bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used
Annotation F1 Size
Base 778 75K
UNARY 783 80K
Solution Mark unary rewrite sites with -U
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Tag Splits
bull Problem Treebank tags are too coarse
bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN
bull Partial Solutionbull Subdivide the IN tag
Annotation F1 Size
Previous 783 80K
SPLIT-IN 803 81K
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Other Tag Splits
bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)
bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)
bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)
bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]
bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions
bull SPLIT- ldquordquo gets its own tag
F1 Size
804 81K
805 81K
812 85K
816 90K
817 91K
818 93K
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Yield Splits
bull Problem sometimes the behavior of a category depends on something inside its future yield
bull Examplesbull Possessive NPs
bull Finite vs infinite VPs
bull Lexical heads
bull Solution annotate future elements into nodes
Annotation F1 Size
tag splits 823 97K
POSS-NP 831 98K
SPLIT-VP 857 105K
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Distance Recursion Splits
bull Problem vanilla PCFGs cannot distinguish attachment heights
bull Solution mark a property of higher or lower sitesbull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs
Annotation F1 Size
Previous 857 105K
BASE-NP 860 117K
DOMINATES-V 869 141K
RIGHT-REC-NP 870 152K
NP
VP
PP
NP
v
-v
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
A Fully Annotated Tree
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Final Test Set Results
bull Beats ldquofirst generationrdquo lexicalized parsers
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Lexicalised PCFGs
109
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Heads in Comtext Free Rules
110
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Heads
111
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
bull Manually split categoriesbull NP subject vs object
bull DT determiners vs demonstratives
bull IN sentential vs prepositional
bull Advantagesbull Fairly compact grammar
bull Linguistic motivations
bull Disadvantagesbull Performance leveled out
bull Manually annotated
ForwardOutside
Learning Latent AnnotationsLatent Annotations
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories
X1
X2 X7X4
X5 X6X3
He was right
Can learn with EM like Forward-Backward for HMMs BackwardInside
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Rules to Recover Heads An Example for NPs
112
Rules to Recover Heads An Example for VPs
113
Adding Headwords to Trees
114
Adding Headwords to Trees
115
Lexicalized CFGs in Chomsky Normal Form
116
Example
117
Lexicalized CKY
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
(VP-gt VBD[saw] NP[her])
(VP-gtVBDNP)[saw]
bestScore(Xijh)
if (j = i)
return score(Xs[i])
else
return
max max score(X[h]-gtY[h]Z[w])
bestScore(Yikh)
bestScore(Zkjw)
max score(X[h]-gtY[w]Z[h])
bestScore(Yikw)
bestScore(Zkjh)
kh
X-gtYZ
kh
X-gtYZ
Parsing with Lexicalized CFGs
119
Pruning with Beams
bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY
bull Remember only a few hypotheses for each span ltijgt
bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)
bull Keeps things more or less cubic
bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
Y[h] Z[hrsquo]
X[h]
i h k hrsquo j
Parameter Estimation
121
A Model from Charniak (1997)
122
A Model from Charniak (1997)
123
Other Details
124
Final Test Set Results
Parser LP LR F1
Magerman 95 849 846 847
Collins 96 863 858 860
Klein amp Manning 03 869 857 863
Charniak 97 874 875 874
Collins 99 887 886 886
AnalysisEvaluation (Method 2)
126
Dependency Accuracies
127
Strengths and Weaknesses of Modern Parsers
128
Modern Parsers
129
Annotation refines base treebank symbols to
improve statistical fit of the grammar
Parent annotation [Johnson rsquo98]
Head lexicalization [Collins rsquo99 Charniak rsquo00]
Automatic clustering
The Game of Designing a Grammar
Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated
Learning Latent Annotations
Latent Annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
(example tree: latent symbols X1 … X7 over "He was right")
Can learn with EM, like Forward-Backward for HMMs (Forward ≈ Outside, Backward ≈ Inside); a sketch follows
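A high-level sketch of that EM loop, with all helper names assumed rather than taken from any particular system. Because the treebank fixes each tree's bracketing, the E-step only sums over latent subcategory assignments at the known nodes, just as Forward-Backward sums over the hidden states of an HMM:

    def em_latent_annotations(treebank, grammar, iterations=20):
        for _ in range(iterations):
            counts = zero_counts(grammar)                        # assumed helper
            for tree in treebank:
                inside = inside_scores(tree, grammar)            # ~ Backward
                outside = outside_scores(tree, grammar, inside)  # ~ Forward
                add_expected_rule_counts(counts, tree, inside, outside)
            grammar = normalize(counts)                          # M-step
        return grammar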
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Automatic Annotation Induction
bull Advantages
bull Automatically learned
Label all nodes with latent variables
Same number k of subcategoriesfor all categories
bull Disadvantages
bull Grammar gets too large
bull Most categories are oversplit while others are undersplit
Model F1
Klein amp Manning rsquo03 863
Matsuzaki et al rsquo05 867
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Refinement of the DT tag
DT
DT-1 DT-2 DT-3 DT-4
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Hierarchical refinement Repeatedly learn more fine-grained subcategories
start with two (per non-terminal) then keep splitting
initialize each EM run with the output of the last
DT
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Adaptive Splitting
Want to split complex categories more
Idea split everything roll back splits which were
least useful
[Petrov et al 06]
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Adaptive Splitting
bull Evaluate loss in likelihood from removing each split =
Data likelihood with split reversed
Data likelihood with split
bull No loss in accuracy when 50 of the splits are reversed
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Adaptive Splitting Results
Model F1
Previous 884
With 50 Merging 895
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Number of Phrasal Subcategories
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Final Results
F1
le 40 words
F1
all wordsParser
Klein amp Manning rsquo03 863 857
Matsuzaki et al rsquo05 867 861
Collins rsquo99 886 882
Charniak amp Johnson rsquo05 901 896
Petrov et al 06 902 897
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Hierarchical Pruning
hellip QP NP VP hellipcoarse
split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip
hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four
split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip
Parse multiple times with grammars at different levels of granularity
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
Bracket Posteriors
1621 min
111 min
35 min
15 min [912 F1](no search error)
1621 min
111 min
35 min
15 min [912 F1](no search error)