Why parsing?
- As linguistic inquiry
  - What do we need to put in our model to do a good job?
  - Early on, lots of connections between syntactic theorists and comp ling parsing people
  - Now not so much
- To add long-distance dependencies
  - Correct some errors in tagging, LMs
- To aid information extraction, translation
  - Via compositional semantics
  - An active area; results not yet amazing
- To model writing style, information structure, processing
  - Parser as model of processing
Parsing: basics of representation
- Parsing: find syntactic structure for input sentence
- What kind of structure?
- Most popular answer given by the Penn Treebank project
  - Marcus, Santorini and Marcinkiewicz '93
  - Many person-hours of grad student labor
- The PTB uses a phrase structure grammar...
  - Loosely based on Government and Binding theory
  - Plus some null elements (traces) for movement
- Many researchers use this formalism by default
  - Better than making your own treebank!
To the degree that most grammatical formalisms tend to capture the same regularities this can still be a successful strategy even if no one formalism is widely preferred over the rest. –Charniak
What PTB trees look like
[Tree diagram of the PTB parse of "Ms. Haag plays Elianti."]

( (S (NP-SBJ (NNP Ms.) (NNP Haag) )
     (VP (VBZ plays)
         (NP (NNP Elianti) ))
     (. .) ))
Parsing the PTB
- Most of the influential statistical parsing projects don't do traces or movement
  - Algorithms that actually move things are horribly inefficient
  - Real problem (still open) is to get long-distance dependencies
- Nor do they do the function annotations (like -SBJ)
  - Although these can be done fairly easily
- Without these, the grammar is context-free

This lecture: context-free grammar parsing and some important ingredients for a good parser
Review: CFGs
- Produce a string of terminals
  - By expanding nonterminal symbols
  - Using rewriting rules

S  → NP VP
NP → DT NN
NP → NNP
DT → the
NN → cat
NN → mat
VP → V
VP → V PP
V  → sat
PP → IN NP
IN → on
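To make the rewriting idea concrete, here is a minimal sketch of this toy grammar as a Python dict of rewriting rules, with a naive top-down generator (the NP → NNP rule is left out because the toy grammar never rewrites NNP to a word):

# Toy CFG above as a dict: nonterminal -> list of possible right-hand sides.
import random

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DT", "NN"]],
    "DT": [["the"]],
    "NN": [["cat"], ["mat"]],
    "VP": [["V"], ["V", "PP"]],
    "V":  [["sat"]],
    "PP": [["IN", "NP"]],
    "IN": [["on"]],
}

def generate(symbol="S"):
    """Produce a string of terminals by expanding nonterminals top-down."""
    if symbol not in GRAMMAR:                 # terminal symbol: emit the word
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])      # pick one rewriting rule
    return [word for child in rhs for word in generate(child)]

print(" ".join(generate()))                   # e.g. "the cat sat on the mat"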
Bottom-up Recognizer for CFG
Relies on a data structure called the chart
- Like the trellis for HMMs
[Empty chart for "the cat sat on the mat": rows indexed by constituent length (1-6), columns by beginning position (@1-@6)]
Step 1: terminal rules
[Chart after the terminal rules: the length-1 cells above "the cat sat on the mat" hold D, N, V/VP, P, D, N]
Next step
[Chart as before, plus NP of length 2 beginning at 1]

NP (length 2) from 1-2, because NP → D N and D from 1-1 and N from 2-2
A few steps forward
[Chart a few steps later: NP at 1-2 and at 5-6, and S of length 3 beginning at 1]

S (length 3) from 1-3, because S → NP VP and NP from 1-2 and VP from 3-3
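The chart-filling steps above can be sketched in a few lines of Python, assuming cells are keyed by (beginning, length) and hold the set of nonterminals covering that span; unary rules (here just VP → V) are closed over after each cell is filled:

from collections import defaultdict

LEXICON = {"the": {"DT"}, "cat": {"NN"}, "mat": {"NN"}, "sat": {"V"}, "on": {"IN"}}
UNARY   = {"V": {"VP"}}                       # VP -> V
BINARY  = {("DT", "NN"): {"NP"},              # NP -> DT NN
           ("NP", "VP"): {"S"},               # S  -> NP VP
           ("V",  "PP"): {"VP"},              # VP -> V PP
           ("IN", "NP"): {"PP"}}              # PP -> IN NP

def close_unary(cell):
    """Add parents of unary rules until nothing changes."""
    changed = True
    while changed:
        changed = False
        for child in list(cell):
            for parent in UNARY.get(child, ()):
                if parent not in cell:
                    cell.add(parent)
                    changed = True

def recognize(words):
    chart = defaultdict(set)
    for i, w in enumerate(words, start=1):    # step 1: terminal rules
        chart[i, 1] |= LEXICON.get(w, set())
        close_unary(chart[i, 1])
    for length in range(2, len(words) + 1):   # build longer spans from shorter ones
        for i in range(1, len(words) - length + 2):
            for j in range(1, length):        # split the span into (i, j) and (i+j, length-j)
                for b in chart[i, j]:
                    for c in chart[i + j, length - j]:
                        chart[i, length] |= BINARY.get((b, c), set())
            close_unary(chart[i, length])
    return "S" in chart[1, len(words)]

print(recognize("the cat sat on the mat".split()))   # True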
Probabilistic CFGs (PCFGs)
Each rule gets a probability P(right side | left side)
- Since these are conditional, they sum to 1 for each LHS

S  → NP VP   1
NP → DT NN   .9
NP → NNP     .1
DT → the     1
NN → cat     .5
NN → mat     .5
VP → V       .4
VP → V PP    .6
V  → sat     1
PP → IN NP   1
IN → on      1
The model

A tree T is made by repeatedly applying rules:
- The probability of a tree T and string W is the product of all the rule probs
- These include rules that insert nonterminals
- And rules that insert words

    P(T, W_1:n) = ∏_{A→B C ∈ T} P(B C | A) · ∏_{A→W_i ∈ T} P(W_i | A)
[Tree: (S (NP (D the) (N cat)) (VP (V sat)))]

P(NP VP|S) · P(D N|NP) · P(the|D) · P(cat|N) · P(V|VP) · P(sat|V)
= 1 × .9 × 1 × .5 × .4 × 1 = .18
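As a quick check of the arithmetic, a sketch that multiplies the rule probabilities used in the tree above (the tree abbreviates DT, NN as D, N):

# Rules used in the tree for "the cat sat", with their probs from the table above.
rules_used = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("D", "N")):   0.9,
    ("D",  ("the",)):     1.0,
    ("N",  ("cat",)):     0.5,
    ("VP", ("V",)):       0.4,
    ("V",  ("sat",)):     1.0,
}

p = 1.0
for prob in rules_used.values():
    p *= prob
print(p)   # ≈ 0.18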
CKY algorithm with numbers
[Chart after the terminal rules, now with probabilities over "the cat sat on the mat": e.g. D: 1 for "the", N: .5 for "cat", V: 1 and VP: .4 for "sat"]
Each word's cell gets P(word | NT) for each preterminal NT that can produce it (e.g. .5 for "cat" under N, since P(cat | N) = .5)
CKY algorithm with numbers II
[Chart as before, plus NP: .45 spanning positions 1-2]
When we apply a rule, we multiply the rule prob and the probs of the component parts from the chart
- P(NP → D N) = .9, P(D at 1-1) = 1 and P(N at 2-2) = .5, so NP at 1-2 gets .9 × 1 × .5 = .45
More formally
CKY: the Cocke-Kasami-Younger algorithm

A max-product algorithm for PCFGs
- Is a dynamic program
  - Quite similar to Viterbi for HMMs
- Works for grammars with binary rules
  - Discuss unary and n-ary rules in a second
- Also has a sum-product version
  - Which computes P(W_1:n)
Definition

We operate by computing µ_A(i, l), the best way to parse a nonterminal of type A and length l starting at i:

    µ_A(i, l) = max_T P(W_i:i+l, T, root of T = A)

- To do so, we must apply a rule to make A from some things B, C
- These things must cover the span (i, l)
  - We divide the span into (i, j) and (i+j, l−j)
  - For instance, span (1, 4) can be split into (1, 2) and (3, 2), or (1, 3) and (4, 1)
- We have to search over potential j

    µ_A(i, l) = max_{0<j<l, A→B C} P(A → B C) · µ_B(i, j) · µ_C(i+j, l−j)
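A sketch of this recursion in code, using the same (beginning, length) indexing; it keeps only the µ scores (no backpointers, so it returns the best probability rather than the tree), and the unary VP → V is folded into the lexical entry for "sat" so that only binary rules remain, matching the chart above:

from collections import defaultdict

# terminal probs P(word|tag); VP/sat = P(V|VP)·P(sat|V) = .4
LEX = {("DT", "the"): 1.0, ("NN", "cat"): 0.5, ("NN", "mat"): 0.5,
       ("V", "sat"): 1.0, ("VP", "sat"): 0.4, ("IN", "on"): 1.0}
# binary rule probs P(B C|A), keyed (A, B, C)
BIN = {("NP", "DT", "NN"): 0.9, ("S", "NP", "VP"): 1.0,
       ("VP", "V", "PP"): 0.6, ("PP", "IN", "NP"): 1.0}

def cky(words):
    n = len(words)
    mu = defaultdict(float)                       # mu[A, i, l], default 0
    for i, w in enumerate(words, start=1):        # terminal rules
        for (tag, word), p in LEX.items():
            if word == w:
                mu[tag, i, 1] = max(mu[tag, i, 1], p)
    for l in range(2, n + 1):                     # span length
        for i in range(1, n - l + 2):             # beginning position
            for j in range(1, l):                 # split point
                for (a, b, c), p in BIN.items():
                    score = p * mu[b, i, j] * mu[c, i + j, l - j]
                    mu[a, i, l] = max(mu[a, i, l], score)
    return mu["S", 1, n]                          # best joint prob P(W_1:n, T)

print(cky("the cat sat on the mat".split()))      # ≈ 0.1215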
The solution
As with HMMs, notice that the last chart entry is the answer:
    µ_S(1, n) = max_T P(W_1:n, T, root of T = S)

Joint probability of the sentence and the best tree rooted at S
- The conditional is proportional to the joint
  - By a factor of P(W_1:n)
Efficiency
For an n-word sentence and G grammar rules, CKY runs in O(G·n³) time
- The chart is (less than) n² in size
- To fill in cell i-k, search for split point j
  - No more than n places to look
  - Must look once per each rule
Binarizing the grammar

CKY searches for a single split point:
- So we need the grammar to be binary
  - Each rule has (one or) two children
- Unary (one-child) rules also cause problems, but see Charniak for these
- There is an equivalent binary grammar for any CFG
  - Chomsky Normal Form
[Flat PTB tree: (NP (DT an) (NNP Oct.) (CD 19) (NN review))]

[Binarized: (NP (DT an) (NNP-CD-NN (NNP Oct.) (CD-NN (CD 19) (NN review))))]
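A sketch of this right-binarization, assuming trees are (label, children) tuples with strings as leaf words; each intermediate node is named after the labels it still dominates, as in the figure:

def binarize(tree):
    """Right-binarize an n-ary tree; intermediate nodes are named
    after the children they cover (e.g. NNP-CD-NN, CD-NN above)."""
    if isinstance(tree, str):                      # leaf word
        return tree
    label, children = tree
    children = [binarize(c) for c in children]
    while len(children) > 2:                       # fold the last two children
        tail = children[-2:]
        tail_label = "-".join(c[0] for c in tail)
        children = children[:-2] + [(tail_label, tail)]
    return (label, children)

flat = ("NP", [("DT", ["an"]), ("NNP", ["Oct."]), ("CD", ["19"]), ("NN", ["review"])])
print(binarize(flat))
# ('NP', [('DT', ['an']),
#         ('NNP-CD-NN', [('NNP', ['Oct.']),
#                        ('CD-NN', [('CD', ['19']), ('NN', ['review'])])])])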
Building a lousy parser
We now know how to build a parser:
- Take the Penn Treebank, convert to Chomsky Normal (binary) form
- Estimate rule probs by smoothed max likelihood (sketched in code below)

    P̂(A → B C) = (#(A → B C) + λ) / (#(A → •) + |B,C|·λ)

- Parse with CKY

This parser isn't very good
- 70% F-score... modern parsers >90%

Why not?
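A sketch of the add-λ estimate above, assuming rule counts have already been read off the binarized treebank; num_rhs stands in for |B,C| (the number of possible right-hand sides) and is a made-up value here:

from collections import defaultdict

def rule_probs(rule_counts, lam=0.01, num_rhs=1000):
    """P-hat(A -> B C) = (#(A -> B C) + lam) / (#(A -> anything) + num_rhs * lam)."""
    lhs_totals = defaultdict(float)
    for (lhs, rhs), c in rule_counts.items():
        lhs_totals[lhs] += c
    return {(lhs, rhs): (c + lam) / (lhs_totals[lhs] + num_rhs * lam)
            for (lhs, rhs), c in rule_counts.items()}

counts = {("NP", ("DT", "NN")): 900, ("NP", ("NNP",)): 100}   # toy counts
print(rule_probs(counts))   # ≈ .89 and .099: a little mass is shaved off for unseen rules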
Stupid parser assumptions
This parser makes stupid independence assumptions:
- Decision to produce NT independent of expansion of NT
  - E.g., "expects [S sales to increase]" (non-finite clause)
  - vs "hopes [S sales will increase]" (finite clause)
  - Or "a [NN chair]" (count)
  - vs "*a [NN pork]" (mass)
- No subcategories, feature checking, etc.

On the other hand, it's too restrictive:
- Probs of "[ADJP very good]", "[ADJP very very good]", "[ADJP very very very good]" unrelated
  - Depend on very specific training counts
- No concept of adjunction, productive processes, etc.
Making the grammar more specific

Capture subcategorization effects by annotating heads and parent categories:
- (S (NP (DT the) (JJ recent) (NN sale)) ...)
- (NP^S-NN (DT the) (JJ recent) (NN sale))
- "Vertical Markovization"
[Tree: (S (NP^S-NN (DT the) (JJ recent) (NN sale)) ...)]
- NP^S-NNP (subject proper NP)
- vs NP^PP-PRO (pronoun in prep. phrase)
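A sketch of the annotation, assuming (label, children) tuple trees and a toy head rule (rightmost child whose tag starts with N); real parsers use hand-written head-percolation tables:

def head_tag(children):
    for label, _ in reversed(children):
        if label.startswith("N"):
            return label
    return children[-1][0]

def annotate(tree, parent="TOP"):
    """Relabel each phrasal node as LABEL^PARENT-HEADTAG
    ("vertical Markovization" plus head-tag annotation)."""
    label, children = tree
    if isinstance(children[0], str):                  # preterminal: leave as is
        return (label, children)
    new_label = f"{label}^{parent}-{head_tag(children)}"
    return (new_label, [annotate(c, parent=label) for c in children])

t = ("S", [("NP", [("DT", ["the"]), ("JJ", ["recent"]), ("NN", ["sale"])])])
print(annotate(t))
# ('S^TOP-NP', [('NP^S-NN', [('DT', ['the']), ('JJ', ['recent']), ('NN', ['sale'])])])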
Making the grammar less specific

Use X-bar-like structure to allow adjunction:
- "Horizontal Markovization"

[Binarized X-bar-like tree for "the recent sale of shares": NP-NNS expanded through intermediate symbols @N-+DT-NNS, @N-+JJ-NNS, @N-NNS+PP and @N-NNS, with leaves (DT the), (JJ recent), (NNS sale) and (PP-IN of shares)]
- Any @N symbol can produce adjuncts
- The "+CONTEXT" version represents modifier orders
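A sketch of horizontal Markovization during binarization: the intermediate symbol remembers only the parent category and the most recent child, so any such symbol can keep producing adjuncts. The @NP[...] notation here is a simplification of the slide's @N-...+CONTEXT symbols (in particular, the head tag is not carried along):

def markovize(label, children, base=None):
    """Binarize left to right; intermediate symbols @BASE[prev]
    record only the base category and the previous child."""
    base = base or label
    if len(children) <= 2:
        return (label, children)
    first, rest = children[0], children[1:]
    return (label, [first, markovize(f"@{base}[{first[0]}]", rest, base=base)])

def transform(tree):
    if isinstance(tree, str) or isinstance(tree[1][0], str):
        return tree                                   # word or preterminal
    label, children = tree
    return markovize(label, [transform(c) for c in children])

# the PP is left flat here just to keep the example small
flat = ("NP", [("DT", ["the"]), ("JJ", ["recent"]), ("NNS", ["sales"]), ("PP", ["of", "shares"])])
print(transform(flat))
# ('NP', [('DT', ['the']),
#         ('@NP[DT]', [('JJ', ['recent']),
#                      ('@NP[JJ]', [('NNS', ['sales']), ('PP', ['of', 'shares'])])])])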
Lexicalized grammars

Sometimes useful to incorporate words into the grammar rules directly:
- Verbs subcategorize for prepositions
- Collocations
- Crude semantics

For example, might have:
[Tree: (VP-VB-deliver (VB deliver) (PP-TO-to (TO to) (NP-NN-store the store)))]
Uses grammar rule VP-VB-deliver → VB PP-TO-to
- Checks if "deliver" takes "to" as preposition
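A sketch of the lexicalization step on (label, children) trees, with a toy head rule (first child whose tag starts with V or TO, else the rightmost child) standing in for the head tables real lexicalized parsers use:

def lexicalize(tree):
    """Relabel each phrasal node as LABEL-HEADTAG-headword and return
    the new tree together with its (head tag, head word)."""
    label, children = tree
    if isinstance(children[0], str):                    # preterminal
        return (label, children), (label, children[0])
    done = [lexicalize(c) for c in children]
    new_children = [t for t, _ in done]
    heads = [h for _, h in done]
    head = next((h for h in heads if h[0].startswith(("V", "TO"))), heads[-1])
    return (f"{label}-{head[0]}-{head[1]}", new_children), head

vp = ("VP", [("VB", ["deliver"]),
             ("PP", [("TO", ["to"]), ("NP", [("DT", ["the"]), ("NN", ["store"])])])])
print(lexicalize(vp)[0][0])   # 'VP-VB-deliver', with children VB and PP-TO-to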
Smoothing lexicalized parsers
Lexicalized parsers have very large grammars:
- Grammar is never written out explicitly
- Rules are made up when needed
- Probability of a rule is hard to estimate
  - Rule like "VP-deliver → VB PP-to" is rare

Typical approach: interpolated estimation (like LM)
- Use the PCFG without words as backoff

    P̂(VP-VB-deliver → VB PP-TO-to) ∝ #(VP-VB-deliver → VB PP-TO-to) + β·P(VP → VB PP)

Exact backoff strategy depends on which parser you look at
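A sketch of the interpolation itself, with made-up counts and backoff probabilities; normalization and the full backoff chain are left out since, as noted, they vary across parsers:

def interpolated_score(lex_rule, unlex_rule, lex_counts, backoff_probs, beta=5.0):
    """score(lexicalized rule) proportional to #(lexicalized rule) + beta * P(unlexicalized rule)."""
    return lex_counts.get(lex_rule, 0) + beta * backoff_probs.get(unlex_rule, 0.0)

lex_counts    = {("VP-VB-deliver", ("VB", "PP-TO-to")): 2}    # rare lexicalized rule
backoff_probs = {("VP", ("VB", "PP")): 0.3}                   # word-free PCFG backoff
print(interpolated_score(("VP-VB-deliver", ("VB", "PP-TO-to")),
                         ("VP", ("VB", "PP")),
                         lex_counts, backoff_probs))          # 2 + 5 × 0.3 = 3.5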
Discriminative methods

- Like an HMM, a PCFG is a generative model
- Obvious question: is there a discriminative (CRF-like) model?
- Early answer: yes, but it takes weeks to train
- And the features we want to use often violate the Markov property
  - For instance, constructions
  - Checking agreement across multiple nodes
  - ...in which case you can't train at all!
Reranking

A combined discriminative/generative system:
- Print out the top 100 parses
- Use max-ent to pick the best one (using arbitrary features)
  - Not necessarily the one with the highest original prob
- Max-ent features can violate the Markov property
  - Because the label set size is reduced to 100
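A sketch of the reranking step, assuming each candidate parse comes with a feature dict (which may include arbitrary, non-local features) and that a weight vector has already been trained; feature names and weights are made up:

def rerank(candidates, weights):
    """candidates: list of (parse, feature_dict); return the parse whose
    weighted feature score is highest."""
    def score(feats):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(candidates, key=lambda cand: score(cand[1]))[0]

candidates = [("parse_1", {"log_pcfg_prob": -20.1, "subject_verb_agree": 1.0}),
              ("parse_2", {"log_pcfg_prob": -19.8, "subject_verb_agree": 0.0})]
weights = {"log_pcfg_prob": 1.0, "subject_verb_agree": 2.0}
print(rerank(candidates, weights))   # 'parse_1': not the candidate the PCFG alone prefers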
Parsing results
- Naive PCFG from Treebank: 70%
- Markovization, lexicalize a few prep/comp/conj: 87% (short sentences)
  - Klein+Manning 2003
- Fully lexicalized CFG with backoff smoothing: 87-88% (short sentences)
  - Charniak 1997, Collins 1999
- Reranking with large max-ent feature set: 90-91% (all sentences)
  - Collins 2000, Charniak+Johnson 2005
- Further improvements to about 92.5%
  - Notably the Petrov+Klein Berkeley parser, which we will discuss later