Tree Kernels for Parsing (Collins & Duffy, 2001)
Advanced Statistical Methods in NLP
Ling 572, February 28, 2012
Roadmap
Tree kernels for parsing (Collins & Duffy, 2001):
- Motivation
- Parsing as reranking
- Tree kernels for similarity
- Case study: Penn Treebank parsing
Motivation: Parsing
Parsing task:
- Given a natural language sentence, extract its syntactic structure
- Specifically, generate a corresponding parse tree
Approaches:
- "Classical" approach: hand-write CFG productions; parse with a standard algorithm, e.g. CKY
- Probabilistic approach: build a large treebank of parsed sentences, learn production probabilities, parse with probabilistic versions of the standard algorithms, and pick the highest-probability parse
Parsing Issues
Main issues:
- Robustness: get a reasonable parse for any input
- Ambiguity: select the best parse among alternatives
"Classical" approach:
- Hand-coded grammars are often fragile
- No obvious mechanism to select among alternatives
Probabilistic approach:
- Fairly good robustness: any input receives some parse, if with small probability
- Selects by probability, but decisions are local: hard to capture more global structure
Approach: Parsing by Reranking
Intuition:
- Identify a collection of candidate parses for each sentence, e.g. the output of a PCFG parser
- For training, identify (gold-standard parse, sentence) pairs
- Create a parse-tree vector representation; identify (more global) parse-tree features
- Train a reranker to rank the gold standard highest
- Apply it to rerank the candidate parses for new sentences
Parsing as Reranking, Formally
Training data pairs: {(si, ti)}, where si is a sentence and ti is its parse tree
C(si) = {xij}: the candidate parses for si; wlog, xi1 is the correct parse for si
h(xij): the feature vector representation of xij
Training: learn a weight vector w that scores the correct parse above its competitors
Decoding: compute argmax over x ∈ C(s) of w · h(x)
Parsing as Reranking: Training
Consider the hard-margin SVM model: minimize ||w||² subject to constraints
What constraints? Here, ranking constraints: specifically, the correct parse outranks all other candidates
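Formally, in standard hard-margin ranking form (a reconstruction; the unit margin on the right-hand side is the conventional choice):

    \min_{w} \; \|w\|^2
    \quad \text{subject to} \quad
    w \cdot h(x_{i1}) - w \cdot h(x_{ij}) \ge 1,
    \quad \forall i,\; \forall j \ge 2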
Reformulating with α
Training learns αij such that w = Σi Σj≥2 αij (h(xi1) − h(xij))
Note: just like the SVM equation, with a different constraint
Parse scoring: after substitution, f(x) = w · h(x) = Σi Σj≥2 αij (h(xi1) − h(xij)) · h(x)
After the kernel trick: f(x) = Σi Σj≥2 αij (K(xi1, x) − K(xij, x))
Note: with a suitable kernel K, we never need the vectors h(x) explicitly
Parsing as Reranking: Perceptron Algorithm
Similar to the SVM, learns a separating hyperplane, modeled with weight vector w, using a simple iterative procedure based on correcting errors in the current model (a runnable sketch follows below):
- Initialize αij = 0
- For i = 1,…,n; for j = 2,…,ni:
    if f(xi1) > f(xij): continue
    else: αij += 1
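A minimal Python sketch of this kernelized reranking perceptron. Everything here is an illustrative assumption, not the authors' code: candidates[i] is the list of candidate parses for sentence i, with candidates[i][0] the gold parse, and kernel is any tree-kernel function (e.g. the tree_kernel sketch later in this section).

    def score(x, alpha, candidates, kernel):
        # f(x) = sum over i, j>=2 of alpha[i][j] * (K(x_i1, x) - K(x_ij, x))
        total = 0.0
        for i, cands in enumerate(candidates):
            for j in range(1, len(cands)):   # index 0 is the gold parse
                if alpha[i][j]:
                    total += alpha[i][j] * (kernel(cands[0], x) - kernel(cands[j], x))
        return total

    def train_perceptron(candidates, kernel, epochs=1):
        # alpha[i][j] counts how often candidate j of sentence i tied or beat the gold parse
        alpha = [[0] * len(cands) for cands in candidates]
        for _ in range(epochs):
            for i, cands in enumerate(candidates):
                for j in range(1, len(cands)):
                    if score(cands[0], alpha, candidates, kernel) <= \
                       score(cands[j], alpha, candidates, kernel):
                        alpha[i][j] += 1
        return alpha

Decoding then just picks max(C(s), key=lambda x: score(x, alpha, candidates, kernel)). This naive loop rescores from scratch at every step, so it is far slower than a careful implementation, but it mirrors the update rule on the slide directly.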
Defining the Kernel
So we have a model: a framework for training and a framework for decoding
But we still need to define the kernel K
We need K: X × X → ℝ
Recall that here each x ∈ X is a tree, and K is a similarity function
What's in a Kernel?
What are good attributes of a kernel?
- Captures similarity between instances; here, between parse trees
- Captures more global parse information than a PCFG
- Is computable tractably, even over complex, large trees
Tree Kernel Proposal
Idea:
- PCFG models learn MLE probabilities on rewrite rules: NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N
- These are local to parent:children levels
- The new measure instead incorporates all tree fragments in a parse
- Captures higher-order, longer-distance dependencies
- Tracks counts of individual rules, plus much more
Tree Representation
Tree fragments:
- Any subgraph with more than one node
- Restriction: must include full (not partial) rule productions
Parse tree representation: h(T) = (h1(T), h2(T), …, hn(T))
- n: the number of distinct tree fragments in the training data
- hi(T): the number of occurrences of the ith tree fragment in the current tree
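To make the later computations concrete, here is a minimal parse-tree data structure in Python. It is an assumption of these sketches, not anything from the slides: each node holds a label and an ordered list of children, and leaves are the words themselves.

    class Node:
        # A parse-tree node: a label plus an ordered list of child nodes.
        def __init__(self, label, children=()):
            self.label = label
            self.children = list(children)

        def production(self):
            # The rewrite rule at this node, e.g. ('NP', ('DT', 'JJ', 'N'))
            return (self.label, tuple(c.label for c in self.children))

        def is_preterminal(self):
            # A preterminal dominates exactly one leaf (a word)
            return len(self.children) == 1 and not self.children[0].children

        def nodes(self):
            # Yield every internal (non-leaf) node, root first
            if self.children:
                yield self
                for c in self.children:
                    yield from c.nodes()

For example, the NP "a red apple" from the example below:

    np = Node('NP', [Node('DT', [Node('a')]),
                     Node('JJ', [Node('red')]),
                     Node('N', [Node('apple')])])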
Tree Representation
Pros:
- Fairly intuitive model
- Natural inner-product interpretation
- Captures long- and short-range dependencies
Cons:
- Size!!! The number of subtrees is exponential in the size of the tree
- Direct computation of the inner product is intractable
Key Challenge
Efficient computation:
- Find a kernel that computes similarity efficiently, in terms of common subtrees
- Pure enumeration is clearly intractable
- Instead, compute recursively over subtrees, using a polynomial-time process
Counting Common Subtrees
Define C(n1, n2): the number of common subtrees rooted at n1 and n2
[Worked example omitted: counting common subtrees for "a red apple"; due to F. Xia]
Calculating C(n1, n2)
Given two subtrees rooted at n1 and n2:
- If the productions at n1 and n2 are different: C(n1, n2) = 0
- If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1, n2) = 1
- Else: C(n1, n2) = Π_{j=1..nc(n1)} (1 + C(ch(n1, j), ch(n2, j)))
Here nc(n1) is the number of children of n1 (what about n2? same production ⇒ same number of children), and ch(n1, j) is the jth child of n1
Components of Tree Representation
Tree representation: h(T) = (h1(T), h2(T), …, hn(T))
Consider two trees T1 and T2, with N1 and N2 nodes, respectively
Define Ii(n) = 1 if the ith fragment is rooted at n, 0 otherwise
Then hi(T) = Σ_{n ∈ T} Ii(n), and so
K(T1, T2) = h(T1) · h(T2) = Σi hi(T1) hi(T2) = Σ_{n1 ∈ T1} Σ_{n2 ∈ T2} Σi Ii(n1) Ii(n2) = Σ_{n1 ∈ T1} Σ_{n2 ∈ T2} C(n1, n2)
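A minimal Python sketch of this computation, using the Node class above. The memo dict, and the lam parameter (which anticipates the rescaling improvement below), are implementation choices of this sketch, not from the slides:

    def tree_kernel(t1, t2, lam=1.0):
        # K(T1, T2) = sum over node pairs of C(n1, n2).
        # Memoizing C keeps the total work at O(N1 * N2) node pairs.
        memo = {}

        def C(n1, n2):
            key = (id(n1), id(n2))
            if key in memo:
                return memo[key]
            if n1.production() != n2.production():
                result = 0.0
            elif n1.is_preterminal():
                result = lam               # lam = 1.0 gives the unweighted kernel
            else:
                result = lam
                for c1, c2 in zip(n1.children, n2.children):
                    result *= 1.0 + C(c1, c2)
            memo[key] = result
            return result

        return sum(C(n1, n2) for n1 in t1.nodes() for n2 in t2.nodes())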
Tree Kernel Computation
Running time of K(T1, T2): O(N1 N2), based on the recursive computation of C
Remaining issues:
- K(T1, T2) depends on the sizes of T1 and T2
- K(T1, T1) >> K(T1, T2) for T1 ≠ T2, e.g. on the order of 10^6 vs. 10^2: the kernel is very 'peaked'
Improvements
Managing tree size:
- Normalize!! (like cosine similarity)
- Downweight large trees:
  - Restrict depth: just threshold
  - Rescale with a weight λ
Rescaling
Given two subtrees rooted at n1 and n2:
- If the productions at n1 and n2 are different: C(n1, n2) = 0
- If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1, n2) = λ
- Else: C(n1, n2) = λ Π_{j=1..nc(n1)} (1 + C(ch(n1, j), ch(n2, j)))
with 0 < λ ≤ 1
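Both improvements fit in one short sketch: the lam parameter of tree_kernel above already implements this rescaling, and cosine-style normalization divides out the tree-size effect. The slides do not fix a particular λ, so the default below is only a placeholder:

    import math

    def normalized_kernel(t1, t2, lam=1.0):
        # K'(T1, T2) = K(T1, T2) / sqrt(K(T1, T1) * K(T2, T2)),
        # analogous to cosine similarity; corrects for tree size
        k12 = tree_kernel(t1, t2, lam)
        return k12 / math.sqrt(tree_kernel(t1, t1, lam) * tree_kernel(t2, t2, lam))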
Parsing Experiment
Data: Penn Treebank ATIS corpus segment
- Training: 800 sentences, with the top 20 parses per sentence
- Development: 200 sentences
- Test: 336 sentences; select the best candidate from the top 100 parses
Classifier: voted perceptron, kernelized like an SVM but more computationally tractable
Evaluation: 10 runs, average parse score reported