
Tree Kernels for Parsing (Collins & Duffy, 2001)
Advanced Statistical Methods in NLP
Ling 572, February 28, 2012


Roadmap
Collins & Duffy, 2001, Tree Kernels for Parsing:
- Motivation
- Parsing as reranking
- Tree kernels for similarity
- Case study: Penn Treebank parsing


Motivation: Parsing
Parsing task:
- Given a natural language sentence, extract its syntactic structure; specifically, generate a corresponding parse tree.
Approaches:
- “Classical” approach: hand-write CFG productions; parse with a standard algorithm, e.g. CKY.
- Probabilistic approach: build a large treebank of parsed sentences, learn production probabilities, use probabilistic versions of the standard algorithms, and pick the highest-probability parse.


Parsing Issues
Main issues:
- Robustness: get a reasonable parse for any input.
- Ambiguity: select the best parse from among alternatives.
“Classic” approach:
- Hand-coded grammars are often fragile.
- No obvious mechanism to select among alternatives.
Probabilistic approach:
- Fairly good robustness: assigns at least small probabilities to almost any input.
- Selects by probability, but the decisions are local; hard to capture more global structure.


Approach: Parsing by Reranking
Intuition:
- Identify a collection of candidate parses for each sentence, e.g. the output of a PCFG parser.
- For training, identify the gold-standard (parse, sentence) pair.
- Create a parse tree vector representation; identify (more global) parse tree features.
- Train a reranker to rank the gold standard highest.
- Apply the reranker to the candidate parses for each new sentence.


Parsing as Reranking, Formally
- Training data pairs: {(si, ti)}, where si is a sentence and ti is its parse tree.
- C(si) = {xij}: the candidate parses for si; wlog, xi1 is the correct parse for si.
- h(xij): the feature vector representation of xij.
- Training: learn a parameter vector w that scores a parse x by f(x) = w · h(x).
- Decoding: compute F(s) = argmax over x in C(s) of w · h(x).
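To make the decoding step concrete, here is a minimal sketch under an explicit feature map (the names decode, h, and w are illustrative; the point of the kernel trick later is to avoid ever materializing h):

```python
import numpy as np

def decode(candidates, h, w):
    """Reranking decode step: return the candidate parse x in C(s)
    that maximizes the linear score f(x) = w . h(x)."""
    scores = [np.dot(w, h(x)) for x in candidates]
    return candidates[int(np.argmax(scores))]
```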


Parsing as Reranking: Training
Consider the hard-margin SVM model: minimize ||w||2 subject to constraints.
What constraints? Here, ranking constraints: specifically, the correct parse must outrank all other candidates.
Formally: w · h(xi1) ≥ w · h(xij) + 1, for all i and all j ≥ 2.


Reformulating with α
Training learns αij such that w = Σi Σj≥2 αij (h(xi1) − h(xij)).
Note: just like the SVM equation, with a different constraint.
Parse scoring: after substitution, we have f(x) = w · h(x) = Σi Σj≥2 αij (h(xi1) − h(xij)) · h(x).
After the kernel trick, we have f(x) = Σi Σj≥2 αij (K(xi1, x) − K(xij, x)).
Note: with a suitable kernel K, we never need the explicit feature vectors h(x).
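A minimal sketch of that dual-form scorer, assuming candidates are stored so that train[i][0] is the gold parse xi1 (the function and variable names are illustrative, not from the paper):

```python
def dual_score(x, alphas, train, K):
    """f(x) = sum over (i, j) of alpha_ij * (K(x_i1, x) - K(x_ij, x)).
    train[i] is the candidate list for sentence i (train[i][0] = gold
    parse x_i1); alphas maps (i, j) -> alpha_ij. Only the kernel K is
    ever evaluated -- the feature vectors h(x) never appear."""
    return sum(a * (K(train[i][0], x) - K(train[i][j], x))
               for (i, j), a in alphas.items())
```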


Parsing as Reranking: Perceptron Algorithm
Similar to the SVM, the perceptron learns a separating hyperplane, modeled with a weight vector w, but uses a simple iterative procedure based on correcting errors in the current model:
- Initialize all αij = 0
- For i = 1, …, n; for j = 2, …, |C(si)|:
    - if f(xi1) > f(xij): continue
    - else: αij += 1
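A sketch of that training loop in terms of dual_score above. Collins & Duffy actually use the voted perceptron, which additionally keeps a vote-weighted record of intermediate models; this shows only the basic update:

```python
from collections import defaultdict

def train_perceptron(train, K, epochs=1):
    """Kernelized perceptron for reranking. train[i] lists the candidate
    parses for sentence i, with train[i][0] the gold parse. Whenever a
    competitor scores at least as high as the gold parse, bump alpha_ij."""
    alphas = defaultdict(float)
    for _ in range(epochs):
        for i, candidates in enumerate(train):
            for j in range(1, len(candidates)):
                gold_score = dual_score(candidates[0], alphas, train, K)
                if gold_score > dual_score(candidates[j], alphas, train, K):
                    continue           # correctly ranked; no update
                alphas[(i, j)] += 1.0  # mistake: strengthen this constraint
    return alphas
```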


Defining the Kernel
So, we have a model: a framework for training and a framework for decoding.
But we still need to define a kernel K.
We need K: X × X → R.
Recall that here each x in X is a tree, so K is a similarity function over parse trees.


What’s in a Kernel?
What are good attributes of a kernel? It should:
- Capture similarity between instances; here, between parse trees.
- Capture more global parse information than a PCFG.
- Be computable tractably, even over complex, large trees.


Tree Kernel Proposal
Idea:
- PCFG models learn MLE probabilities for rewrite rules (NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N), which are local to parent:children levels.
- The new measure instead incorporates all tree fragments in a parse, capturing higher-order, longer-distance dependencies: it tracks the counts of individual rules, plus much more.


Tree Fragment Example
[Figure: example tree fragments of the NP over ‘apple’; not exhaustive.]


Tree Representation
Tree fragments: any subgraph with more than one node. Restriction: fragments must include full (not partial) rule productions.
Parse tree representation: h(T) = (h1(T), h2(T), …, hn(T)), where
- n: the number of distinct tree fragments in the training data
- hi(T): the number of occurrences of the ith tree fragment in the current tree


Tree Representation
Pros:
- Fairly intuitive model
- Natural inner product interpretation
- Captures long- and short-range dependencies
Cons:
- Size!!! The number of subtrees is exponential in the size of the tree, so direct computation of the inner product is intractable.


Key Challenge
Efficient computation:
- Find a kernel that computes similarity efficiently, in terms of common subtrees.
- Pure enumeration is clearly intractable.
- Instead, compute recursively over subtrees, using a polynomial-time process.


Counting Common Subtrees
Example: C(n1, n2) is the number of common subtrees rooted at n1 and n2.
[Figure, due to F. Xia: example trees over ‘a red apple’ illustrating C(n1, n2).]


Calculating C(n1, n2)
Given two subtrees rooted at n1 and n2:
- If the productions at n1 and n2 are different: C(n1, n2) = 0.
- If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1, n2) = 1.
- Else: C(n1, n2) = Π over j = 1..nc(n1) of (1 + C(ch(n1, j), ch(n2, j)))
where nc(n1) is the number of children of n1 (what about n2? same production ⇒ same number of children), and ch(n1, j) is the jth child of n1.
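A recursive sketch of C(n1, n2), assuming a simple node schema with a .label and an ordered .children list (production and is_preterminal are illustrative helpers, not an established API):

```python
from math import prod

def production(n):
    # The rewrite rule at n, e.g. ('NP', ('DT', 'N')).
    return (n.label, tuple(c.label for c in n.children))

def is_preterminal(n):
    # A preterminal rewrites to a single leaf (word) node.
    return len(n.children) == 1 and not n.children[0].children

def C(n1, n2):
    """Number of common tree fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):   # same production, so n2 is one too
        return 1
    # Same production => same number of children, so zip is safe.
    return prod(1 + C(c1, c2)
                for c1, c2 in zip(n1.children, n2.children))
```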


Components of Tree Representation
Tree representation: h(T) = (h1(T), h2(T), …, hn(T)).
Consider two trees T1 and T2, with N1 and N2 nodes respectively.
Define Ii(n) = 1 if the ith fragment is rooted at n, 0 otherwise.
Then hi(T) = Σ over n in T of Ii(n), and
K(T1, T2) = h(T1) · h(T2) = Σi hi(T1) hi(T2) = Σ over n1 in T1, n2 in T2 of Σi Ii(n1) Ii(n2) = Σ over n1 in T1, n2 in T2 of C(n1, n2).


Tree Kernel Computation
Running time of K(T1, T2): O(N1 N2), based on the recursive computation of C.
Remaining issues:
- K(T1, T2) depends on the sizes of T1 and T2.
- K(T1, T1) >> K(T1, T2) for T1 ≠ T2 (on the order of 10^6 vs. 10^2): the kernel is very ‘peaked’.
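A sketch of the full kernel, reusing the helpers (and math.prod import) from the C(n1, n2) sketch, with a shared memo table so each node pair is computed once, giving the O(N1 N2) bound (tree_kernel and the node-list inputs are illustrative names):

```python
def tree_kernel(t1_nodes, t2_nodes, lam=1.0):
    """K(T1, T2) = sum over node pairs (n1, n2) of C(n1, n2).
    t1_nodes / t2_nodes are the node lists of the two trees; lam=1.0
    is the plain count, 0 < lam < 1 the rescaled version (next slides)."""
    memo = {}

    def c(n1, n2):
        key = (id(n1), id(n2))
        if key not in memo:
            if production(n1) != production(n2):
                memo[key] = 0.0
            elif is_preterminal(n1):
                memo[key] = lam
            else:
                memo[key] = lam * prod(
                    1.0 + c(c1, c2)
                    for c1, c2 in zip(n1.children, n2.children))
        return memo[key]

    return sum(c(n1, n2) for n1 in t1_nodes for n2 in t2_nodes)
```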


Improvements
Managing tree size:
- Normalize!! (like cosine similarity): K′(T1, T2) = K(T1, T2) / sqrt(K(T1, T1) · K(T2, T2)).
- Downweight large trees: restrict fragment depth (just threshold), or rescale with a weight λ.


Rescaling
Given two subtrees rooted at n1 and n2, with 0 < λ ≤ 1:
- If the productions at n1 and n2 are different: C(n1, n2) = 0.
- If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1, n2) = λ.
- Else: C(n1, n2) = λ · Π over j = 1..nc(n1) of (1 + C(ch(n1, j), ch(n2, j))).
This downweights a fragment geometrically in its size: a common fragment containing k rule productions contributes λ^k.
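Putting both fixes together as a sketch on top of tree_kernel above (normalized_kernel is an illustrative name, and λ = 0.3 is an arbitrary example value, not a setting reported on these slides):

```python
from math import sqrt

def normalized_kernel(t1_nodes, t2_nodes, lam=0.3):
    """Cosine-style normalization of the lam-rescaled tree kernel,
    so that K'(T, T) = 1 regardless of tree size."""
    k12 = tree_kernel(t1_nodes, t2_nodes, lam)
    k11 = tree_kernel(t1_nodes, t1_nodes, lam)
    k22 = tree_kernel(t2_nodes, t2_nodes, lam)
    return k12 / sqrt(k11 * k22) if k11 and k22 else 0.0
```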


Case Study


Parsing Experiment
Data: Penn Treebank ATIS corpus segment.
- Training: 800 sentences, with the top 20 parses per sentence
- Development: 200 sentences
- Test: 336 sentences; select the best candidate from the top 100 parses
Classifier: voted perceptron, kernelized like an SVM but more computationally tractable.
Evaluation: 10 runs, average parse score reported.


Experimental Results
Baseline system: 74% parse score.
The tree-kernel reranker gives a substantial improvement: 6% absolute (i.e., to 80%).


Summary
- Parsing framed as a reranking problem.
- Tree kernel: computes similarity between trees based on their fragments, with an efficient recursive computation procedure.
- Yields improved performance on the parsing task.