
Tree Kernels for Parsing (Collins & Duffy, 2001)
Advanced Statistical Methods in NLP
Ling 572, February 28, 2012


Roadmap
Collins & Duffy, 2001, Tree Kernels for Parsing:
- Motivation
- Parsing as reranking
- Tree kernels for similarity
- Case study: Penn Treebank parsing


Motivation: Parsing
Parsing task:
- Given a natural language sentence, extract its syntactic structure; specifically, generate a corresponding parse tree.
Approaches:
- “Classical” approach: hand-write CFG productions; parse with a standard algorithm, e.g. CKY.
- Probabilistic approach: build a large treebank of parsed sentences, learn production probabilities, use probabilistic versions of the standard algorithms, and pick the highest-probability parse.


Parsing Issues
Main issues:
- Robustness: get a reasonable parse for any input.
- Ambiguity: select the best parse from among alternatives.
“Classic” approach:
- Hand-coded grammars are often fragile.
- No obvious mechanism to select among alternatives.
Probabilistic approach:
- Fairly good robustness: assigns at least small probabilities to almost any input.
- Selects by probability, but the decisions are local; hard to capture more global structure.


Approach: Parsing by Reranking
Intuition:
- Identify a collection of candidate parses for each sentence, e.g. the output of a PCFG parser.
- For training, identify the gold-standard (parse, sentence) pair.
- Create a parse tree vector representation; identify (more global) parse tree features.
- Train a reranker to rank the gold standard highest.
- Apply the reranker to the candidate parses for each new sentence.


Parsing as Reranking, Formally
- Training data pairs: {(si, ti)}, where si is a sentence and ti is its parse tree.
- C(si) = {xij}: the candidate parses for si; wlog, xi1 is the correct parse for si.
- h(xij): the feature vector representation of xij.
- Training: learn a parameter vector w that scores a parse x by f(x) = w · h(x).
- Decoding: compute F(s) = argmax over x in C(s) of w · h(x).
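To make the decoding step concrete, here is a minimal sketch under an explicit feature map (the names decode, h, and w are illustrative; the point of the kernel trick later is to avoid ever materializing h):

```python
import numpy as np

def decode(candidates, h, w):
    """Reranking decode step: return the candidate parse x in C(s)
    that maximizes the linear score f(x) = w . h(x)."""
    scores = [np.dot(w, h(x)) for x in candidates]
    return candidates[int(np.argmax(scores))]
```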


Parsing as Reranking: Training
Consider the hard-margin SVM model: minimize ||w||2 subject to constraints.
What constraints? Here, ranking constraints: specifically, the correct parse must outrank all other candidates.
Formally: w · h(xi1) ≥ w · h(xij) + 1, for all i and all j ≥ 2.


Reformulating with α
Training learns αij such that w = Σi Σj≥2 αij (h(xi1) − h(xij)).
Note: just like the SVM equation, with a different constraint.
Parse scoring: after substitution, we have f(x) = w · h(x) = Σi Σj≥2 αij (h(xi1) − h(xij)) · h(x).
After the kernel trick, we have f(x) = Σi Σj≥2 αij (K(xi1, x) − K(xij, x)).
Note: with a suitable kernel K, we never need the explicit feature vectors h(x).
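A minimal sketch of that dual-form scorer, assuming candidates are stored so that train[i][0] is the gold parse xi1 (the function and variable names are illustrative, not from the paper):

```python
def dual_score(x, alphas, train, K):
    """f(x) = sum over (i, j) of alpha_ij * (K(x_i1, x) - K(x_ij, x)).
    train[i] is the candidate list for sentence i (train[i][0] = gold
    parse x_i1); alphas maps (i, j) -> alpha_ij. Only the kernel K is
    ever evaluated -- the feature vectors h(x) never appear."""
    return sum(a * (K(train[i][0], x) - K(train[i][j], x))
               for (i, j), a in alphas.items())
```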


Parsing as Reranking: Perceptron Algorithm
Similar to the SVM, the perceptron learns a separating hyperplane, modeled with a weight vector w, but uses a simple iterative procedure based on correcting errors in the current model:
- Initialize all αij = 0
- For i = 1, …, n; for j = 2, …, |C(si)|:
    - if f(xi1) > f(xij): continue
    - else: αij += 1
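A sketch of that training loop in terms of dual_score above. Collins & Duffy actually use the voted perceptron, which additionally keeps a vote-weighted record of intermediate models; this shows only the basic update:

```python
from collections import defaultdict

def train_perceptron(train, K, epochs=1):
    """Kernelized perceptron for reranking. train[i] lists the candidate
    parses for sentence i, with train[i][0] the gold parse. Whenever a
    competitor scores at least as high as the gold parse, bump alpha_ij."""
    alphas = defaultdict(float)
    for _ in range(epochs):
        for i, candidates in enumerate(train):
            for j in range(1, len(candidates)):
                gold_score = dual_score(candidates[0], alphas, train, K)
                if gold_score > dual_score(candidates[j], alphas, train, K):
                    continue           # correctly ranked; no update
                alphas[(i, j)] += 1.0  # mistake: strengthen this constraint
    return alphas
```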


Defining the Kernel
So, we have a model: a framework for training and a framework for decoding.
But we still need to define a kernel K.
We need K: X × X → R.
Recall that here each x in X is a tree, so K is a similarity function over parse trees.


What’s in a Kernel?
What are good attributes of a kernel? It should:
- Capture similarity between instances; here, between parse trees.
- Capture more global parse information than a PCFG.
- Be computable tractably, even over complex, large trees.


Tree Kernel Proposal
Idea:
- PCFG models learn MLE probabilities for rewrite rules (NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N), which are local to parent:children levels.
- The new measure instead incorporates all tree fragments in a parse, capturing higher-order, longer-distance dependencies: it tracks the counts of individual rules, plus much more.


Tree Fragment Example
[Figure: example tree fragments of the NP over ‘apple’; not exhaustive.]


Tree Representation
Tree fragments: any subgraph with more than one node. Restriction: fragments must include full (not partial) rule productions.
Parse tree representation: h(T) = (h1(T), h2(T), …, hn(T)), where
- n: the number of distinct tree fragments in the training data
- hi(T): the number of occurrences of the ith tree fragment in the current tree


Tree Representation
Pros:
- Fairly intuitive model
- Natural inner product interpretation
- Captures long- and short-range dependencies
Cons:
- Size!!! The number of subtrees is exponential in the size of the tree, so direct computation of the inner product is intractable.


Key Challenge
Efficient computation:
- Find a kernel that computes similarity efficiently, in terms of common subtrees.
- Pure enumeration is clearly intractable.
- Instead, compute recursively over subtrees, using a polynomial-time process.


Counting Common Subtrees
Example: C(n1, n2) is the number of common subtrees rooted at n1 and n2.
[Figure, due to F. Xia: example trees over ‘a red apple’ illustrating C(n1, n2).]


Calculating C(n1, n2)
Given two subtrees rooted at n1 and n2:
- If the productions at n1 and n2 are different: C(n1, n2) = 0.
- If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1, n2) = 1.
- Else: C(n1, n2) = Π over j = 1..nc(n1) of (1 + C(ch(n1, j), ch(n2, j)))
where nc(n1) is the number of children of n1 (what about n2? same production ⇒ same number of children), and ch(n1, j) is the jth child of n1.
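A recursive sketch of C(n1, n2), assuming a simple node schema with a .label and an ordered .children list (production and is_preterminal are illustrative helpers, not an established API):

```python
from math import prod

def production(n):
    # The rewrite rule at n, e.g. ('NP', ('DT', 'N')).
    return (n.label, tuple(c.label for c in n.children))

def is_preterminal(n):
    # A preterminal rewrites to a single leaf (word) node.
    return len(n.children) == 1 and not n.children[0].children

def C(n1, n2):
    """Number of common tree fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):   # same production, so n2 is one too
        return 1
    # Same production => same number of children, so zip is safe.
    return prod(1 + C(c1, c2)
                for c1, c2 in zip(n1.children, n2.children))
```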


Components of Tree Representation
Tree representation: h(T) = (h1(T), h2(T), …, hn(T)).
Consider two trees T1 and T2, with N1 and N2 nodes respectively.
Define Ii(n) = 1 if the ith fragment is rooted at n, 0 otherwise.
Then hi(T) = Σ over n in T of Ii(n), and
K(T1, T2) = h(T1) · h(T2) = Σi hi(T1) hi(T2) = Σ over n1 in T1, n2 in T2 of Σi Ii(n1) Ii(n2) = Σ over n1 in T1, n2 in T2 of C(n1, n2).


Tree Kernel Computation
Running time of K(T1, T2): O(N1 N2), based on the recursive computation of C.
Remaining issues:
- K(T1, T2) depends on the sizes of T1 and T2.
- K(T1, T1) >> K(T1, T2) for T1 ≠ T2 (on the order of 10^6 vs. 10^2): the kernel is very ‘peaked’.
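A sketch of the full kernel, reusing the helpers (and math.prod import) from the C(n1, n2) sketch, with a shared memo table so each node pair is computed once, giving the O(N1 N2) bound (tree_kernel and the node-list inputs are illustrative names):

```python
def tree_kernel(t1_nodes, t2_nodes, lam=1.0):
    """K(T1, T2) = sum over node pairs (n1, n2) of C(n1, n2).
    t1_nodes / t2_nodes are the node lists of the two trees; lam=1.0
    is the plain count, 0 < lam < 1 the rescaled version (next slides)."""
    memo = {}

    def c(n1, n2):
        key = (id(n1), id(n2))
        if key not in memo:
            if production(n1) != production(n2):
                memo[key] = 0.0
            elif is_preterminal(n1):
                memo[key] = lam
            else:
                memo[key] = lam * prod(
                    1.0 + c(c1, c2)
                    for c1, c2 in zip(n1.children, n2.children))
        return memo[key]

    return sum(c(n1, n2) for n1 in t1_nodes for n2 in t2_nodes)
```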


Improvements
Managing tree size:
- Normalize!! (like cosine similarity): K′(T1, T2) = K(T1, T2) / sqrt(K(T1, T1) · K(T2, T2)).
- Downweight large trees: restrict fragment depth (just threshold), or rescale with a weight λ.


Rescaling
Given two subtrees rooted at n1 and n2, with 0 < λ ≤ 1:
- If the productions at n1 and n2 are different: C(n1, n2) = 0.
- If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1, n2) = λ.
- Else: C(n1, n2) = λ · Π over j = 1..nc(n1) of (1 + C(ch(n1, j), ch(n2, j))).
This downweights a fragment geometrically in its size: a common fragment containing k rule productions contributes λ^k.
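Putting both fixes together as a sketch on top of tree_kernel above (normalized_kernel is an illustrative name, and λ = 0.3 is an arbitrary example value, not a setting reported on these slides):

```python
from math import sqrt

def normalized_kernel(t1_nodes, t2_nodes, lam=0.3):
    """Cosine-style normalization of the lam-rescaled tree kernel,
    so that K'(T, T) = 1 regardless of tree size."""
    k12 = tree_kernel(t1_nodes, t2_nodes, lam)
    k11 = tree_kernel(t1_nodes, t1_nodes, lam)
    k22 = tree_kernel(t2_nodes, t2_nodes, lam)
    return k12 / sqrt(k11 * k22) if k11 and k22 else 0.0
```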


Case Study


Parsing Experiment
Data: Penn Treebank ATIS corpus segment.
- Training: 800 sentences, with the top 20 parses per sentence
- Development: 200 sentences
- Test: 336 sentences; select the best candidate from the top 100 parses
Classifier: voted perceptron, kernelized like an SVM but more computationally tractable.
Evaluation: 10 runs, average parse score reported.


Experimental Results
Baseline system: 74% parse score.
The tree-kernel reranker gives a substantial improvement: 6% absolute (i.e., to 80%).


Summary
- Parsing framed as a reranking problem.
- Tree kernel: computes similarity between trees based on their fragments, with an efficient recursive computation procedure.
- Yields improved performance on the parsing task.