
2010/2/4 Yi-Ting Huang

Pennacchiotti, M., & Zanzotto, F. M. Learning Shallow Semantic Rules for Textual Entailment. Recent Advances in Natural Language Processing (RANLP 2007).

Zanzotto, F. M., & Moschitti, A. Automatic Learning of Textual Entailments with Cross-pair Similarities. ACL 2006.


Recognizing Textual Entailment (RTE)

What is RTE: To determine whether or not a text T entails a hypothesis H.

Example:
T1: “At the end of the year, all solid companies pay dividends.”
H1: “At the end of the year, all solid insurance companies pay dividends.”
H2: “At the end of the year, all solid companies pay cash dividends.”

Why RTE is important: it allows us to model more accurate semantic theories of natural languages and to design important applications (QA, IE, etc.).


Idea… (1/2)

Does T3 entail H3?
T3: “All wild animals eat plants that have scientifically proven medicinal properties.”
H3: “All wild mountain animals eat plants that have scientifically proven medicinal properties.”

T1: “At the end of the year, all solid companies pay dividends.”
H1: “At the end of the year, all solid insurance companies pay dividends.” Yes!

T3 is structurally (and somewhat lexically) similar to T1, and H3 is more similar to H1 than to H2. Thus, from T1 → H1 we may extract rules to derive that T3 → H3.


Idea… (2/2)

We should rely not only on an intra-pair similarity between T and H but also on a cross-pair similarity between two pairs (T’, H’) and (T’’, H’’).



Research purpose

In this paper, we define a new cross-pair similarity measure based on text and hypothesis syntactic trees and we use such similarity with traditional intra-pair similarities to define a novel semantic kernel function.

We experimented with this kernel using Support Vector Machines on the test sets of the Recognizing Textual Entailment (RTE) challenges.


Term definition

Given a word wt in the text T and a word wh in the hypothesis H, an anchor is a pair (wt, wh), e.g. (companies, companies). Anchors are marked in the syntactic trees with placeholders.


Intra-pair similarity (1/3)

Intra-pair similarity: to anchor the content words in the hypothesis WH to words in the text WT. Each word wh in WH is connected to the words wt in WT that have the highest similarity simw(wt, wh). As a result, we have a set of anchors and the subset of words in T connected with a word in H.

We select the final anchor set as the bijective relation between WT and WH that best satisfies a locality criterion: whenever possible, words of a constituent in H should be related to words of a constituent in T.
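The anchoring step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: sim_w here is a toy stand-in (surface-form equality) for the full word similarity measure defined on the next slides, and the locality criterion is omitted.

```python
# Toy sketch of anchor extraction between a text T and hypothesis H.
# sim_w is a hypothetical stand-in for the paper's simw(wt, wh).

def sim_w(wt: str, wh: str) -> float:
    """Toy word similarity: 1.0 on exact surface match, else 0.0."""
    return 1.0 if wt.lower() == wh.lower() else 0.0

def extract_anchors(text_words, hyp_words):
    """Connect each hypothesis word to its most similar text word."""
    anchors = []
    for wh in hyp_words:
        best_score, best_wt = max((sim_w(wt, wh), wt) for wt in text_words)
        if best_score > 0:
            anchors.append((best_wt, wh))
    return anchors

T = "all solid companies pay dividends".split()
H = "all solid insurance companies pay dividends".split()
print(extract_anchors(T, H))  # "insurance" finds no anchor in T
```

Note that the unmatched hypothesis word ("insurance") produces no anchor, which is exactly what distinguishes H1 from T1 in the earlier example.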


Intra-pair similarity (2/3)

1. Two words are maximally similar if they have the same surface form.

2. Otherwise, we use one of the WordNet (Miller, 1995) similarities, indicated with d(lw, lw’), where lw is the lemmatized form of a word w (Corley and Mihalcea, 2005). We adopted the wn::similarity package (Pedersen et al., 2004) to compute the Jiang & Conrath (J&C) distance (Jiang and Conrath, 1997).

3. We can also use WordNet 2.0 (Miller, 1995) to extract different relations between words, such as lexical entailment between verbs (Ent) and the derivational relation between words (Der).

4. Finally, we use the edit distance measure lev(wt, wh) to capture the similarity between words that the previous steps miss because of misspellings or derivational forms not coded in WordNet.
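The backoff ladder above can be sketched as follows. This is a hedged illustration: the WordNet-based steps (2 and 3) are stubbed out since they require an external resource, so only the surface match and the edit-distance fallback are shown.

```python
# Sketch of the word-similarity backoff: surface match first,
# edit distance lev(wt, wh) as a last resort (normalized to [0, 1]).

def lev(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def word_sim(wt: str, wh: str) -> float:
    if wt == wh:                  # 1. same surface form
        return 1.0
    # 2./3. WordNet similarity and relations would go here
    # 4. edit-distance fallback
    return 1.0 - lev(wt, wh) / max(len(wt), len(wh))

print(word_sim("dividends", "dividends"))          # 1.0
print(round(word_sim("company", "companies"), 2))  # 0.67
```

The fallback catches derivational pairs like company/companies even when WordNet offers no link.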


Intra-pair similarity (3/3)

The above word similarity measure can be used to compute the similarity between T and H, in line with (Corley and Mihalcea, 2005), where idf(w) is the inverse document frequency of the word w. We used a selected portion of the British National Corpus to compute the idf, and assigned the maximum idf to words not found in the BNC.
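A sketch of this idf-weighted text-hypothesis similarity, in the style of Corley and Mihalcea (2005): each hypothesis word contributes its best match in the text, weighted by idf. The three-document corpus and the exact-match word similarity are toy stand-ins for the BNC portion and the full measure the slides describe.

```python
import math

# Tiny stand-in corpus for idf computation (the authors used the BNC).
corpus = [
    "companies pay dividends",
    "insurance companies sell policies",
    "the year ends in december",
]
N = len(corpus)

def idf(w: str) -> float:
    df = sum(1 for doc in corpus if w in doc.split())
    # maximum idf for words not found in the corpus, as on the slide
    return math.log(N / df) if df else math.log(N)

def max_sim(wh: str, text_words) -> float:
    """Toy word similarity: best match is 1.0 on exact occurrence."""
    return 1.0 if wh in text_words else 0.0

def sim(T: str, H: str) -> float:
    tw, hw = T.split(), H.split()
    num = sum(max_sim(wh, tw) * idf(wh) for wh in hw)
    den = sum(idf(wh) for wh in hw)
    return num / den if den else 0.0

print(round(sim("companies pay dividends",
                "insurance companies pay dividends"), 3))
```

The unmatched word "insurance" carries a high idf, so it pulls the score well below 1.0, which is the desired behavior for a non-entailing extra content word.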


Cross-pair syntactic kernels

The cross-pair similarity captures the number of common subtrees between the texts (T’, T’’) and the hypotheses (H’, H’’) that share the same anchoring scheme. To compute it, we first derive the best mapping between the two placeholder sets.


The best mapping

Let A’ and A’’ be the placeholders of (T’, H’) and (T’’, H’’), with |A’| ≥ |A’’|; we align a subset of A’ to A’’.

Let C be the set of all bijective mappings c : a’ ⊆ A’ → A’’ with |a’| = |A’’|; an element c ∈ C is a substitution function.

The best alignment is

cmax = argmax_{c ∈ C} ( KT(t(H’, c), t(H’’, i)) + KT(t(T’, c), t(T’’, i)) )

where (i) t(S, c) returns the syntactic tree of the sentence S with placeholders replaced by means of the substitution c, (ii) i is the identity substitution, and (iii) KT(t1, t2) is a function that measures the similarity between the two trees t1 and t2.
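The argmax over C can be realized by brute force when placeholder sets are small, as they are in practice. The sketch below is illustrative only: tree_sim is a hypothetical stand-in for the tree kernel KT (it just counts positionally matching tokens after substitution), and sentences are flat token lists rather than trees.

```python
from itertools import permutations

def substitutions(A1, A2):
    """All bijective mappings from |A2|-sized subsets of A1 onto A2."""
    for perm in permutations(A1, len(A2)):
        yield dict(zip(perm, A2))

def tree_sim(t1, t2):
    """Toy stand-in for K_T: count positionally matching tokens."""
    return sum(a == b for a, b in zip(t1, t2))

def apply_subst(tokens, c):
    """Replace placeholders according to substitution c."""
    return [c.get(t, t) for t in tokens]

def best_mapping(H1, H2, A1, A2):
    """Pick the substitution c in C maximizing the similarity."""
    return max(substitutions(A1, A2),
               key=lambda c: tree_sim(apply_subst(H1, c), H2))

A1 = ["x1", "x2", "x3"]          # placeholders of (T', H')
A2 = ["y1", "y2"]                # placeholders of (T'', H'')
H1 = ["x1", "pays", "x2"]
H2 = ["y1", "pays", "y2"]
print(best_mapping(H1, H2, A1, A2))  # {'x1': 'y1', 'x2': 'y2'}
```

Enumerating permutations is exponential in general, but the number of anchors per pair is small enough that this is the natural formulation of the max over C.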


Example

cmax = argmax_{c ∈ C} ( KT(t(H’, c), t(H’’, i)) + KT(t(T’, c), t(T’’, i)) )

where (i) t(S, c) returns the syntactic tree of the sentence S with placeholders replaced by means of the substitution c, (ii) i is the identity substitution, and (iii) KT(t1, t2) measures the similarity between the two trees t1 and t2.


Cross-pair similarity

A tree kernel function over t1 and t2 is

KT(t1, t2) = Σ_{n1 ∈ Nt1} Σ_{n2 ∈ Nt2} Δ(n1, n2)

where Nt1 and Nt2 are the sets of the t1’s and t2’s nodes, respectively.

Given a subtree space F = {f1, f2, ...}, the indicator function Ii(n) is equal to 1 if the target fi is rooted at node n and equal to 0 otherwise. In turn,

Δ(n1, n2) = Σ_i λ^{l(fi)} Ii(n1) Ii(n2)

where l(fi) is the number of levels of the subtree fi. Thus λ^{l(fi)} assigns a lower weight to larger fragments; when λ = 1, Δ(n1, n2) is equal to the number of common fragments rooted at nodes n1 and n2. As KT(t1, t2) we use the tree kernel function defined in (Collins and Duffy, 2002).

The cross-pair similarity is then

Ks((T’, H’), (T’’, H’’)) = max_{c ∈ C} ( KT(t(H’, c), t(H’’, i)) + KT(t(T’, c), t(T’’, i)) )


Example

Ks((T’, H’), (T’’, H’’)) = max_{c ∈ C} ( KT(t(H’, c), t(H’’, i)) + KT(t(T’, c), t(T’’, i)) )

Ii(n) is equal to 1 if the target fi is rooted at node n and equal to 0 otherwise, and KT(t1, t2) is the tree kernel defined above.


Kernel function in SVM

The KT function has been proven to be a valid kernel, i.e. its associated Gram matrix is positive semidefinite.

Some basic operations on kernel functions, e.g. the sum, are closed with respect to the set of valid kernels.

The cross-pair similarity is therefore a valid kernel, and we can use it in kernel-based machines such as SVMs.

We developed SVM-light-TK (Moschitti, 2006), which encodes the basic tree kernel function KT in SVM-light (Joachims, 1999). We used this software to implement the Ks, K1, K2 and Ks + Ki kernels (i ∈ {1, 2}).
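The closure-under-sum property is what licenses the Ks + Ki combination: summing two valid Gram matrices yields another positive semidefinite Gram matrix. The sketch below shows only that mechanics with toy stand-in kernels on integers; the actual system computes Ks and Ki inside SVM-light-TK, not as explicit matrices.

```python
# Toy illustration of combining two kernels by summing Gram matrices.

def k_s(x, y):
    return min(x, y)       # hypothetical stand-in for the cross-pair kernel

def k_i(x, y):
    return float(x == y)   # hypothetical stand-in for an intra-pair kernel

def gram(kernel, xs):
    """Build the Gram matrix of a kernel over examples xs."""
    return [[kernel(a, b) for b in xs] for a in xs]

def add_grams(g1, g2):
    """Entrywise sum: valid kernels are closed under addition."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]

xs = [1, 2, 3]
combined = add_grams(gram(k_s, xs), gram(k_i, xs))
print(combined)  # [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 4.0]]
```

The combined matrix could then be passed to any SVM implementation that accepts a precomputed kernel.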


Limitation

The limitation of the cross-pair similarity measure is that placeholders do not convey the semantic knowledge needed in cases where the semantic relation between connected verbs is essential.


Adding semantic information

Defining anchor types: a valuable source of relation types among words is WordNet.

similar: when words are similar according to the WordNet similarity measure; it captures synonymy and hypernymy.

surface matching: when words or lemmas match; it captures semantically equivalent words.


Augmenting placeholders with anchor types (1/2)

typed anchor model (ta): anchor types augment only the pre-terminal nodes of the syntactic tree.

propagated typed anchor model (tap): anchors climb up the syntactic tree according to specific climbing-up rules, similarly to what is done for placeholders. Climbing-up rules: constituent nodes in the syntactic trees take the placeholder of their semantic heads.


propagated typed anchor model (tap)

If two fragments have the same syntactic structure S(NP, VP(AUX, NP)), and there is a semantic equivalence (=) on all constituents, then entailment holds.


Augmenting placeholders with anchor types (2/2)

New rule: if two typed anchors climb up to the same node, give precedence to the one with the highest ranking in the ordered set of types.


Experiment I

Data set: D1, T1 and D2, T2 are the development and test sets of the first (Dagan et al., 2005) and second (Bar Haim et al., 2006) challenges. Positive examples constitute 50% of the data. ALL is the union of D1, D2, and T1, which we also split 70%-30%. D2(50%)’ and D2(50%)’’ is a random split of D2 (a homogeneous split).

Tools: the Charniak parser (Charniak, 2000) and the morpha lemmatiser (Minnen et al., 2001) to carry out the syntactic and morphological analysis.


Results in Experiment I


Findings in Experiment I

The dramatic improvement observed in (Corley and Mihalcea, 2005) on the dataset “Train:D1-Test:T1” is given by the idf rather than by the use of the J&C similarity (second vs. third columns).

Our approach (last column) is significantly better than all the other methods, as it provides the best result for each combination of training and test sets.

By comparing the averages over all datasets, our system improves on all the other methods by at least 3 absolute percentage points.

The accuracy produced by Synt Trees with placeholders is higher than the one obtained with Only Synt Trees.


Experiment II

We compare our ta and tap approaches with the standard strategies for RTE: lexical overlap, syntactic matching, and entailment triggering.

Data set: D2, T2 (Bar Haim et al., 2006) of the RTE2 challenge.

Here we adopt 4-fold cross-validation.


Experiment II

Variables:

tree: the first algorithm.

lex: lexical overlap similarity (Corley and Mihalcea, 2005).

synt: syntactic matching. synt(T, H) is used to compute the score by comparing all the substructures of the dependency trees of T and H: synt(T, H) = KT(T, H)/|H|, where |H| is the number of subtrees in H.

lex+trig: SVO, which tests if T and H share a similar subject-verb-object construct; Apposition, which tests if H is a sentence headed by the verb to be and in T there is an apposition that states H; Anaphora, which tests if the SVO sentence in H has a similar wh-sentence in T and the wh-pronoun can be resolved in T with a word similar to the object or the subject of H.
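The normalization in synt(T, H) = KT(T, H)/|H| can be illustrated with a drastically simplified "kernel" that only counts shared constituent labels (the real baseline uses the full tree kernel over dependency trees); the point is that dividing by the size of H keeps the score comparable across hypotheses of different lengths.

```python
from collections import Counter

def nodes(t):
    """Collect constituent labels of a (label, children...) tuple tree."""
    out = [t[0]]
    for c in t[1:]:
        if isinstance(c, tuple):
            out.extend(nodes(c))
    return out

def synt(T, H):
    """Toy syntactic match score, normalized by the size of H."""
    common = Counter(nodes(T)) & Counter(nodes(H))
    return sum(common.values()) / len(nodes(H))

T = ("S", ("NP", "companies"),
          ("VP", ("V", "pay"), ("NP", "dividends")))
H = ("S", ("NP", "companies"), ("ADVP", "quickly"))
print(round(synt(T, H), 2))  # 0.67: 2 of H's 3 constituents match
```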


Results in Experiment II


Findings in Experiment II (1/2)

Syntax structure: this demonstrates that syntax alone is not enough, and that lexical-semantic knowledge, in particular the explicit representation of word-level relations, plays a key role in RTE (improvements of +1.12% and +4.19%).


Findings in Experiment II (2/2)

Also, tap outperforms lex, supporting a complementary conclusion: lexical-semantic knowledge alone does not cover the entailment phenomenon, but needs some syntactic evidence.

The use of cross-pair similarity together with lexical overlap (lex + tree) is successful, as accuracy improves by +1.87% and +2.33% over the related basic methods (respectively lex and tree).



Conclusions

We have presented a model for the automatic learning of rewrite rules for textual entailment from examples.

For this purpose, we devised a novel powerful kernel based on cross-pair similarities.

The model effectively integrates semantic knowledge into textual entailment recognition.

We experimented with such kernel using Support Vector Machines on the RTE test sets.


More information

TREE KERNELS IN SVM-LIGHT, which implements the first algorithm, has been released on the web.