Representing Meaning with a Combination of Logical and Distributional Models

I. Beltagy∗ The University of Texas at Austin

Stephen Roller∗ The University of Texas at Austin

Pengxiang Cheng∗ The University of Texas at Austin

Katrin Erk∗∗ The University of Texas at Austin

Raymond J. Mooney∗ The University of Texas at Austin

NLP tasks differ in the semantic information they require, and at this time no single semantic representation fulfills all requirements. Logic-based representations characterize sentence structure, but do not capture the graded aspect of meaning. Distributional models give graded similarity ratings for words and phrases, but do not capture sentence structure in the same detail as logic-based approaches. It has therefore been argued that the two are complementary.

We adopt a hybrid approach that combines logical and distributional semantics using probabilistic logic, specifically Markov Logic Networks. In this article, we focus on the three components of a practical system:1 1) Logical representation focuses on representing the input problems in probabilistic logic; 2) knowledge base construction creates weighted inference rules by integrating distributional information with other sources; and 3) probabilistic inference involves solving the resulting MLN inference problems efficiently. To evaluate our approach, we use the task of textual entailment, which can utilize the strengths of both logic-based and distributional representations. In particular we focus on the SICK data set, where we achieve state-of-the-art results. We also release a lexical entailment data set of 10,213 rules extracted from the SICK data set, which is a valuable resource for evaluating lexical entailment systems.2

∗ Computer Science Department, The University of Texas at Austin. E-mail: {beltagy, roller, pxcheng}@cs.utexas.edu.

∗∗ Linguistics Department, The University of Texas at Austin. E-mail: [email protected].

1 System is available for download at: https://github.com/ibeltagy/pl-semantics.
2 Available at: https://github.com/ibeltagy/rrr.

Submission received: 10 April 2016; accepted for publication: 8 May 2016.

doi:10.1162/COLI_a_00266

© 2017 Association for Computational Linguistics

1. Introduction

Computational semantics studies mechanisms for encoding the meaning of natural language in a machine-friendly representation that supports automated reasoning and that, ideally, can be automatically acquired from large text corpora. Effective semantic representations and reasoning tools give computers the power to perform complex applications like question answering. But applications of computational semantics are very diverse and pose differing requirements on the underlying representational formalism. Some applications benefit from a detailed representation of the structure of complex sentences. Some applications require the ability to recognize near-paraphrases or degrees of similarity between sentences. Some applications require inference, either exact or approximate. Often, it is necessary to handle ambiguity and vagueness in meaning. Finally, we frequently want to learn knowledge relevant to these applications automatically from corpus data.

There is no single representation for natural language meaning at this time that fulfills all of these requirements, but there are representations that fulfill some of them. Logic-based representations (Montague 1970; Dowty, Wall, and Peters 1981; Kamp and Reyle 1993), like first-order logic, represent many linguistic phenomena like negation, quantifiers, or discourse entities. Some of these phenomena (especially negation scope and discourse entities over paragraphs) cannot be easily represented in syntax-based representations like Natural Logic (MacCartney and Manning 2009). In addition, first-order logic has standardized inference mechanisms. Consequently, logical approaches have been widely used in semantic parsing where it supports answering complex natural language queries requiring reasoning and data aggregation (Zelle and Mooney 1996; Kwiatkowski et al. 2013; Pasupat and Liang 2015). But logic-based representations often rely on manually constructed dictionaries for lexical semantics, which can result in coverage problems. And first-order logic, being binary in nature, does not capture the graded aspect of meaning (although there are combinations of logic and probabilities). Distributional models (Turney and Pantel 2010) use contextual similarity to predict the graded semantic similarity of words and phrases (Landauer and Dumais 1997; Mitchell and Lapata 2010), and to model polysemy (Schütze 1998; Erk and Padó 2008; Thater, Fürstenau, and Pinkal 2010). But at this point, fully representing structure and logical form using distributional models of phrases and sentences is still an open problem. Also, current distributional representations do not support logical inference that captures the semantics of negation, logical connectives, and quantifiers. Therefore, distributional models and logical representations of natural language meaning are complementary in their strengths, as has frequently been remarked (Coecke, Sadrzadeh, and Clark 2011; Garrette, Erk, and Mooney 2011; Grefenstette and Sadrzadeh 2011; Baroni, Bernardi, and Zamparelli 2014).

Our aim has been to construct a general-purpose natural language understanding system that provides in-depth representations of sentence meaning amenable to automated inference, but that also allows for flexible and graded inferences involving word meaning. Therefore, our approach combines logical and distributional methods. Specifically, we use first-order logic as a basic representation, providing a sentence representation that can be easily interpreted and manipulated. However, we also use distributional information for a more graded representation of words and short phrases, providing information on near-synonymy and lexical entailment. Uncertainty and gradedness at the lexical and phrasal level should inform inference at all levels, so we rely on probabilistic inference to integrate logical and distributional semantics. Thus, our system has three main components, all of which present interesting challenges.

For logic-based semantics, one of the challenges is to adapt the representation to the assumptions of the probabilistic logic (Beltagy and Erk 2015). For distributional lexical and phrasal semantics, one challenge is to obtain appropriate weights for inference rules (Roller, Erk, and Boleda 2014). In probabilistic inference, the core challenge is formulating the problems to allow for efficient Markov Logic Network (MLN) inference (Beltagy and Mooney 2014).

Our approach has previously been described in Garrette, Erk, and Mooney (2011) and Beltagy et al. (2013). We have demonstrated the generality of the system by applying it to both textual entailment (RTE-1 in Beltagy et al. [2013], SICK [preliminary results] and FraCas in Beltagy and Erk [2015]) and semantic textual similarity (Beltagy, Erk, and Mooney 2014), and we are investigating applications to question answering. We have demonstrated the modularity of the system by testing both MLNs (Richardson and Domingos 2006) and Probabilistic Soft Logic (Broecheler, Mihalkova, and Getoor 2010) as probabilistic inference engines (Beltagy et al. 2013; Beltagy, Erk, and Mooney 2014).

The primary aim of the current article is to describe our complete system in detail—all the nuts and bolts necessary to bring together the three distinct components of our approach—and to showcase some of the difficult problems that we face in all three areas, along with our current solutions.

The secondary aim of this article is to show that it is possible to take this general approach and apply it to a specific task—here, textual entailment (Dagan et al. 2013)—adding task-specific aspects to the general framework in such a way that the model achieves state-of-the-art performance. We chose the task of textual entailment because it utilizes the strengths of both logical and distributional representations. We specifically use the SICK data set (Marelli et al. 2014b) because it was designed to focus on lexical knowledge rather than world knowledge, matching the focus of our system.

Our system is flexible with respect to the sources of lexical and phrasal knowledge it uses, and in this article we utilize PPDB (Ganitkevitch, Van Durme, and Callison-Burch 2013) and WordNet, along with distributional models. But we are specifically interested in distributional models, in particular, in how well they can predict lexical and phrasal entailment. Our system provides a unique framework for evaluating distributional models on recognizing textual entailment (RTE) because the overall sentence representation is handled by the logic, so we can zoom in on the performance of distributional models at predicting lexical (Geffet and Dagan 2005) and phrasal entailment. The evaluation of distributional models on RTE is the third aim of our article. We build a lexical entailment classifier that exploits both task-specific features as well as distributional information, and present an in-depth evaluation of the distributional components.

We now provide a brief sketch of our framework (Garrette, Erk, and Mooney 2011; Beltagy et al. 2013). Our framework has three components. The first is the logical form, which is the primary meaning representation for a sentence. The second is the distributional information, which is encoded in the form of weighted logical rules (first-order formulas). For example, in its simplest form, our approach can use the distributional similarity of the words grumpy and sad as the weight on a rule that says if x is grumpy, then there is a chance that x is also sad:

∀x. grumpy(x) → sad(x) | f(sim(~grumpy, ~sad))

where ~grumpy and ~sad are the vector representations of the words grumpy and sad, sim is a distributional similarity measure, like cosine, and f is a function that maps the similarity score to an MLN weight. A more principled, and in fact, superior, choice is to use an asymmetric similarity measure to compute the weight, as we discuss subsequently.
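To make the mapping concrete, here is a minimal Python sketch of this idea. The toy vectors and the choice of a log-odds mapping for f are illustrative assumptions, not the function used in our system (which, as noted, ultimately relies on an asymmetric measure and a trained classifier).

```python
import numpy as np

def cosine(u, v):
    """Symmetric distributional similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_to_mln_weight(sim, prior=0.5, eps=1e-6):
    """Map a similarity score to an MLN rule weight via a log-odds transform.

    This particular choice of f is illustrative only: sim = prior gives weight 0
    (no effect), and higher similarity gives a larger positive weight.
    """
    sim = min(max(sim, eps), 1.0 - eps)
    return float(np.log(sim / (1.0 - sim)) - np.log(prior / (1.0 - prior)))

# Hypothetical vectors standing in for distributional representations.
grumpy = np.array([0.9, 0.1, 0.4, 0.3])
sad = np.array([0.8, 0.2, 0.5, 0.2])

w = similarity_to_mln_weight(cosine(grumpy, sad))
print(f"forall x. grumpy(x) -> sad(x) | {w:.2f}")
```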

The third component is inference. We draw inferences over the weighted rules using MLNs (Richardson and Domingos 2006), a Statistical Relational Learning technique (Getoor and Taskar 2007) that combines logical and statistical knowledge in one uniform framework, and provides a mechanism for coherent probabilistic inference. MLNs represent uncertainty in terms of weights on the logical rules, as in this example:

∀x. ogre(x)⇒ grumpy(x) | 1.5

∀x, y. (friend(x, y) ∧ ogre(x)) ⇒ ogre(y) | 1.1    (1)

which states that there is a chance that ogres are grumpy, and friends of ogres tend to be ogres too. Markov logic uses such weighted rules to derive a probability distribution over possible worlds through an undirected graphical model. This probability distribution over possible worlds is then used to draw inferences.

We publish a data set of the lexical and phrasal rules that our system queries when running on SICK, along with gold standard annotations. The training and testing sets are extracted from the SICK training and testing sets, respectively. The total number of rules (training + testing) is 12,510, of which 10,211 are unique: 3,106 entailing rules, 177 contradictions, and 6,928 neutral. This is a valuable resource for testing lexical entailment systems, containing a variety of entailment relations (hypernymy, synonymy, antonymy, etc.) that are actually useful in an end-to-end RTE system.

In addition to providing further details on the approach introduced in Garrette, Erk, and Mooney (2011) and Beltagy et al. (2013) (including improvements that increase the scalability of MLN inference [Beltagy and Mooney 2014] and adapt logical constructs for probabilistic inference [Beltagy and Erk 2015]), this article makes the following new contributions:

• We show how to represent the RTE task as an inference problem in probabilistic logic (Sections 4.1, 4.2), arguing for the use of a closed-world assumption (Section 4.3).

• Contradictory RTE sentence pairs are often only contradictory given some assumption about entity coreference. For example, An ogre is not snoring and An ogre is snoring are not contradictory unless we assume that the two ogres are the same. Handling such coreferences is important to detecting many cases of contradiction (Section 4.4).

• We use multiple parses to reduce the impact of misparsing (Section 4.5).

• In addition to distributional rules, we add rules from existing databases, in particular WordNet (Princeton University 2010) and the paraphrase collection PPDB (Ganitkevitch, Van Durme, and Callison-Burch 2013) (Section 5.3).

• We provide a logic-based alignment to guide generation of distributional rules (Section 5.1).

• We provide a data set of all lexical and phrasal rules needed for the SICK data set (10,211 rules). This is a valuable resource for testing lexical entailment systems on entailment relations that are actually useful in an end-to-end RTE system (Section 5.1).

• We evaluate a state-of-the-art compositional distributional approach (Paperno, Pham, and Baroni 2014) on the task of phrasal entailment (Section 5.2.5).

• We propose a simple weight learning approach to map rule weights to MLN weights (Section 6.3).

• The question “Do supervised distributional methods really learn lexical inference relations?” (Levy et al. 2015) has been studied before on a variety of lexical entailment data sets. For the first time, we study it on data from an actual RTE data set and show that distributional information is useful for lexical entailment (Section 7.1).

• Marelli et al. (2014a) report that for the SICK data set used in SemEval 2014, the best result was achieved by systems that did not compute a sentence representation in a compositional manner. We present a model that performs deep compositional semantic analysis and achieves state-of-the-art performance (Section 7.2).

2. Background

Logical Semantics. Logical representations of meaning have a long tradition in linguistic semantics (Montague 1970; Dowty, Wall, and Peters 1981; Alshawi 1992; Kamp and Reyle 1993) and computational semantics (Blackburn and Bos 2005; van Eijck and Unger 2010), and are commonly used in semantic parsing (Zelle and Mooney 1996; Berant et al. 2013; Kwiatkowski et al. 2013). They handle many complex semantic phenomena, such as negation and quantifiers, and they identify discourse referents along with the predicates that apply to them and the relations that hold between them. However, standard first-order logic and theorem provers are binary in nature, which prevents them from capturing the graded aspects of meaning in language: Synonymy seems to come in degrees (Edmonds and Hirst 2000), as does the difference between senses in polysemous words (Brown 2008). van Eijck and Lappin (2012) write: “The case for abandoning the categorical view of competence and adopting a probabilistic model is at least as strong in semantics as it is in syntax.”

Recent wide-coverage tools that use logic-based sentence representations include Copestake and Flickinger (2000), Bos (2008), and Lewis and Steedman (2013). We use Boxer (Bos 2008), a wide-coverage semantic analysis tool that produces logical forms, using Discourse Representation Structures (Kamp and Reyle 1993). It builds on the C&C CCG (Combinatory Categorial Grammar) parser (Clark and Curran 2004) and maps sentences into a lexically based logical form, in which the predicates are mostly words in the sentence. For example, the sentence An ogre loves a princess is mapped to:

∃x, y, z. ogre(x) ∧ agent(y, x) ∧ love(y) ∧ patient(y, z) ∧ princess(z) (2)

As can be seen, Boxer uses a neo-Davidsonian framework (Parsons 1990): y is an event variable, and the semantic roles agent and patient are turned into predicates linking y to the agent x and patient z.

As we discuss later, we combine Boxer’s logical form with weighted rules and perform probabilistic inference. Lewis and Steedman (2013) also integrate logical and distributional approaches, but use distributional information to create predicates for a standard binary logic and do not use probabilistic inference. Much earlier, Hobbs et al. (1988) combined logical form with weights in an abductive framework. There, the aim was to model the interpretation of a passage as its best possible explanation.

Distributional Semantics. Distributional models (Turney and Pantel 2010) use statistics on contextual data from large corpora to predict semantic similarity of words and phrases (Landauer and Dumais 1997; Mitchell and Lapata 2010). They are motivated by the observation that semantically similar words occur in similar contexts, so words can be represented as vectors in high dimensional spaces generated from the contexts in which they occur (Lund and Burgess 1996; Landauer and Dumais 1997). Therefore, distributional models are relatively easier to build than logical representations, automatically acquire knowledge from “big data,” and capture the graded nature of linguistic meaning, but they do not adequately capture logical structure (Grefenstette 2013).

Distributional models have also been extended to compute vector representations for larger phrases, for example, by adding the vectors for the individual words (Landauer and Dumais 1997) or by a component-wise product of word vectors (Mitchell and Lapata 2008, 2010), or through more complex methods that compute phrase vectors from word vectors and tensors (Baroni and Zamparelli 2010; Grefenstette and Sadrzadeh 2011).
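As a minimal illustration of the two simplest composition operators just mentioned (vector addition and component-wise multiplication), with made-up toy vectors:

```python
import numpy as np

# Hypothetical word vectors; in practice these come from a distributional model.
vec = {
    "grumpy": np.array([0.9, 0.1, 0.4, 0.3]),
    "ogre":   np.array([0.2, 0.8, 0.5, 0.1]),
}

def compose_additive(*vectors):
    """Phrase vector as the sum of the word vectors."""
    return np.sum(vectors, axis=0)

def compose_multiplicative(*vectors):
    """Phrase vector as the component-wise product of the word vectors."""
    return np.prod(vectors, axis=0)

print(compose_additive(vec["grumpy"], vec["ogre"]))
print(compose_multiplicative(vec["grumpy"], vec["ogre"]))
```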

Integrating Logic-Based and Distributional Semantics. It does not seem particularly useful at this point to speculate about phenomena that either a distributional approach or a logic-based approach would not be able to handle in principle, as both frameworks are continually evolving. However, logical and distributional approaches clearly differ in the strengths that they currently possess (Coecke, Sadrzadeh, and Clark 2011; Garrette, Erk, and Mooney 2011; Baroni, Bernardi, and Zamparelli 2014). Logical form excels at in-depth representations of sentence structure and provides an explicit representation of discourse referents. Distributional approaches are particularly good at representing the meaning of words and short phrases in a way that allows for modeling degrees of similarity and entailment and for modeling word meaning in context. This suggests that it may be useful to combine the two frameworks.

Another argument for combining both representations is that it makes sense from a theoretical point of view to address meaning, a complex and multifaceted phenomenon, through a combination of representations. Meaning is about truth, and logical approaches with a model-theoretic semantics nicely address this facet of meaning. Meaning is also about a community of speakers and how they use language, and distributional models aggregate observed uses from many speakers.

There are few hybrid systems that integrate logical and distributional information,and we discuss some of them here.

Beltagy et al. (2013) transform distributional similarity to weighted distributional inference rules that are combined with logic-based sentence representations, and use probabilistic inference over both. This is the approach that we build on in this article. Lewis and Steedman (2013), on the other hand, use clustering on distributional data to infer word senses, and perform standard first-order inference on the resulting logical forms. The main difference between the two approaches lies in the role of gradience. Lewis and Steedman view weights and probabilities as a problem to be avoided. We believe that the uncertainty inherent in both language processing and world knowledge should be front and center in all inferential processes. Tian, Miyao, and Takuya (2014) represent sentences using Dependency-based Compositional Semantics (Liang, Jordan, and Klein 2011). They construct phrasal entailment rules based on a logic-based alignment, and use distributional similarity of aligned words to filter rules that do not surpass a given threshold.

Also related are distributional models where the dimensions of the vectors encode model-theoretic structures rather than observed co-occurrences (Clark 2012; Grefenstette 2013; Sadrzadeh, Clark, and Coecke 2013; Herbelot and Vecchi 2015), even though they are not strictly hybrid systems as they do not include contextual distributional information. Grefenstette (2013) represents logical constructs using vectors and tensors, but concludes that they do not adequately capture logical structure, in particular, quantifiers.

If, like Andrews, Vigliocco, and Vinson (2009), Silberer and Lapata (2012), and Bruni et al. (2012) (among others), we also consider perceptual context as part of distributional models, then Cooper et al. (2015) also qualifies as a hybrid logical/distributional approach. They envision a classifier that labels feature-based representations of situations (which can be viewed as perceptual distributional representations) as having a certain probability of making a proposition true, for example smile(Sandy). These propositions function as types of situations in a type-theoretic semantics.

Probabilistic Logic with Markov Logic Networks. To combine logical and probabilistic information, we utilize MLNs (Richardson and Domingos 2006). MLNs are well suited for our approach because they provide an elegant framework for assigning weights to first-order logical rules, combining a diverse set of inference rules and performing sound probabilistic inference.

A weighted rule allows truth assignments in which not all instances of the rule hold. Equation 1 above shows sample weighted rules: Friends of ogres tend to be ogres and ogres tend to be grumpy. Suppose we have two constants, Anna (A) and Bob (B). Using these two constants and the predicate symbols in Equation 1, the set of all ground atoms we can construct is:

LA,B = {ogre(A), ogre(B), grumpy(A), grumpy(B), friend(A, A), friend(A, B), friend(B, A), friend(B, B)}

If we only consider models over a domain with these two constants as entities, then each truth assignment to LA,B corresponds to a model. MLNs make the assumption of a one-to-one correspondence between constants in the system and entities in the domain. We discuss the effects of this domain closure assumption below.
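The construction of this ground atom set can be sketched as follows; the predicate arities are an assumption made explicit here, since the text only lists the predicate symbols.

```python
from itertools import product

# Predicate symbols from Equation (1) with their arities, and the two constants.
predicates = {"ogre": 1, "grumpy": 1, "friend": 2}
constants = ["A", "B"]  # Anna and Bob

def ground_atoms(predicates, constants):
    """Enumerate every ground atom obtainable from the predicates and constants."""
    atoms = []
    for pred, arity in predicates.items():
        for args in product(constants, repeat=arity):
            atoms.append(f"{pred}({', '.join(args)})")
    return atoms

print(ground_atoms(predicates, constants))
# ['ogre(A)', 'ogre(B)', 'grumpy(A)', 'grumpy(B)',
#  'friend(A, A)', 'friend(A, B)', 'friend(B, A)', 'friend(B, B)']
```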

Markov Networks or undirected graphical models (Pearl 1988) compute the probability P(X = x) of an assignment x of values to the sequence X of all variables in the model based on clique potentials, where a clique potential is a function that assigns a value to each clique (maximally connected subgraph) in the graph. Markov Logic Networks construct Markov Networks (hence their name) based on weighted first-order logic formulas, like the ones in Equation (1). Figure 1 shows the network for Equation (1) with two constants. Every ground atom becomes a node in the graph, and two nodes are connected if they co-occur in a grounding of an input formula. In this graph, each clique corresponds to a grounding of a rule. For example, the clique including friend(A, B), ogre(A), and ogre(B) corresponds to the ground rule friend(A, B) ∧ ogre(A) ⇒ ogre(B). A variable assignment x in this graph assigns to each node a value of either True or False, so it is a truth assignment (a world). The clique potential for the clique involving friend(A, B), ogre(A), and ogre(B) is exp(1.1) if x makes the ground rule true, and 1 otherwise. This allows for nonzero probability for worlds x in which not all friends of ogres are also ogres, but it assigns exponentially more probability to a world for each ground rule that it satisfies.

Figure 1: A sample ground network for a Markov Logic Network, with nodes ogre(A), ogre(B), grumpy(A), grumpy(B), friend(A, A), friend(A, B), friend(B, A), and friend(B, B).
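The cliques in Figure 1 come from grounding each first-order rule over the constant set. A small sketch of that grounding step, for the second rule of Equation (1):

```python
from itertools import product

constants = ["A", "B"]

def ground_rule(variables, make_clause):
    """Ground a first-order rule by substituting every combination of constants
    for its variables; `make_clause` renders one ground clause."""
    return [make_clause(dict(zip(variables, combo)))
            for combo in product(constants, repeat=len(variables))]

# Groundings of: forall x,y. (friend(x,y) AND ogre(x)) => ogre(y) | 1.1
for clause in ground_rule(
        ["x", "y"],
        lambda s: f"friend({s['x']}, {s['y']}) ∧ ogre({s['x']}) ⇒ ogre({s['y']})"):
    print(clause)  # four ground clauses; each corresponds to a clique in Figure 1
```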

More generally, an MLN takes as input a set of weighted first-order formulas F = F1, . . . , Fn and a set C of constants, and constructs an undirected graphical model in which the set of nodes is the set of ground atoms constructed from F and C. It computes the probability distribution P(X = x) over worlds based on this undirected graphical model. The probability of a world (a truth assignment) x is defined as:

P(X = x) = (1/Z) exp( ∑i wi ni(x) )    (3)

where i ranges over all formulas Fi in F, wi is the weight of Fi, ni(x) is the number of groundings of Fi that are true in the world x, and Z is the partition function (i.e., it normalizes the values to probabilities). So the probability of a world increases exponentially with the total weight of the ground clauses that it satisfies.
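Equation (3) can be illustrated on a toy example; the following sketch enumerates all worlds over two ground atoms and a single weighted ground rule (atoms and weight chosen purely for illustration) and computes their probabilities:

```python
from itertools import product
import math

# Two ground atoms and one weighted ground rule: ogre(A) => grumpy(A) with weight 1.5.
atoms = ["ogre(A)", "grumpy(A)"]
rules = [
    # (weight, function counting how many groundings of the rule hold in world w)
    (1.5, lambda w: 1 if (not w["ogre(A)"]) or w["grumpy(A)"] else 0),
]

def world_score(w):
    """exp of the total weight of satisfied ground rules, i.e. exp(sum_i w_i n_i(x))."""
    return math.exp(sum(weight * n(w) for weight, n in rules))

worlds = [dict(zip(atoms, vals)) for vals in product([False, True], repeat=len(atoms))]
Z = sum(world_score(w) for w in worlds)  # partition function

for w in worlds:
    print(w, round(world_score(w) / Z, 3))
# The world violating the rule (ogre(A) true, grumpy(A) false) keeps a nonzero
# but exponentially smaller probability than the worlds satisfying it.
```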

In this article, we use R (for rules) to denote the input set of weighted formulas. In addition, an MLN takes as input an evidence set E asserting truth values for some ground clauses. For example, ogre(A) means that Anna is an ogre. Marginal inference for MLNs calculates the probability P(Q|E, R) for a query formula Q.

Alchemy (Kok et al. 2005) is the most widely used MLN implementation. It is a software package that contains implementations of a variety of MLN inference and learning algorithms. However, developing a scalable, general-purpose, accurate inference method for complex MLNs is an open problem. MLNs have been used for various NLP applications, including unsupervised coreference resolution (Poon and Domingos 2008), semantic role labeling (Riedel and Meza-Ruiz 2008), and event extraction (Riedel et al. 2009).

Recognizing Textual Entailment. The task that we focus on in this article is RTE (Dagan et al. 2013), the task of determining whether one natural language text, the Text T, entails, contradicts, or is not related (neutral) to another, the Hypothesis H. “Entailment” here does not mean logical entailment: The Hypothesis is entailed if a human annotator judges that it plausibly follows from the Text. When using naturally occurring sentences, this is a very challenging task that should be able to utilize the unique strengths of both logic-based and distributional semantics. Here are examples from the SICK data set (Marelli et al. 2014b):

• Entailment

T: A man and a woman are walking together through the woods.

H: A man and a woman are walking through a wooded area.

• Contradiction

T: Nobody is playing the guitar

H: A man is playing the guitar

• Neutral

T: A young girl is dancing

H: A young girl is standing on one leg

The SICK (“Sentences Involving Compositional Knowledge”) data set, which we use for evaluation in this article, was designed to foreground particular linguistic phenomena but to eliminate the need for world knowledge beyond linguistic knowledge. It was constructed from sentences from two image description data sets, ImageFlickr3 and the SemEval 2012 STS MSR-Video Description data.4 Randomly selected sentences from these two sources were first simplified to remove some linguistic phenomena that the data set was not aiming to cover. Then, additional sentences were created as variations over these sentences, by paraphrasing, negation, and reordering. RTE pairs were then created that consisted of a simplified original sentence paired with one of the transformed sentences (generated from either the same or a different original sentence).

We would like to mention two particular systems that were evaluated on SICK. The first is Lai and Hockenmaier (2014), which was the top-performing system at the original shared task. It uses a linear classifier with many hand-crafted features, including alignments, word forms, POS tags, distributional similarity, WordNet, and a unique feature called Denotational Similarity. Many of these hand-crafted features are later incorporated in our lexical entailment classifier, described in Section 5.2. The Denotational Similarity uses a large database of human- and machine-generated image captions to cleverly capture some world knowledge of entailments.

The second system is Bjerva et al. (2014), which also participated in the original SICK shared task, and achieved 81.6% accuracy. The RTE system uses Boxer to parse input sentences to logical form, then uses a theorem prover and a model builder to check for entailment and contradiction. The knowledge bases used are WordNet and PPDB. In contrast with our work, PPDB paraphrases are not translated to logical rules (Section 5.3). Instead, in case a PPDB paraphrase rule applies to a pair of sentences, the rule is applied at the text level before parsing the sentence. Theorem provers and model builders have high precision detecting entailments and contradictions, but low recall. To improve recall, neutral pairs are reclassified using a set of textual, syntactic, and semantic features.

3 http://nlp.cs.illinois.edu/HockenmaierGroup/data.html.
4 http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data.

3. System Overview

This section provides an overview of our system’s architecture, using the following RTE example to demonstrate the role of each component:

T: A grumpy ogre is not smiling.

H: A monster with a bad temper is not laughing.

Which in logic are:

T: ∃x. ogre(x) ∧ grumpy(x) ∧ ¬∃y. agent(y, x) ∧ smile(y)

H: ∃x, y. monster(x) ∧ with(x, y) ∧ bad(y) ∧ temper(y) ∧ ¬∃z. agent(z, x) ∧ laugh(z).

This example needs the following rules in the knowledge base KB:

r1: laugh ⇒ smile

r2: ogre ⇒ monster

r3: grumpy ⇒ with a bad temper

Figure 2 shows the high-level architecture of our system, and Figure 3 shows the MLNs constructed by our system for the given RTE example.

Figure 2: System architecture.

Figure 3: MLNs for the given RTE example. The RTE task is represented as two inferences P(H|T, KB, WT,H) and P(¬H|T, KB, WT,¬H) (Section 4.1). D is the set of constants in the domain. T and r3 are skolemized and sk is the skolem function of r3 (Section 4.2). G is the set of non-False (True or unknown) ground atoms as determined by the CWA (Sections 4.3, 6.2). A is the CWA for the negated part of H (Section 4.3). D, G, A are the world assumptions WT,H (or WT,¬H). r1, r2, r3 are the KB. r1 and its weight w1 are from PPDB (Section 5.3). r2 is from WordNet (Section 5.3). r3 is constructed using the Modified Robinson Resolution (Section 5.1), and its weight w3 is calculated using the entailment rules classifier (Section 5.2). The resource-specific weights wppdb, weclassif are learned using weight learning (Section 6.3). Finally, the two probabilities are calculated using MLN inference where H (or ¬H) is the query formula (Section 6.1).

Our system has three main components:

1. Logical Representation (Section 4), where input natural sentences T and H are mapped into logic and then used to represent the RTE task as a probabilistic inference problem.

2. Knowledge Base Construction KB (Section 5), where the background knowledge is collected from different sources, encoded as first-order logic rules, and weighted and added to the inference problem. This is where distributional information is integrated into our system.

3. Inference (Section 6), which uses MLNs to solve the resulting inference problem.

One powerful advantage of using a general-purpose probabilistic logic as a semantic representation is that it allows for a highly modular system. Therefore, the most recent advancements in any of the system components, in parsing, in knowledge base resources and distributional semantics, and in inference algorithms, can be easily incorporated into the system.

In the Logical Representation step (Section 4), we map input sentences T and H to logic. Then, we show how to map the three-way RTE classification (entailing, neutral, or contradicting) to probabilistic inference problems. The mapping of sentences to logic differs from standard first-order logic in several respects because of properties of the probabilistic inference system. First, MLNs make the Domain Closure Assumption (DCA), which states that there are no objects in the universe other than the named constants (Richardson and Domingos 2006). This means that constants need to be explicitly introduced in the domain in order to make probabilistic logic produce the expected inferences. Another representational issue that we discuss is why we should make the closed-world assumption, and its implications on the task representation.

In the Knowledge Base Construction step KB (Section 5), we collect inference rules from a variety of sources. We add rules from existing databases, in particular WordNet (Princeton University 2010) and PPDB (Ganitkevitch, Van Durme, and Callison-Burch 2013). To integrate distributional semantics, we use a variant of Robinson resolution to align the Text T and the Hypothesis H, and to find the difference between them, which we formulate as an entailment rule. We then train a lexical and phrasal entailment classifier to assess this rule. Ideally, rules need to be contextualized to handle polysemy, but we leave that to future work.

In the Inference step (Section 6), automated reasoning for MLNs is used to perform the RTE task. We implement an MLN inference algorithm that directly supports querying complex logical formulas, which is not supported in the available MLN tools (Beltagy and Mooney 2014). We exploit the closed-world assumption to help reduce the size of the inference problem in order to make it tractable (Beltagy and Mooney 2014). We also discuss weight learning for the rules in the knowledge base.

4. Logical Representation

The first component of our system parses sentences into logical form and uses this to represent the RTE problem as MLN inference. We start with Boxer (Bos 2008), a rule-based semantic analysis system that translates a CCG parse into a logical form. The formula

∃x, y, z. ogre(x) ∧ agent(y, x) ∧ love(y) ∧ patient(y, z) ∧ princess(z) (4)

is an example of Boxer producing discourse representation structures using a neo-Davidsonian framework. We call Boxer’s output alone an “uninterpreted logical form” because the predicate symbols are simply words and do not have meaning by themselves. Their semantics derives from the knowledge base KB we build in Section 5. The rest of this section discusses how we adapt Boxer output for MLN inference.

4.1 Representing Tasks as Text and Query

Representing Natural Language Understanding Tasks. In our framework, a language-understanding task consists of a text and a query, along with a knowledge base. The text describes some situation or setting, and the query in the simplest case asks whether a particular statement is true of the situation described in the text. The knowledge base encodes relevant background knowledge: lexical knowledge, world knowledge, or both. In the textual entailment task, the text is the Text T, and the query is the Hypothesis H. The sentence similarity (Semantic Textual Similarity; STS) task can be described as two text/query pairs. In the first pair, the first sentence is the text and the second is the query, and in the second pair the roles are reversed (Beltagy, Erk, and Mooney 2014). In question answering, the input documents constitute the text and the query has the form H(x) for a variable x; and the answer is the entity e such that H(e) has the highest probability given the information in T.

In this article, we focus on the simplest form of text/query inference, which applies to both RTE and STS: Given a text T and query H, does the text entail the query given the knowledge base KB? In standard logic, we determine entailment by checking whether T ∧ KB ⇒ H. (Unless we need to make the distinction explicitly, we overload notation and use the symbol T for the logical form computed for the text, and H for the logical form computed for the query.) The probabilistic version is to calculate the probability P(H|T, KB, WT,H), where WT,H is a world configuration, which includes the size of the domain. We discuss WT,H in Sections 4.2 and 4.3. Although we focus on the simplest form of text/query inference, more complex tasks such as question answering still have the probability P(H|T, KB, WT,H) as part of their calculations.

Representing Textual Entailment. RTE asks for a categorical decision between three categories: entailment, contradiction, and neutral. A decision about entailment can be made by learning a threshold on the probability P(H|T, KB, WT,H). To differentiate between contradiction and neutral, we additionally calculate the probability P(¬H|T, KB, WT,¬H). If P(H|T, KB, WT,H) is high and P(¬H|T, KB, WT,¬H) is low, this indicates entailment. The opposite case indicates contradiction. If the two probability values are close, this means T does not significantly affect the probability of H, indicating a neutral case. To learn the thresholds for these decisions, we train an SVM classifier with LibSVM’s default parameters (Chang and Lin 2001) to map the two probabilities to the final decision. The learned mapping is always simple and reflects the intuition described here.
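As an illustration of this final classification step, the sketch below trains an SVM on hypothetical (P(H|T, KB, WT,H), P(¬H|T, KB, WT,¬H)) pairs; scikit-learn's SVC, which wraps LibSVM, stands in for the LibSVM setup used in our experiments, and the numbers are invented for the example.

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC is backed by LibSVM; defaults used

# Hypothetical training data: each row is [P(H|T,KB), P(notH|T,KB)] for one T/H pair.
# Real values come from MLN inference.
X_train = np.array([
    [0.95, 0.03],   # high P(H), low P(notH)  -> entailment
    [0.04, 0.92],   # low P(H), high P(notH)  -> contradiction
    [0.30, 0.28],   # both similar            -> neutral
    [0.90, 0.10],
    [0.10, 0.88],
    [0.40, 0.35],
])
y_train = ["entail", "contradict", "neutral", "entail", "contradict", "neutral"]

clf = SVC()                 # default parameters, as in the article's setup
clf.fit(X_train, y_train)

print(clf.predict([[0.85, 0.05], [0.33, 0.31]]))  # e.g. ['entail' 'neutral']
```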

4.2 Using a Fixed Domain Size

MLNs compute a probability distribution over possible worlds, as described in Section 2. When we describe a task as a text T and a query H, the worlds over which the MLN computes a probability distribution are “mini-worlds,” just large enough to describe the situation or setting given by T. The probability P(H|T, KB, WT,H) then describes the probability that H would hold given the probability distribution over the worlds that possibly describe T.5 The use of “mini-worlds” is by necessity, as MLNs can only handle worlds with a fixed domain size, where “domain size” is the number of constants in the domain. (In fact, this same restriction holds for all current practical probabilistic inference methods, including probabilistic soft logic [Bach et al. 2013].)

5 Cooper et al. (2015) criticize probabilistic inference frameworks based on a probability distribution over worlds as not feasible. But what they mean by a world is a maximally consistent set of propositions. Because we use MLNs only to handle “mini-worlds” describing individual situations or settings, this criticism does not apply to our approach.

Formally, the influence of the set of constants on the worlds considered by an MLN can be described by the Domain Closure Assumption (DCA; Genesereth and Nilsson 1987; Richardson and Domingos 2006): The only models considered for a set F of formulas are those for which the following three conditions hold: (a) Different constants refer to different objects in the domain; (b) the only objects in the domain are those that can be represented using the constant and function symbols in F; and (c) for each function f appearing in F, the value of f applied to every possible tuple of arguments is known, and is a constant appearing in F. Together, these three conditions entail that there is a one-to-one relation between objects in the domain and the named constants of F. When the set of all constants is known, it can be used to ground predicates to generate the set of all ground atoms, which then become the nodes in the graphical model. Different constant sets result in different graphical models. If no constants are explicitly introduced, the graphical model is empty (no random variables).

This means that to obtain an adequate representation of an inference problem consisting of a text T and query H, we need to introduce a sufficient number of constants explicitly into the formula: The worlds that the MLN considers need to have enough constants to faithfully represent the situation in T and not give the wrong entailment for the query H. In what follows, we explain how we determine an appropriate set of constants for the logical-form representations of T and H. The domain size that we determine is one of the two components of the parameter WT,H.

Skolemization. We introduce some of the necessary constants through the well-known technique of skolemization (Skolem 1920). It transforms a formula ∀x1 . . . xn∃y.F to ∀x1 . . . xn.F∗, where F∗ is formed from F by replacing all free occurrences of y in F by a term f(x1, . . . , xn) for a new function symbol f. If n = 0, f is called a Skolem constant, otherwise a Skolem function. Although skolemization is a widely used technique in first-order logic, it is not frequently used in probabilistic logic because many applications do not require existential quantifiers.

We use skolemization on the text T (but not the query H, as we cannot assume a priori that it is true). For example, the logical expression in Equation (4), which represents the sentence T: An ogre loves a princess, will be skolemized to:

ogre(O) ∧ agent(L, O) ∧ love(L) ∧ patient(L, N) ∧ princess(N) (5)

where O, L, N are Skolem constants introduced into the domain.

Standard skolemization transforms existential quantifiers embedded under universal quantifiers to Skolem functions. For example, for the text T: All ogres snore and its logical form ∀x. ogre(x) ⇒ ∃y. agent(y, x) ∧ snore(y), the standard skolemization is ∀x. ogre(x) ⇒ agent(f(x), x) ∧ snore(f(x)). Per condition (c) of the DCA, if a Skolem function appeared in a formula, we would have to know its value for any constant in the domain, and this value would have to be another constant. To achieve this, we introduce a new predicate Skolemf instead of each Skolem function f, and for every constant that is an ogre, we add an extra constant that is a snoring event. The example then becomes:

T : ∀x. ogre(x)⇒ ∀y. Skolemf (x, y)⇒ agent(y, x) ∧ snore(y)

If the domain contains a single ogre O1, then we introduce a new constant C1 and an atom Skolemf(O1, C1) to state that the Skolem function f maps the constant O1 to the constant C1.
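The Skolem-predicate construction can be rendered schematically as follows; the function and the naming scheme are illustrative, not the system's actual implementation.

```python
import itertools

_fresh = itertools.count(1)

def ground_skolem_predicate(ogre_constants):
    """For each constant known to be an ogre, introduce a fresh constant for its
    snoring event and a Skolemf atom mapping the ogre to that event (an
    illustrative rendering of the Skolem-predicate trick described above)."""
    evidence = []
    for ogre in ogre_constants:
        event = f"C{next(_fresh)}"          # new constant, e.g. C1
        evidence.append(f"Skolemf({ogre}, {event})")
    return evidence

print(ground_skolem_predicate(["O1"]))   # ['Skolemf(O1, C1)']
```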

Existence. But how would the domain contain an ogre O1 in the case of the text T: All ogres snore, ∀x. ogre(x) ⇒ ∃y. agent(y, x) ∧ snore(y)? Skolemization does not introduce any variables for the universally quantified x. We still introduce a constant O1 that is an ogre. This can be justified by pragmatics because the sentence presupposes that there are, in fact, ogres (Strawson 1950; Geurts 2007). We use the sentence’s parse to identify the universal quantifier’s restrictor and body, then introduce entities representing the restrictor of the quantifier (Beltagy and Erk 2015). The sentence T: All ogres snore effectively changes to T: All ogres snore, and there is an ogre. At this point, skolemization takes over to generate a constant that is an ogre. Sentences like T: There are no ogres are a special case: For such sentences, we do not generate evidence of an ogre. In this case, the non-emptiness of the domain is not assumed because the sentence explicitly negates it.

Universal Quantifiers in the Query. The most serious problem with the DCA is that it affects the behavior of universal quantifiers in the query. Suppose we know that T: Shrek is a green ogre, represented with skolemization as ogre(SH) ∧ green(SH). Then we can conclude that H: All ogres are green, because by the DCA we are only considering models with this single constant, which we know is both an ogre and green. To address this problem, we again introduce new constants.

We want a query H: All ogres are green to be judged true iff there is evidence that all ogres will be green, no matter how many ogres there are in the domain. So H should follow from T2: All ogres are green but not from T1: There is a green ogre. Therefore we introduce a new constant D for the query and assert ogre(D) to test if we can then conclude that green(D). The new evidence ogre(D) prevents the query from being judged true given T1. Given T2, the new ogre D will be inferred to be green, in which case we take the query to be true. Again, with a query such as H: There are no ogres, we do not generate any evidence for the existence of an ogre.

4.3 Setting Prior Probabilities

Suppose we have an empty text T, and the query H: A is an ogre, where A is a constant in the system. Without any additional information, the worlds in which ogre(A) is true are going to be as likely as the worlds in which the ground atom is false, so ogre(A) will have a probability of 0.5. So without any text T, ground atoms have a prior probability in MLNs that is not zero. This prior probability depends mostly on the size of the set F of input formulas. The prior probability of an individual ground atom can be influenced by a weighted rule; for example, ogre(A) | −3, with a negative weight, sets a low prior probability on A being an ogre. This is the second group of parameters that we encode in WT,H: weights on ground atoms to be used to set prior probabilities.

Prior probabilities are problematic for our probabilistic encoding of natural language understanding problems. As a reminder, we probabilistically test for entailment by computing the probability of the query given the text, or, more precisely, P(H|T, KB, WT,H). However, how useful this conditional probability is as an indication of entailment depends on the prior probability of H, P(H|KB, WT,H). For example, if H has a high prior probability, then a high conditional probability P(H|T, KB, WT,H) does not add much information because it is not clear if the probability is high because T really entails H, or because of the high prior probability of H. In practical terms, we would not want to say that we can conclude from T: All princesses snore that H: There is an ogre just because of a high prior probability for the existence of ogres.

To solve this problem and make the probability P(H|T, KB, WT,H) less sensitive to P(H|KB, WT,H), we pick a particular WT,H such that the prior probability of H is approximately zero, P(H|KB, WT,H) ≈ 0, so that we know that any increase in the conditional probability is an effect of adding T. For the task of RTE, where we need to distinguish entailment, neutral, and contradiction, this inference alone does not account for contradictions, which is why an additional inference P(¬H|T, KB, WT,¬H) is needed.

For the rest of this section, we show how to set the world configurations WT,H such that P(H|KB, WT,H) ≈ 0 by enforcing the closed-world assumption (CWA). This is the assumption that all ground atoms have very low prior probability (or are false by default).

Using the CWA to Set the Prior Probability of the Query to Zero. The CWA is the assumption that everything is false unless stated otherwise. We translate it to our probabilistic setting as saying that all ground atoms have very low prior probability. For most queries H, setting the world configuration WT,H such that all ground atoms have low prior probability is enough to achieve that P(H|KB, WT,H) ≈ 0 (not for negated Hs, and this case is discussed subsequently). For example, H: An ogre loves a princess, in logic is:

H : ∃x, y, z. ogre(x) ∧ agent(y, x) ∧ love(y) ∧ patient(y, z) ∧ princess(z)

Having low prior probability on all ground atoms means that the prior probability of this existentially quantified H is close to zero.

We believe that this set-up is more appropriate for probabilistic natural language entailment for the following reasons. First, this aligns with our intuition of what it means for a query to follow from a text: that H should be entailed by T not because of general world knowledge. For example, if T: An ogre loves a princess, and H: Texas is in the USA, then although H is true in the real world, T does not entail H. Another example: T: An ogre loves a princess, H: An ogre loves a green princess, again, T does not entail H because there is no evidence that the princess is green, in other words, the ground atom green(N) has very low prior probability.

The second reason is that with the CWA, the inference result is less sensitive to the domain size (number of constants in the domain). In logical forms for typical natural language sentences, most variables in the query are existentially quantified. Without the CWA, the probability of an existentially quantified query increases as the domain size increases, regardless of the evidence. This makes sense in the MLN setting, because in larger domains the probability that something exists increases. However, this is not what we need for testing natural language queries, as the probability of the query should depend on T and KB, not the domain size. With the CWA, what affects the probability of H is the non-zero evidence that T provides and KB, regardless of the domain size.

The third reason is computational efficiency. As discussed in Section 2, Markov Logic Networks first compute all possible groundings of a given set of weighted formulas, which can require significant amounts of memory. This is particularly striking for problems in natural language semantics because of long formulas. Beltagy and Mooney (2014) show how to utilize the CWA to address this problem by reducing the number of ground atoms that the system generates. We discuss the details in Section 6.2.

Setting the Prior Probability of Negated H to Zero. Although using the CWA is enough to set P(H|KB, WT,H) ≈ 0 for most Hs, it does not work for negated H (negation is part of H). Assuming that everything is false by default and that all ground atoms have very low prior probability (CWA) means that all negated queries H are true by default. The result is that all negated H are judged entailed regardless of T. For example, T: An ogre loves a princess would entail H: No ogre snores. This H in logic is:

H : ∀x, y. ogre(x)⇒ ¬(agent(y, x) ∧ snore(y))

As both x and y are universally quantified variables in H, we generate evidence of an ogre ogre(O) as described in Section 4.2. Because of the CWA, O is assumed not to snore, and H ends up being true regardless of T.

To set the prior probability of H to ≈ 0 and prevent it from being assumed true when T is just uninformative, we construct a new rule A that implements a kind of anti-CWA. A is formed as a conjunction of all the predicates that were not used to generate evidence before, and are negated in H. This rule A gets a positive weight indicating that its ground atoms have high prior probability. As the rule A together with the evidence generated from H states the opposite of the negated parts of H, the prior probability of H is low, and H cannot become true unless T explicitly negates A. T is translated into unweighted rules, which are taken to have infinite weight, and which thus can overcome the finite positive weight of A. Here is a neutral RTE example, T: An ogre loves a princess, and H: No ogre snores. Their representations are:

T: ∃x, y, z. ogre(x) ∧ agent(y, x) ∧ love(y) ∧ patient(y, z) ∧ princess(z)

H: ∀x, y. ogre(x)⇒ ¬(agent(y, x) ∧ snore(y))

E: ogre(O)

A: agent(S, O) ∧ snore(S)|w = 1.5

E is the evidence generated for the universally quantified variables in H, and A is the weighted rule for the remaining negated predicates. The relation between T and H is neutral, as T does not entail H. This means we want P(H|T, KB, WT,H) ≈ 0, but because of the CWA, P(H|T, KB, WT,H) ≈ 1. Adding A solves this problem and P(H|T, A, KB, WT,H) ≈ 0 because H is not explicitly entailed by T.

In case H contains existentially quantified variables that occur in negated predicates, they need to be universally quantified in A for H to have a low prior probability. For example, H: There is an ogre that is not green:

H : ∃x. ogre(x) ∧ ¬green(x)

A : ∀x. green(x)|w = 1.5


If one variable is universally quantified and the other is existentially quantified, we need to do something more complex. Here is an example, H: An ogre does not snore:

H : ∃x. ogre(x) ∧ ¬( ∃y. agent(y, x) ∧ snore(y) )

A : ∀v. agent(S, v) ∧ snore(S)|w = 1.5

Notes About How Inference Proceeds with the Rule A Added. If H is a negated formula that is entailed by T, then T (which has infinite weight) will contradict A, allowing H to be true. Any weighted inference rules in the knowledge base KB will need weights high enough to overcome A. So the weight of A is taken into account when computing inference rule weights.

In addition, adding the rule A introduces constants in the domain that are necessary for making the inference. For example, take T: No monster snores, and H: No ogre snores, which in logic are:

T: ¬∃x, y.monster(x) ∧ agent(y, x) ∧ snore(y)

H: ¬∃x, y. ogre(x) ∧ agent(y, x) ∧ snore(y)

A: ogre(O) ∧ agent(S, O) ∧ snore(S)|w = 1.5

KB: ∀x. ogre(x)⇒ monster(x)

Without the constants O and S added by the rule A, the domain would have been empty and the inference output would have been wrong. The rule A prevents this problem. In addition, the introduced evidence in A fits the idea of “evidence propagation” mentioned earlier (detailed in Section 6.2). For entailing sentences that are negated, as in the example here, the evidence propagates from H to T (not from T to H as in non-negated examples). In the example, the rule A introduces evidence for ogre(O) that then propagates from the LHS to the RHS of the KB rule.

4.4 Textual Entailment and Coreference

The adaptations of logical form that we have discussed so far apply to any natural language understanding problem that can be formulated as text/query pairs. The adaptation that we discuss now is specific to textual entailment. It concerns coreference between text and query.

For example, if we have T: An ogre does not snore and H: An ogre snores, then strictly speaking T and H are not contradictory, because it is possible that the two sentences are referring to different ogres. Although the sentence uses an ogre rather than the ogre, the annotators make the assumption that the ogre in H refers to the ogre in T. In the SICK textual entailment data set, many of the pairs that annotators have labeled as contradictions are only contradictions if we assume that some expressions corefer across T and H.


For these examples, here are the logical formulas with coreference in the updated ¬H:

T : ∃x. ogre(x) ∧ ¬(∃y. agent(y, x) ∧ snore(y))

Skolemized T : ogre(O) ∧ ¬(∃y. agent(y, O) ∧ snore(y))

H : ∃x, y. ogre(x) ∧ agent(y, x) ∧ snore(y)

¬H : ¬∃x, y. ogre(x) ∧ agent(y, x) ∧ snore(y)

updated ¬H : ¬∃y. ogre(O) ∧ agent(y, O) ∧ snore(y)

Notice how the constant O representing the ogre in T is used in the updated ¬H instead of the quantified variable x.

We use a rule-based approach to determining coreference between T and H, considering both coreference between entities and coreference of events. Two items (entities or events) corefer if they (1) have different polarities, and (2) share the same lemma or share an inference rule. Two items have different polarities in T and H if one of them is embedded under a negation and the other is not. For the example here, ogre in T is not negated, ogre in ¬H is negated, and both words are the same, so they corefer.

A pair of items in T and H under different polarities can also corefer if they share an inference rule. In the example of T: A monster does not snore and H: An ogre snores, we need monster and ogre to corefer. For cases like this, we rely on the inference rules found using the modified Robinson resolution method discussed in Section 5.1. In this case, it determines that monster and ogre should be aligned, so they are marked as coreferring. Here is another example: T: An ogre loves a princess, H: An ogre hates a princess. In this case, loves and hates are marked as coreferring.
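As a concrete illustration, here is a minimal sketch of this coreference rule in Python, assuming items are given as (lemma, negated) pairs and that the inference rules of Section 5.1 are available as a set of lemma pairs; the names are hypothetical and this is not the system's actual implementation.

def corefer(item_t, item_h, inference_rules):
    """item = (lemma, negated?) for an entity or event predicate in T or H."""
    lemma_t, negated_t = item_t
    lemma_h, negated_h = item_h
    different_polarity = negated_t != negated_h
    same_lemma = lemma_t == lemma_h
    linked_by_rule = (lemma_t, lemma_h) in inference_rules or \
                     (lemma_h, lemma_t) in inference_rules
    return different_polarity and (same_lemma or linked_by_rule)

rules = {("monster", "ogre")}
corefer(("ogre", False), ("ogre", True), rules)      # True: same lemma, opposite polarity
corefer(("monster", True), ("ogre", False), rules)   # True: linked by an inference rule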

4.5 Using Multiple Parses

In our framework, which uses probabilistic inference followed by a classifier that learns thresholds, we can easily incorporate multiple parses to reduce errors due to misparsing. Parsing errors lead to errors in the logical form representation, which in turn can lead to erroneous entailments. If we can obtain multiple parses for a text T and query H, and hence multiple logical forms, this should increase our chances of getting a good estimate of the probability of H given T.

The default CCG parser that Boxer uses is C&C (Clark and Curran 2004). This parser can be configured to produce multiple ranked parses (Ng and Curran 2012); however, we found that the top parses we get from C&C are usually not diverse enough and map to the same logical form. Therefore, in addition to the top C&C parse, we use the top parse from another recent CCG parser, EasyCCG (Lewis and Steedman 2014).

Therefore, for a natural language text NT and query NH, we obtain two parses each, say ST1 and ST2 for T and SH1 and SH2 for H, which are transformed to logical forms T1, T2, H1, H2. We now compute probabilities for all possible combinations of representations of NT and NH: the probability of H1 given T1, the probability of H1 given T2, and conversely also the probabilities of H2 given either T1 or T2. If the task is textual entailment with the three categories entailment, neutral, and contradiction, then, as described in Section 4.1, we also compute the probability of ¬H1 given either T1 or T2, and the probability of ¬H2 given either T1 or T2. When we use multiple parses in this manner, the thresholding classifier is simply trained to take in all of these


probabilities as features. In Section 7, we evaluate using C&C alone and using both parsers.

5. Knowledge Base Construction

This section discusses the automated construction of the knowledge base, which includes the use of distributional information to predict lexical and phrasal entailment. This section integrates two aims that are conflicting to some extent, as alluded to in the Introduction. The first is to show that a general-purpose in-depth natural language understanding system based on both logical form and distributional representations can be adapted to perform the RTE task well enough to achieve state-of-the-art results. To achieve this aim, we build a classifier for lexical and phrasal entailment that includes many task-specific features that have proven effective in state-of-the-art systems (Marelli et al. 2014a; Bjerva et al. 2014; Lai and Hockenmaier 2014). The second aim is to provide a framework in which we can test different distributional approaches on the task of lexical and phrasal entailment as a building block in a general textual entailment system. To achieve this second aim, in Section 7 we provide an in-depth ablation study and error analysis for the effect of different types of distributional information within the lexical and phrasal entailment classifier.

Because the biggest computational bottleneck for MLNs is the creation of the network, we do not want to add a large number of inference rules blindly to a given text/query pair. Instead, we first examine the text and query to determine inference rules that are potentially useful for this particular entailment problem. For pre-existing rule collections, we add all possibly matching rules to the inference problem (Section 5.3). For more flexible lexical and phrasal entailment, we use the text/query pair to determine additionally useful inference rules, then automatically create and weight these rules. We use a variant of Robinson resolution (Robinson 1965) to compute the list of useful rules (Section 5.1), then apply a lexical and phrasal entailment classifier (Section 5.2) to weight them.

Ideally, the weights that we compute for inference rules should depend on the context in which the words appear. After all, the ability to take context into account in a flexible fashion is one of the biggest advantages of distributional models. Unfortunately, the textual entailment data that we use in this article does not lend itself to contextualization—polysemy just does not play a large role in any of the existing RTE data sets that we have used so far. Therefore, we leave this issue to future work.

5.1 Robinson Resolution for Alignment and Rule Extraction

To avoid undue complexity in the MLN, we only want to add inference rules specific to a given text T and query H. Earlier versions of our system generated distributional rules matching any word or short phrase in T with any word or short phrase in H. This includes many unnecessary rules; for example, for T: An ogre loves a princess and H: A monster likes a lady, the system generates rules linking ogre to lady. In this article, we use a novel method to generate only rules directly relevant to T and H: We assume that T entails H, and ask what missing rule set KB is necessary to prove this entailment. We use a variant of Robinson resolution (Robinson 1965) to generate this KB. Another way of viewing this technique is that it generates an alignment between words and phrases in T and words or phrases in H, guided by the logic.


Modified Robinson Resolution. Robinson resolution is a theorem-proving method for testing unsatisfiability that has been used in some previous RTE systems (Bos 2009). It assumes a formula in conjunctive normal form (CNF): a conjunction of clauses, where a clause is a disjunction of literals, and a literal is a negated or non-negated atom. More formally, the formula has the form ∀x1, . . . , xn. (C1 ∧ . . . ∧ Cm), where each Cj is a clause of the form L1 ∨ . . . ∨ Lk, and each Li is a literal, that is, an atom ai or a negated atom ¬ai. The resolution rule takes two clauses containing complementary literals and produces a new clause implied by them. Writing a clause C as the set of its literals, we can formulate the rule as:

C1 ∪ {L1}      C2 ∪ {L2}
────────────────────────
      (C1 ∪ C2)θ

where θ is a most general unifier of L1 and ¬L2.

In our case, we use a variant of Robinson resolution to remove the parts of text T and query H that the two sentences have in common. Instead of one set of clauses, we use two: one is the CNF of T, the other is the CNF of ¬H. The resolution rule is only applied to pairs of clauses where one clause is from T and the other from H. When no further applications of the resolution rule are possible, we are left with remainder formulas rT and rH. If rH contains the empty clause, then H follows from T without inference rules. Otherwise, inference rules need to be generated. In the simplest case, we form a single inference rule as follows. All variables occurring in rT or rH are existentially quantified, all constants occurring in rT or rH are un-skolemized to new universally quantified variables, and we infer the negation of rH from rT. That is, we form the inference rule

∀x1 . . . xn ∃y1 . . . ym. rT θ ⇒ ¬rH θ

where {y1, . . . , ym} is the set of all variables occurring in rT or rH, {a1, . . . , an} is the set of all constants occurring in rT or rH, and θ is the inverse of a substitution, θ : {a1 → x1, . . . , an → xn}, for distinct fresh variables x1, . . . , xn.

For example, consider T: An ogre loves a princess and H: A monster loves a princess. This gives us the following two clause sets. Note that all existential quantifiers have been eliminated through skolemization. The query is negated, so we obtain five clauses for T but only one for H.

T : {ogre(A)}, {princess(B)}, {love(C)}, {agent(C, A)}, {patient(C, B)}
¬H : {¬monster(x), ¬princess(y), ¬love(z), ¬agent(z, x), ¬patient(z, y)}

The resolution rule can be applied four times. After that, C has been unified with z (because we have resolved love(C) with love(z)), B with y (because we have resolved princess(B) with princess(y)), and A with x (because we have resolved agent(C, A) with agent(z, x)). The formula rT is ogre(A), and rH is ¬monster(A). So the rule that we generate is:

∀x.ogre(x)⇒ monster(x)

The modified Robinson resolution thus does two things at once: It removes words that T and H have in common, leaving the words for which inference rules are needed, and it aligns words and phrases in T with words and phrases in H through unification.
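The following is a rough sketch of this resolution step on the ogre/monster example, under simplifying assumptions: T consists of unit clauses, ¬H is a single clause, constants are uppercase and variables lowercase, and a T literal is resolved only when it has a unique complementary literal in ¬H. It illustrates the idea and is not the system's implementation.

def unify(atom1, atom2, subst):
    """Try to unify two atoms (pred, args) under subst; uppercase args are
    constants, lowercase args are variables. Returns an extended substitution
    or None if unification fails."""
    pred1, args1 = atom1
    pred2, args2 = atom2
    if pred1 != pred2 or len(args1) != len(args2):
        return None
    subst = dict(subst)
    for a, b in zip(args1, args2):
        a, b = subst.get(a, a), subst.get(b, b)
        if a == b:
            continue
        if b.islower():        # b is an unbound variable: bind it
            subst[b] = a
        elif a.islower():      # a is an unbound variable: bind it
            subst[a] = b
        else:                  # two different constants: clash
            return None
    return subst

# T as unit clauses, ¬H as one clause of (implicitly negated) literals.
t_clauses = [("ogre", ("A",)), ("princess", ("B",)), ("love", ("C",)),
             ("agent", ("C", "A")), ("patient", ("C", "B"))]
neg_h = [("monster", ("x",)), ("princess", ("y",)), ("love", ("z",)),
         ("agent", ("z", "x")), ("patient", ("z", "y"))]

subst, r_t, r_h = {}, list(t_clauses), list(neg_h)
for t_atom in t_clauses:
    # resolve only when the T literal has a unique complementary literal in ¬H
    matches = [h for h in r_h if unify(t_atom, h, subst) is not None]
    if len(matches) == 1:
        subst = unify(t_atom, matches[0], subst)
        r_h.remove(matches[0])
        r_t.remove(t_atom)

print(r_t, r_h, subst)
# [('ogre', ('A',))] [('monster', ('x',))] {'y': 'B', 'z': 'C', 'x': 'A'}
# Un-skolemizing the constant A gives the rule  ∀x. ogre(x) ⇒ monster(x)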


One important refinement to this general idea is that we need to distinguish content predicates, which correspond to content words (nouns, verbs, and adjectives) in the sentences, from non-content predicates such as Boxer's meta-predicates agent(X, Y). Resolving on non-content predicates can result in incorrect rules—for example, in the case of T: A person solves a problem and H: A person finds a solution to a problem, in CNF:

T : {person(A)}, {solve(B)}, {problem(C)}, {agent(B, A)}, {patient(B, C)}
¬H : {¬person(x), ¬find(y), ¬solution(z), ¬problem(u), ¬agent(y, x), ¬patient(y, z), ¬to(z, u)}

If we resolve patient(B, C) with patient(y, z), we unify the problem C with the solution z, leading to a wrong alignment. We avoid this problem by resolving on non-content predicates only when they are fully grounded (that is, when the substitution of variables with constants has already been done by some other resolution step involving content predicates).

In this variant of Robinson resolution, we currently do not perform any search, but unify two literals only if they are fully grounded or if the literal in T has a unique literal in H that it can be resolved with, and vice versa. This works for most pairs in the SICK data set. In future work, we would like to add search to our algorithm, which will help produce better rules for sentences with duplicate words.

Rule Refinement. The modified Robinson resolution algorithm gives us one rule per text/query pair. This rule needs postprocessing, as it is sometimes too short (omitting relevant context), and often it combines what should be several inference rules.

In many cases, a rule needs to be extended. This is the case when the rule only shows the difference between text and query and is therefore too short, needing more context to be usable as a distributional rule. For example, for T: A dog is running in the snow and H: A dog is running through the snow, the rule we get is ∀x, y. in(x, y) ⇒ through(x, y). Although this rule is correct, it does not carry enough information to compute a meaningful vector representation for each side. What we would like instead is a rule that infers “run through snow” from “run in snow.”

Remember that the variables x and y were Skolem constants in rT and rH, for example, rT : in(R, S) and rH : through(R, S). We extend the rule by adding the content words that contain the constants R and S. In this case, we add the running event and the snow back in. The final rule is: ∀x, y. run(x) ∧ in(x, y) ∧ snow(y) ⇒ run(x) ∧ through(x, y) ∧ snow(y).

In some cases, however, extending the rule adds unnecessary complexity, and we have no general algorithm for deciding when to extend a rule, which would have to take context into account. At this time, we extend all rules as described here. As discussed next, the entailment rules subsystem can itself choose to split long rules, and it may choose to split these extended rules again.

Sometimes, long rules need to be split. A single pair T and H gives rise to one single pair rT and rH, which often conceptually represents multiple inference rules. So we split rT and rH as follows. First, we split each formula into disconnected sets of predicates. For example, consider T: The doctors are healing a man, H: The doctor is helping the patient, which leads to the rule ∀x, y. heal(x) ∧ man(y) ⇒ help(x) ∧ patient(y). The formula rT is split into heal(x) and man(y) because the two literals do not have any variable in common and there is no relation (such as agent()) to link them. Similarly, rH is split into help(x)


and patient(y). If any of the splits has more than one verb, we split it again, where each new split contains one verb and its arguments.

After that, we create new rules that link any part of rT to any part of rH with which it has at least one variable in common. So, for our example, we obtain ∀x. heal(x) ⇒ help(x) and ∀y. man(y) ⇒ patient(y).
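A rough sketch of this splitting-and-linking step is shown below, representing each side of the rule as (predicate, variables) literals; the helper names are hypothetical and the real system works on Boxer logical forms rather than tuples.

from itertools import combinations

def split_components(literals):
    """Group literals that are linked (transitively) by shared variables."""
    groups = [{i} for i in range(len(literals))]
    def find(i):
        for g in groups:
            if i in g:
                return g
    for i, j in combinations(range(len(literals)), 2):
        if set(literals[i][1]) & set(literals[j][1]):
            gi, gj = find(i), find(j)
            if gi is not gj:
                gi |= gj
                groups.remove(gj)
    return [[literals[i] for i in sorted(g)] for g in groups]

r_t = [("heal", ("x",)), ("man", ("y",))]
r_h = [("help", ("x",)), ("patient", ("y",))]

t_parts = split_components(r_t)   # [[('heal', ('x',))], [('man', ('y',))]]
h_parts = split_components(r_h)

# Link any T part to any H part with which it shares at least one variable.
rules = [(tp, hp) for tp in t_parts for hp in h_parts
         if {v for lit in tp for v in lit[1]} & {v for lit in hp for v in lit[1]}]
for tp, hp in rules:
    print(tp, "=>", hp)   # heal(x) => help(x)   and   man(y) => patient(y)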

There are cases where splitting the rule does not work, for example, with A person, who is riding a bike ⇒ A biker. Here, splitting the rule and using person ⇒ biker loses crucial context information. So we do not perform those additional splits at the level of the logical form, though the entailment rules subsystem may choose to do further splits.

Rules as Training Data. The output from the previous steps is a set of rules {r1, ..., rn} for each pair T and H. One use of these rules is to test whether T probabilistically entails H. But there is a second use, too: The lexical and phrasal entailment classifier that we describe below is a supervised classifier, which needs training data. So we use the training part of the SICK data set to create rules through modified Robinson resolution, which we then use to train the lexical and phrasal entailment classifier. For simplicity, we translate the Robinson resolution rules into textual rules by replacing each Boxer predicate with its corresponding word.

Computing inference-rule training data from RTE data requires deriving labels for individual rules from the labels on RTE pairs (entailment, contradiction, and neutral). The entailment cases are the most straightforward: knowing that T ∧ r1 ∧ . . . ∧ rn ⇒ H, it must be that all ri are entailing. We automatically label all ri of the entailing pairs as entailing rules.

For neutral pairs, we know that T ∧ r1 ∧ . . . ∧ rn ⇏ H, so at least one of the ri is non-entailing. We experimented with automatically labeling all ri as non-entailing, but that adds much noise to the training data. For example, if T: A man is eating an apple and H: A guy is eating an orange, then the rule man ⇒ guy is entailing, but the rule apple ⇒ orange is non-entailing. So we automatically compare the ri from a neutral pair to the entailing rules derived from entailing pairs. All rules ri found among the entailing rules from entailing pairs are assumed to be entailing (unless n = 1, that is, unless we only have one rule), and all other rules are assumed to be non-entailing. We found that this step improved the accuracy of our system. To further improve the accuracy, we performed a manual annotation of rules derived from neutral pairs, focusing only on the rules that do not appear among the entailing rules. We labeled rules as either entailing or non-entailing. From around 5,900 unique rules, we found 737 to be entailing. In future work, we plan to use multiple instance learning (Dietterich, Lathrop, and Lozano-Perez 1997; Bunescu and Mooney 2007) to avoid manual annotation; we discuss this further in Section 8.

For contradicting pairs, we make a few simplifying assumptions that fit almost all such pairs in the SICK data set. In most of the contradiction pairs in SICK, one of the two sentences T or H is negated. For pairs where T or H has a negation, we assume that this negation is negating the whole sentence, not just a part of it. We first consider the case where T is not negated, and H = ¬Sh. As T contradicts H, it must hold that T ⇒ ¬H, so T ⇒ ¬¬Sh, and hence T ⇒ Sh. This means that we just need to run our modified Robinson resolution with the sentences T and Sh and label all resulting ri as entailing.

Next we consider the case where T = ¬St and H is not negated. As T contradicts H, it must hold that ¬St ⇒ ¬H, so H ⇒ St. Again, this means that we just need to run the modified Robinson resolution with H as the “Text” and St as the “Hypothesis” and label all resulting ri as entailing.

The last case of contradiction is when both T and H are not negated, for example, T: A man is jumping into an empty pool, H: A man is jumping into a full pool, where


empty and full are antonyms. As before, we run the modified Robinson resolution with T and H and obtain the resulting ri. Similar to the neutral pairs, at least one of the ri is a contradictory rule, whereas the rest could be entailing or contradictory rules. As for the neutral pairs, we take a rule ri to be entailing if it is among the entailing rules derived so far. All other rules are taken to be contradictory rules. We did not do the manual annotation for these rules because they are few.

5.2 The Lexical and Phrasal Entailment Rule Classifier

After extracting lexical and phrasal rules using our modified Robinson resolution (Section 5.1), we use several combinations of distributional information and lexical resources to build a lexical and phrasal entailment rule classifier (entailment rule classifier for short) for weighting the rules appropriately. These extracted rules create an especially valuable resource for testing lexical entailment systems, as they contain a variety of entailment relations (hypernymy, synonymy, antonymy, etc.) and are actually useful in an end-to-end RTE system.

We describe the entailment rule classifier in multiple parts. In Section 5.2.1, we overview the lexical entailment rule classifier, which only handles single words. Section 5.2.2 describes the distributional spaces used. In Section 5.2.3, we describe how our previous work in supervised hypernymy detection is used in the system. In Section 5.2.4, we describe the approaches for extending the classifier to handle phrases.

5.2.1 Lexical Entailment Rule Classifier. We begin by describing the lexical entailment rule classifier, which only predicts entailment between single words, treating the task as a supervised classification problem, given the lexical rules constructed from the modified Robinson resolution as input. We use numerous features that we expect to be predictive of lexical entailment. Many were previously shown to be successful for the SemEval 2014 Shared Task on lexical entailment (Marelli et al. 2014a; Bjerva et al. 2014; Lai and Hockenmaier 2014). Altogether, we use four major groups of features, as summarized in Table 1 and described in detail here.

Wordform Features. We extract a number of simple features based on the usage of the LHS and RHS in their original sentences. We extract features for whether the LHS and RHS have the same lemma, same surface form, and same POS, which POS tags they have, and whether they are singular or plural. Plurality is determined from the POS tags.

WordNet Features. We use WordNet 3.0 to determine whether the LHS and RHS have known synonymy, antonymy, hypernymy, or hyponymy relations. We disambiguate between multiple synsets for a lemma by selecting the synsets for the LHS and RHS that minimize their path distance. If no path exists, we choose the most common synset for the lemma. Path similarity, as implemented in the Natural Language Toolkit (Bird, Klein, and Loper 2009), is also used as a feature.
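For illustration, the synset selection and a hypernymy check could be sketched with NLTK roughly as follows (assuming the WordNet corpus is installed; this is not necessarily the authors' exact procedure):

from nltk.corpus import wordnet as wn

def best_synset_pair(lhs_lemma, rhs_lemma, pos=None):
    """Pick the LHS/RHS synset pair with maximal path similarity
    (i.e., minimal path distance); fall back to the most common synsets."""
    lhs_synsets = wn.synsets(lhs_lemma, pos=pos)
    rhs_synsets = wn.synsets(rhs_lemma, pos=pos)
    if not lhs_synsets or not rhs_synsets:
        return None, None, 0.0           # the OOV feature fires instead
    best = (lhs_synsets[0], rhs_synsets[0], None)
    for s1 in lhs_synsets:
        for s2 in rhs_synsets:
            sim = s1.path_similarity(s2)
            if sim is not None and (best[2] is None or sim > best[2]):
                best = (s1, s2, sim)
    if best[2] is None:                  # no path: use the most common synsets
        return lhs_synsets[0], rhs_synsets[0], 0.0
    return best

s1, s2, path_sim = best_synset_pair("ogre", "monster", pos=wn.NOUN)
# LHS is a hypernym of RHS if it appears in the RHS synset's hypernym closure
is_hyper = s1 in s2.closure(lambda s: s.hypernyms()) if s1 and s2 else False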

Distributional Features. We measure distributional similarity in two distributional spaces: one that models topical similarity (bag of words; BoW), and one that models syntactic similarity (dependency; Dep). We use the cosine similarity of the LHS and RHS in both spaces as features.

One very important feature set derived from distributional similarity is the histogram binning of the cosines. We create 12 additional binary, mutually exclusive features, which mark whether the distributional similarity is within a given range. We use the ranges of exactly 0, exactly 1, 0.01–0.09, 0.10–0.19, . . . , 0.90–0.99. Figure 4 shows the importance of these histogram features: Words that are very similar (0.90–0.99) are much less likely to be entailing than words that are moderately similar (0.70–0.89). This is because the most highly similar words are likely to be cohyponyms.
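A minimal sketch of this binning, assuming cosine values lie in [0, 1]:

def cosine_bins(cos):
    """12 mutually exclusive binary features: exactly 0, exactly 1,
    and the ten ranges 0.01-0.09, 0.10-0.19, ..., 0.90-0.99."""
    bins = [0.0] * 12
    if cos == 0.0:
        bins[0] = 1.0
    elif cos == 1.0:
        bins[1] = 1.0
    else:
        bins[2 + min(int(cos * 10), 9)] = 1.0   # index of the 0.1-wide range
    return bins

features = cosine_bins(0.87)   # fires only the 0.80-0.89 bin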


Table 1
List of features in the lexical entailment classifier, along with types and counts.

Name            Description                                              Type     #

Wordform (18)
  Same word     Same lemma, surface form                                 Binary   2
  POS           POS of LHS, POS of RHS, same POS                         Binary   10
  Sg/Pl         Whether LHS/RHS/both are singular/plural                 Binary   6

WordNet (18)
  OOV           True if a lemma is not in WordNet, or no path exists     Binary   1
  Hyper         True if LHS is hypernym of RHS                           Binary   1
  Hypo          True if RHS is hypernym of LHS                           Binary   1
  Syn           True if LHS and RHS are in the same synset               Binary   1
  Ant           True if LHS and RHS are antonyms                         Binary   1
  Path Sim      Path similarity (NLTK)                                   Real     1
  Path Sim Hist Bins of path similarity (NLTK)                           Binary   12

Distributional features (Lexical) (28)
  OOV           True if either lemma not in dist space                   Binary   2
  BoW Cosine    Cosine between LHS and RHS in BoW space                  Real     1
  Dep Cosine    Cosine between LHS and RHS in Dep space                  Real     1
  BoW Hist      Bins of BoW Cosine                                       Binary   12
  Dep Hist      Bins of Dep Cosine                                       Binary   12

Asymmetric Features (Roller, Erk, and Boleda 2014) (600)
  Diff          LHS dep vector − RHS dep vector                          Real     300
  DiffSq        (LHS dep vector − RHS dep vector), squared               Real     300


5.2.2 Preparing Distributional Spaces. As described in the previous section, we use distributional semantic similarity as features for the entailment rule classifier. Here we describe the preprocessing steps to create these distributional resources.

Corpus and Preprocessing. We use the BNC, ukWaC, and a 2014-01-07 copy of Wikipedia. All corpora are preprocessed using the Stanford CoreNLP parser. We collapse particle verbs into a single token, and all tokens are annotated with a (short) POS tag so that the same lemma with a different POS is modeled separately. We keep only content words (NN, VB, RB, JJ) appearing at least 1,000 times in the corpus. The final corpus contains 50,984 types and roughly 1.5B tokens.

Bag-of-Words Vectors. We filter all but the 51k chosen lemmas from the corpus, and create one sentence per line. We use Skip-Gram with Negative Sampling to create vectors (Mikolov et al. 2013). We use 300 latent dimensions, a window size of 20, and 15 negative samples. These parameters were not tuned, but chosen as reasonable defaults for the task. We use the large window size to ensure the BoW vectors capture more topical similarity, rather than the syntactic similarity that is modeled by the dependency vectors.
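For reference, comparable BoW vectors could be trained with these hyperparameters using, for example, gensim's skip-gram implementation; the toolkit and file name below are assumptions, not details from the article.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus_one_sentence_per_line.txt")  # hypothetical path
model = Word2Vec(
    sentences,
    vector_size=300,   # 300 latent dimensions
    window=20,         # large window -> more topical similarity
    negative=15,       # 15 negative samples
    sg=1,              # skip-gram with negative sampling
    min_count=1,       # the corpus is already filtered to the 51k lemmas
    workers=8,
)
vec = model.wv["ogre/NN"]  # tokens are lemma/POS in the filtered corpus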


Figure 4
Distribution of entailment relations on lexical items by cosine. Highly similar pairs (0.90–0.99) are less likely to be entailing than moderately similar pairs (0.70–0.89).

Dependency Vectors. We extract (lemma/POS, relation, context/POS) tuples from each of the Stanford Collapsed CC Dependency graphs. We filter tuples with lemmas not in our 51k chosen types. Following Baroni and Lenci (2010), we model inverse relations and mark them separately. For example, “red/JJ car/NN” will generate tuples for both (car/NN, amod, red/JJ) and (red/JJ, amod−1, car/NN). After extracting tuples, we discard all but the top 100k (relation, context/POS) pairs and build a vector space using lemma/POS as rows and (relation, context/POS) as columns. The matrix is transformed with Positive Pointwise Mutual Information and reduced to 300 dimensions using Singular Value Decomposition (SVD). We do not vary these parameters, but chose them as they performed best in prior work (Roller, Erk, and Boleda 2014).
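A compact sketch of the PPMI transformation followed by SVD, on a toy dense count matrix (the real matrix is large and sparse, and the authors' tooling may differ):

import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

def ppmi(counts):
    """Positive Pointwise Mutual Information of a co-occurrence count matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # lemma marginals
    col = counts.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row @ col))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

counts = np.random.poisson(0.05, size=(1000, 5000)).astype(float)  # toy data
weighted = ppmi(counts)
svd = TruncatedSVD(n_components=300, random_state=0)
dep_vectors = svd.fit_transform(sparse.csr_matrix(weighted))       # 1000 x 300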

5.2.3 Asymmetric Entailment Features. As an additional set of features, we also use the representation previously utilized by the asymmetric, supervised hypernymy classifier described by Roller, Erk, and Boleda (2014). Previously, this classifier was only used on artificial data sets, which encoded specific lexical relations, like hypernymy, cohyponymy, and meronymy. Here, we use its representation to encode just the three general relations: entailment, neutral, and contradiction.

The asymmetric features take inspiration from Mikolov, Yih, and Zweig (2013), who found that differences between distributional vectors often encode certain linguistic regularities, like vec(king) − vec(man) + vec(woman) ≈ vec(queen). In particular, the asymmetric classifier uses two sets of features, ⟨f, g⟩, where

fi(LHS, RHS) = LHSi − RHSi
gi(LHS, RHS) = fi(LHS, RHS)²

that is, the vector difference between the LHS and RHS vectors, and this difference vector squared. Both feature sets are extremely important to strong performance.
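In code, these 600 asymmetric features amount to concatenating the vector difference with its elementwise square, for example:

import numpy as np

def asymmetric_features(lhs_vec, rhs_vec):
    """Diff and DiffSq features over two 300-dimensional dependency vectors."""
    diff = lhs_vec - rhs_vec                     # f: 300 "Diff" features
    return np.concatenate([diff, diff ** 2])     # plus 300 "DiffSq" features

lhs, rhs = np.random.randn(300), np.random.randn(300)
features = asymmetric_features(lhs, rhs)         # shape (600,)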

For these asymmetric features, we use the Dependency space described earlier. We choose the Dep space because we previously found that spaces reduced using SVD


outperform word embeddings generated by the Skip-gram procedure. We do not use both spaces, because of the large number of features this creates.

Recently, there has been considerable work in detecting lexical entailments using only distributional vectors. The classifiers proposed by Fu et al. (2014), Levy et al. (2015), and Kruszewski, Paperno, and Baroni (2015) could have also been used in place of these asymmetric features, but we reserve evaluations of these models for future work.

5.2.4 Extending Lexical Entailment to Phrases. The lexical entailment rule classifier described in previous sections is limited to only simple rules, where the LHS and RHS are both single words. Many of the rules generated by the modified Robinson resolution are actually phrasal rules, such as little boy → child, or running → moving quickly. In order to model these phrases, we use two general approaches: First, we use a state-of-the-art compositional model to create vector representations of phrases, and then include the same similarity features described in the previous section. The full details of the compositional distributional model are described in Section 5.2.5.

In addition to a compositional distributional model, we also used a simple, greedy word aligner, similar to the one described by Lai and Hockenmaier (2014). This aligner works by finding the pair of words on the LHS and RHS that are most similar in a distributional space and marking them as “aligned.” The process is repeated until at least one side is completely exhausted. For example, for “red truck → big blue car,” we would align “truck” with “car” first, then “red” with “blue,” leaving “big” unaligned.
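A sketch of this greedy alignment, assuming a similarity function over word pairs (the toy similarity values below are made up for illustration):

def greedy_align(lhs_words, rhs_words, similarity):
    """Repeatedly align the most similar remaining (LHS, RHS) word pair."""
    lhs, rhs = list(lhs_words), list(rhs_words)
    alignments = []
    while lhs and rhs:
        best = max(((l, r) for l in lhs for r in rhs),
                   key=lambda pair: similarity(*pair))
        alignments.append(best)
        lhs.remove(best[0])
        rhs.remove(best[1])
    return alignments, lhs, rhs     # aligned pairs, unaligned leftovers

toy = {("truck", "car"): 0.8, ("red", "blue"): 0.6}
sim = lambda a, b: 1.0 if a == b else toy.get((a, b), toy.get((b, a), 0.1))
pairs, un_lhs, un_rhs = greedy_align(["red", "truck"], ["big", "blue", "car"], sim)
# pairs == [("truck", "car"), ("red", "blue")]; "big" remains unaligned on the RHS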

After performing the phrasal alignment, we compute a number of base features based on the results of the alignment procedure. These include values like the length of the rule, the percent of words unaligned, and so forth. We also compute all of the same features used in the lexical entailment rule classifier (Wordform, WordNet, Distributional) and compute their min/mean/max across all the alignments. We do not include the asymmetric entailment features, as the feature space then becomes extremely large. Table 2 contains a listing of all phrasal features used.

5.2.5 Phrasal Distributional Semantics. We build a phrasal distributional space based on the practical lexical function model of Paperno, Pham, and Baroni (2014). We again use as the corpus a concatenation of the BNC, ukWaC, and English Wikipedia, parsed with the Stanford CoreNLP parser. We focus on five types of dependency labels (amod, nsubj, dobj, pobj, acomp) and combine the governor and dependent words of these dependencies to form adjective–noun, subject–verb, verb–object, preposition–noun, and verb–complement phrases, respectively. We only retain phrases where both the governor and the dependent are among the 50K most frequent words in the corpus, resulting in 1.9 million unique phrases. The co-occurrence counts of the 1.9 million phrases with the 20K most frequent neighbor words in a 2-word window are converted to a Positive Pointwise Mutual Information matrix and reduced to 300 dimensions by performing SVD on a lexical space and applying the resulting representation to the phrase vectors, normalized to length 1.

Paperno, Pham, and Baroni (2014) represent a word as a vector, which represents the contexts in which the word can appear, along with a number of matrices, one for each type of dependent that the word can take. For a transitive verb like chase, this would be one matrix for subjects and one for direct objects.


Table 2
Features used in the Phrasal Entailment Classifier, along with types and counts.

Name             Description                                              Type     #

Base (9)
  Length         Length of rules                                          Real     2
  Length Diff    Length of LHS − length of RHS                            Real     1
  Aligned        Number of alignments                                     Real     1
  Unaligned      Number of unaligned words on LHS, RHS                    Real     2
  Pct aligned    Percentage of words aligned                              Real     1
  Pct unaligned  Percentage of words unaligned on LHS, RHS                Real     2

Distributional features (Paperno, Pham, and Baroni 2014) (16)
  Cosine         Cosine between mean constituent vectors                  Real     1
  Hist           Bins of cosine between mean constituent vectors          Binary   12
  Stats          Min/mean/max between constituent vectors                 Real     3

Lexical features of aligned words (192)
  Wordform       Min/mean/max of each Wordform feature                             54
  WordNet        Min/mean/max of each WordNet feature                              54
  Distributional Min/mean/max of each Distributional feature                       84

The representation of the phrase chases dog is then

vec(chase) + M^o_chase × vec(dog)

where × is matrix multiplication and M^o_chase is the direct-object matrix of chase. When the phrase is extended with cat to form cat chases dog, the representation is

vec(chase) + M^s_chase × vec(cat) + (vec(chase) + M^o_chase × vec(dog))

where M^s_chase is the subject matrix of chase.

For verbs, the practical lexical function model trains a matrix for each of the relations nsubj, dobj, and acomp; for adjectives, a matrix for amod; and for prepositions, a matrix for pobj. For example, the amod matrix of the adjective “red/JJ” is trained as follows. We collect all phrases in which “red/JJ” serves as adjective modifier (assuming the number of such phrases is N), like “red/JJ car/NN,” “red/JJ house/NN,” and so on, and construct two 300×N matrices Marg and Mph, where the ith column of Marg is the vector of the noun modified by “red/JJ” in the ith phrase (vec(car), vec(house), etc.), and the ith column of Mph is the vector of phrase i minus the vector of “red/JJ” (vec(red car) − vec(red), vec(red house) − vec(red), etc.), normalized to length 1. Then the amod matrix M^amod_red ∈ R^{300×300} of “red/JJ” can be computed via ridge regression. Given trained matrices, we compute the composition vectors by applying the functions recursively, starting from the lowest dependency.
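As a sketch of this training step, the amod matrix can be estimated with off-the-shelf ridge regression by treating the (noun vector, phrase-minus-adjective vector) pairs as a multi-output regression problem; the toy sizes and random data below are placeholders.

import numpy as np
from sklearn.linear_model import Ridge

d, n_phrases = 300, 500                      # toy sizes
M_arg = np.random.randn(n_phrases, d)        # rows: vectors of the modified nouns
M_ph = np.random.randn(n_phrases, d)         # rows: phrase vector - vec(red)

# Learn M_red such that  M_red @ vec(noun) ≈ vec(phrase) - vec(red)
ridge = Ridge(alpha=1.0, fit_intercept=False)
ridge.fit(M_arg, M_ph)
M_red = ridge.coef_                          # shape (300, 300)

vec_red, vec_car = np.random.randn(d), np.random.randn(d)
vec_red_car = vec_red + M_red @ vec_car      # practical lexical function composition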

As discussed earlier, some of the logical rules from Section 5.1 need to be split into multiple rules. We use the dependency parse to split long rules by iteratively


searching for the highest nodes in the dependency tree that occur in the logical rule, and identifying the logical rule words that are their descendants in phrases that the practical lexical function model can handle. After splitting, we perform greedy alignment on phrasal vectors to pair up rule parts. Similar to Section 5.2.4, we iteratively identify the pair of phrasal vectors on the LHS and RHS that have the highest cosine similarity until one side has no more phrases.

5.3 Precompiled Rules

The second group of rules is collected from existing databases. We collect rules from WordNet (Princeton University 2010) and the paraphrase collection PPDB (Ganitkevitch, Van Durme, and Callison-Burch 2013). We use simple string matching to find the set of rules that are relevant to a given text/query pair T and H. If the left-hand side of a rule is a substring of T and the right-hand side is a substring of H, the rule is added, and likewise for rules with LHS in H and RHS in T. Rules that go from H to T are important in case T and H are negated (e.g., T: No ogre likes a princess, H: No ogre loves a princess). The rule needed is love ⇒ like, which goes from H to T.

WordNet. WordNet (Princeton University 2010) is a lexical database of words grouped into sets of synonyms. In addition to grouping synonyms, it lists semantic relations connecting groups. We represent the information in WordNet as “hard” logical rules. The semantic relations we use are:

• Synonymy: ∀x. man(x) ⇔ guy(x)
• Hypernymy: ∀x. car(x) ⇒ vehicle(x)
• Antonymy: ∀x. tall(x) ⇔ ¬short(x)

One advantage of using logic is that it is a powerful representation that can effectively represent these different semantic relations.

Paraphrase Collections. Paraphrase collections are precompiled sets of rules (e.g., a person riding a bike ⇒ a biker). We translate paraphrase collections, in this case PPDB (Ganitkevitch, Van Durme, and Callison-Burch 2013), to logical rules. We use the Lexical, One-To-Many, and Phrasal sections of the XL version of PPDB.

We use a simple rule-based approach to translate natural-language rules to logic. First, we can make the assumption that the translation of a PPDB rule is going to be a conjunction of positive atoms. PPDB does contain some rules that are centrally about negation, such as deselected ⇒ not selected, but we skip those, as the logical form analysis already handles negation. As always, we want to include in KB only rules pertaining to a particular text/query pair T and H. Say LHS ⇒ RHS is a rule such that LHS is a substring of T and RHS is a substring of H. Then each word in LHS gets represented by a unary predicate applied to a variable, and likewise for RHS—note that we can expect the same predicates to appear in the logical forms L(T) and L(H) of the text and query. For example, if the rule is a person riding a bike ⇒ a biker, then we obtain the atoms person(p), riding(r), and bike(b) for the LHS, with variables p, r, b. We then add Boxer meta-predicates to the logical form for LHS, and likewise for RHS. Say that L(T) includes person(A) ∧ ride(B) ∧ bike(C) ∧ agent(B, A) ∧ patient(B, C) for constants A, B, and C; then we extend the logical form for LHS with agent(r, p) ∧ patient(r, b). We proceed


analogously for RHS. This gives us the logical forms: L(LHS) = person(p) ∧ agent(r, p) ∧ riding(r) ∧ patient(r, b) ∧ bike(b) and L(RHS) = biker(k).

The next step is to bind the variables in L(LHS) to those in L(RHS). In the example above, the variable k in the RHS should be matched with the variable p in the LHS. We determine these bindings using a simple rule-based approach: We manually define paraphrase rule templates for PPDB, which specify variable bindings. A rule template is conditioned on the part-of-speech tags of the words involved. In our example it is N1V2N3 ⇒ N1, which binds the variables of the first N on the left to the first N on the right, unifying the variables p and k. The final paraphrase rule is: ∀p, r, b. person(p) ∧ agent(r, p) ∧ riding(r) ∧ patient(r, b) ∧ bike(b) ⇒ biker(p). In case some variables in the RHS remain unbound, they are existentially quantified (e.g., ∀p. pizza(p) ⇒ ∃q. slice(p) ∧ of(p, q) ∧ pizza(q)).

Each PPDB rule comes with a set of similarity scores that we need to map to a single MLN weight. We use the simple log-linear equation suggested by Ganitkevitch, Van Durme, and Callison-Burch (2013) to map the scores into a single value:

weight(r) = −∑_{i=1}^{N} λi log ϕi    (6)

where r is the rule, N is the number of similarity scores provided for the rule r, ϕi is the value of the ith score, and λi is its scaling factor. For simplicity, following Ganitkevitch, Van Durme, and Callison-Burch, we set all λi to 1. To map this weight to a final MLN rule weight, we use the weight-learning method discussed in Section 6.3.

Handcoded Rules. We also add a few handcoded rules to the KB that we do not get from other resources. For the SICK data set, we only add several lexical rules where one side of the rule is the word nobody, for example, nobody ⇔ ¬somebody and nobody ⇔ ¬person.

6. Probabilistic Logical Inference

We now turn to the last of the three main components of our system, probabilistic logical inference. MLN inference is usually intractable, and using MLN implementations “out of the box” does not work for our application. This section discusses an MLN implementation that supports complex queries and uses the CWA to decrease problem size, hence making inference more efficient. Finally, this section discusses a simple weight-learning scheme to learn global scaling factors for weighted rules in KB from different sources.

6.1 Complex Formulas as Queries

Current implementations of MLNs like Alchemy (Kok et al. 2005) do not allow queries to be complex formulas; they can only calculate probabilities of ground atoms. This section discusses an inference algorithm for arbitrary query formulas.

The Standard Work-Around. Although current MLN implementations can only calculate probabilities of ground atoms, they can be used to calculate the probability of


a complex formula through a simple work-around. The complex query formula H is added to the MLN using the hard formula

H⇔ result(D) | ∞ (7)

where result(D) is a new ground atom that is not used anywhere else in the MLN. Then, inference is run to calculate the probability of result(D), which is equal to the probability of the formula H. However, this approach can be very inefficient for the most common form of queries, which are existentially quantified queries, for example:

H : ∃x, y, z. ogre(x) ∧ agent(y, x) ∧ love(y) ∧ patient(y, z) ∧ princess(z) (8)

Grounding the backward direction of the double implication is very problematic, because the existentially quantified formula is replaced with a large disjunction over all possible combinations of constants for the variables x, y, and z (Gogate and Domingos 2011). Converting this disjunction to clausal form becomes increasingly intractable as the number of variables and constants grows.

New Inference Method. Instead, we propose an inference algorithm to directly calculate the probability of complex query formulas. In MLNs, the probability of a formula is the sum of the probabilities of the possible worlds that satisfy it. Gogate and Domingos (2011) show that to calculate the probability of a formula H given a probabilistic knowledge base KB, it is enough to compute the partition function Z of KB with and without H added as a hard formula:

P(H | KB) = Z(KB ∪ {(H, ∞)}) / Z(KB)    (9)

Therefore, all we need is an appropriate algorithm to estimate the partition function Z of a Markov network. Then, we construct two ground networks, one with the query and one without, and estimate their Zs using that estimator. The ratio between the two Zs is the probability of H.
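Schematically, the computation looks as follows, where estimate_Z stands in for the SampleSearch estimator described next and is purely a placeholder:

def query_probability(kb, query, estimate_Z):
    """P(H | KB) = Z(KB with H as a hard formula) / Z(KB), as in Equation (9)."""
    z_without = estimate_Z(kb)                           # ground network without the query
    z_with = estimate_Z(kb + [(query, float("inf"))])    # query added with infinite weight
    return z_with / z_without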

We use SampleSearch (Gogate and Dechter 2011) to estimate the partition function. SampleSearch is an importance sampling algorithm that has been shown to be effective when there is a mix of probabilistic and deterministic (hard) constraints, a fundamental property of the inference problems we address. Importance sampling in general is problematic in the presence of determinism, because many of the generated samples violate the deterministic constraints, and they are rejected. Instead, SampleSearch uses a base sampler to generate samples and then uses a backtracking search with a SAT solver to modify the generated sample if it violates the deterministic constraints. We use an implementation of SampleSearch that uses a generalized belief propagation algorithm called Iterative Join-Graph Propagation (Dechter, Kask, and Mateescu 2002) as a base sampler. This version is available online (Gogate 2014).

For cases like the example H in Equation (8), we need to avoid generating a large disjunction because of the existentially quantified variables. So we replace H with its negation ¬H, replacing the existential quantifiers with universals, which are easier to ground and perform inference upon. Finally, we compute the probability of the query as P(H) = 1 − P(¬H). Note that replacing H with ¬H cannot make inference with the standard work-around faster, because with ¬H, the direction ¬H ⇒ result(D)


suffers from the same problem of existential quantifiers that we previously had with H ⇐ result(D).

6.2 Inference Optimization Using the Closed-World Assumption

This section explains why our MLN inference problems are computationally difficult, and then explains how the CWA can be used to reduce the problem size and speed up inference. For more details, see Beltagy and Mooney (2014).

In the inference problems we address, formulas are typically long, especially the query H. The number of ground clauses of a first-order formula is exponential in the number of variables in the formula; it is O(c^v), where c is the number of constants in the domain and v is the number of variables in the formula. For a moderately long formula, the number of resulting ground clauses is infeasible to process.

We have argued in Section 4.3 that for probabilistic inference problems based on natural language text/query pairs, it makes sense to make the closed-world assumption: If we want to know if the query is true in the situation or setting laid out in the text, we should take as false anything not said in the text. In our probabilistic setting, the CWA amounts to giving low prior probabilities to all ground atoms unless they can be inferred from the text and knowledge base. However, we found that a large fraction of the ground atoms cannot be inferred from the text and knowledge base, and their probabilities remain very low. As an approximation, we can assume that this small probability is exactly zero and these ground atoms are false, without significantly affecting the probability of the query. This will remove a large number of the ground atoms, which will dramatically decrease the size of the ground network and speed up inference.

We assume that all ground atoms are false by default unless they can be inferred from the text and the knowledge base T ∧ KB. For example:

T : ogre(O) ∧ agent(S, O) ∧ snore(S)

KB : ∀x. ogre(x)⇒ monster(x)

H : ∃x, y.monster(x) ∧ agent(y, x) ∧ snore(y)

Ground atoms {ogre(O), snore(S), agent(S, O)} are not false because they can be inferred from T. Ground atom monster(O) is also not false because it can be inferred from T ∧ KB. All other ground atoms are false.

Here is an example of how this simplifies the query H. H is equivalent to a disjunction of all its possible groundings: H : (monster(O) ∧ agent(S, O) ∧ snore(S)) ∨ (monster(O) ∧ agent(O, O) ∧ snore(O)) ∨ (monster(S) ∧ agent(O, S) ∧ snore(O)) ∨ (monster(S) ∧ agent(S, S) ∧ snore(S)). Setting all ground atoms to false except the inferred ones, then simplifying the expression, we obtain: H : monster(O) ∧ agent(S, O) ∧ snore(S). Notice that most ground clauses of H are removed because they are false. We are left just with the ground clauses that potentially have a non-zero probability. Dropping all false ground clauses leaves an exponentially smaller number of ground clauses in the ground network. Even though the inference problem remains exponential in principle, the problem is much smaller in practice, such that inference becomes feasible. In our experiments with the SICK data set, the number of ground clauses for the query ranges from 0 to 19,209 with mean 6. This shows that the CWA effectively reduces the number of ground clauses for the query from millions (or even


billions) to a manageable number. With the CWA, the number of inferrable ground atoms (ignoring ground atoms from the text) ranges from 0 to 245, with an average of 18.
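The grounding-level effect of the CWA can be sketched as follows for the example above: only groundings of the query whose atoms are all potentially non-false survive (a simplified illustration, not the MLN implementation):

from itertools import product

inferrable = {("ogre", ("O",)), ("snore", ("S",)), ("agent", ("S", "O")),
              ("monster", ("O",))}          # atoms inferrable from T and T ∧ KB
constants = ["O", "S"]

query = [("monster", ("x",)), ("agent", ("y", "x")), ("snore", ("y",))]

surviving = []
for x, y in product(constants, repeat=2):
    grounding = [(pred, tuple({"x": x, "y": y}[a] for a in args))
                 for pred, args in query]
    if all(atom in inferrable for atom in grounding):
        surviving.append(grounding)

print(surviving)   # only  monster(O) ∧ agent(S, O) ∧ snore(S)  survives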

6.3 Weight Learning

We use weighted rules from different sources: both PPDB weights (Section 5.3) and the confidence scores of the entailment rule classifier (Section 5.2). These weights are not necessarily on the same scale—for example, one source could produce systematically larger weights than the other. To map them into uniform weights that can be used within an MLN, we use weight learning. Similar to the work of Zirn et al. (2011), we learn a single mapping parameter for each source of rules that functions as a scaling factor:

MLNweight = scalingFactor × ruleWeight    (10)

We use a simple grid search to learn the scaling factors that optimize performance on the RTE training data.

Assuming that all rule weights are in [0, 1] (this is the case for classification confidence scores, and PPDB weights can be scaled), we also try the following mapping:

MLNweight = scalingFactor × log(ruleWeight / (1 − ruleWeight))    (11)

This function ensures that for an MLN with a single rule LHS ⇒ RHS | MLNweight, it is the case that P(RHS|LHS) = ruleWeight, given that scalingFactor = 1.
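Both mappings can be sketched in a few lines (the grid search over scaling factors happens outside this function):

import math

def mln_weight(rule_weight, scaling_factor, use_logit=False):
    if use_logit:
        # Equation (11): logit mapping, assuming rule_weight is in (0, 1)
        return scaling_factor * math.log(rule_weight / (1.0 - rule_weight))
    # Equation (10): simple linear scaling
    return scaling_factor * rule_weight

mln_weight(0.9, 1.0, use_logit=True)   # ≈ 2.197, so P(RHS|LHS) ≈ 0.9 for a single rule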

7. Evaluation

This section evaluates our system. First, we evaluate several lexical and phrasal distributional systems on the rules that we collected using modified Robinson resolution. This includes an in-depth analysis of different types of distributional information within the entailment rule classifier. Second, we use the best configuration we find in the first step as a knowledge base and evaluate our system on the RTE task using the SICK data set.

The SICK data set, which is described in Section 2, consists of 5,000 pairs for training and 4,927 for testing. Pairs are annotated for the RTE and STS tasks. We use the RTE annotations of the data set.

7.1 Evaluating the Entailment Rule Classifier

The entailment rule classifier described in Section 5.2 constitutes a large portion of the full system's end-to-end performance, but consists of many different feature sets providing different kinds of information. In this section, we thoroughly evaluate the entailment rule classifier, both quantitatively and qualitatively, to identify the individual and holistic value of each feature set and systematic patterns. However, this evaluation may also be used as a framework by future lexical semantics research to see its value in end-to-end textual entailment systems. For example, we could have also included features corresponding to the many measures of distributional inclusion that were developed to predict hypernymy (Weeds, Weir, and McCarthy 2004;


Kotlerman et al. 2010; Lenci and Benotto 2012; Santus 2013), or other supervised lexical entailment classifiers (Baroni et al. 2012; Fu et al. 2014; Weeds et al. 2014; Levy et al. 2015; Kruszewski, Paperno, and Baroni 2015).

Evaluation is broken into four parts: First, we overview the performance of the entire entailment rule classifier on all rules, both lexical and phrasal. We then break down these results into performance on only lexical rules and only phrasal rules. Finally, we look at only the asymmetric features to address concerns raised by Levy et al. (2015). In all sections, we evaluate the lexical rule classifier on its ability to generalize to new word pairs, as well as the full system's performance when the entailment rule classifier is used as the only source of knowledge.

Overall, we find that distributional semantics is of vital importance to the lexical rule classifier and the end-to-end system, especially when word relations are not explicitly found in WordNet. The introduction of syntactic distributional spaces and cosine binning is especially valuable and greatly improves performance over our own prior work. Contrary to Levy et al., we find that the asymmetric features provide better detection of hypernymy than memorizing prototypical hypernyms, but the prototype vectors better capture examples that occur very often in the data; explicitly using both does best. Finally, we find, to our surprise, that a state-of-the-art compositional distributional method (Paperno, Pham, and Baroni 2014) yields disappointing performance on phrasal entailment detection, though it does successfully identify non-entailments deriving from changing prepositions or semantic roles.

7.1.1 Experimental Set-up. We use the gold standard annotations described in Section 5.1. We perform 10-fold cross-validation on the annotated training set, using the same folds in all settings. Because some RTE sentence pairs require multiple lexical rules, we ensure that cross-validation folds are stratified across the sentences, so that the same sentence cannot appear in both training and testing. We use a Logistic Regression classifier with an L2 regularizer.6 Because we perform three-way classification, we train models using one-vs-all.
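A sketch of this setup with scikit-learn, where GroupKFold keeps rules from the same sentence pair in a single fold; the data below are random placeholders and the exact classifier settings are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# toy stand-ins for the rule features, labels, and sentence-pair identifiers
X = np.random.randn(200, 50)
y = np.random.choice(["entailing", "neutral", "contradicting"], size=200)
sentence_ids = np.random.randint(0, 40, size=200)   # which SICK pair a rule came from

# L2-regularized Logistic Regression, trained one-vs-all for the 3-way task
clf = OneVsRestClassifier(LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
scores = cross_val_score(clf, X, y, groups=sentence_ids, cv=GroupKFold(n_splits=10))
print("intrinsic accuracy:", scores.mean())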

Performance is measured in two main metrics. Intrinsic accuracy measures how the classifier performs in the cross-validation setting on the training data. This corresponds to treating lexical and phrasal entailment as a basic supervised learning problem. RTE accuracy is accuracy on the end task of textual entailment using the predictions of the entailment rule classifier. For RTE accuracy, the predictions of the entailment rule classifier were used as the only knowledge base in the RTE system. RTE training accuracy uses the predictions from the cross-validation experiment, and for RTE test accuracy the entailment rule classifier was trained on the whole training set.

7.1.2 Overall Lexical and Phrasal Entailment Evaluation. Table 3 shows the results of the entailment experiments on all rules, both lexical and phrasal. In order to give bounds on our system's performance, we present a baseline score (the entailment rule classifier always predicts non-entailing) and a ceiling score (the entailment rule classifier always predicts the gold standard annotation).

The ceiling score (the entailment rule classifier always predicts the gold standard annotation) does not achieve perfect performance. This is because of a number of different issues, including misparses, imperfect rules generated by the modified Robinson resolution, a few system inference timeouts, and various idiosyncrasies of the SICK data set.

6 We experimented with multiple classifiers, including Logistic Regression, Decision Trees, and SVMs (with polynomial, RBF, and linear kernels). We found that linear classifiers performed best, and chose Logistic Regression because it was used in Roller, Erk, and Boleda (2014) and Lai and Hockenmaier (2014).


Table 3
Cross-validation accuracy on entailment on all rules.

Feature set                   Intrinsic   RTE Train   RTE Test

Always guess neutral               64.3        73.9       73.3
Gold standard annotations         100.0        95.0       95.5

Base only                          64.3        73.8       73.4
Wordform only                      67.3        77.0       76.7
WordNet only                       75.1        81.9       81.3
Dist (Lexical) only                71.5        78.7       77.7
Dist (Phrasal) only                66.9        75.9       75.1
Asym only                          70.1        77.3       77.2

All features                       79.9        84.0       83.0


Another point to note is that WordNet is by far the strongest set of features for the task. This is unsurprising, as synonymy and hypernymy information from WordNet gives nearly perfect information for much of the task. There are some exceptions, such as woman ↛ man, or black ↛ white, which WordNet lists as antonyms, but which are not considered contradictions in the SICK data set (e.g., “T: A man is cutting a tomato” and “H: A woman is cutting a tomato” is not a contradiction). However, even though WordNet has extremely high coverage on this particular data set, it is still far from exhaustive: about a quarter of the rules have at least one pair of words for which WordNet relations could not be determined.
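
The kind of lookup behind the WordNet features can be sketched as follows; this is our illustration using NLTK's WordNet interface, not necessarily the access layer used by the system:

```python
# Illustrative synonym / hypernym / antonym lookup over WordNet (via NLTK).
from nltk.corpus import wordnet as wn

def wordnet_relation(lhs, rhs, pos=wn.NOUN):
    """Return a coarse WordNet relation between two lemmas, or None."""
    for s1 in wn.synsets(lhs, pos=pos):
        for s2 in wn.synsets(rhs, pos=pos):
            if s1 == s2:
                return "synonym"                       # entailing both ways
            if s2 in set(s1.closure(lambda s: s.hypernyms())):
                return "hypernym"                      # lhs entails rhs
            for lemma in s1.lemmas():
                if s2 in {a.synset() for a in lemma.antonyms()}:
                    return "antonym"                   # potential contradiction
    return None

# e.g., wordnet_relation("man", "person") -> "hypernym"
# pairs with no recoverable link return None, as for about a quarter of the rules
```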

The lexical distributional features do surprisingly well on the task, obtaining a test accuracy of 77.7 (Table 3). This indicates that, even with only distributional similarity, we do well enough to score in the upper half of systems in the original SemEval shared task, where the median test accuracy of all teams was 77.1 (Marelli et al. 2014a). Two components were critical to the increased performance over our own prior work: first, the use of multiple distributional spaces (one topical, one syntactic); second, the binning of cosine values. When using only the BoW cosine similarity as a feature, the classifier actually performs below baseline (50.0 intrinsic accuracy [IA]; compare to Table 4). Similarly, using only the syntactic cosine similarity as a feature also performs poorly (47.2 IA). However, adding binning to either improves performance (64.3 and 64.7 IA for BoW and Dep, respectively), and adding binning to both improves it further (68.8 IA, as reported in Table 4).
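
To make the binning concrete, here is a minimal sketch; the bin boundaries are illustrative and not necessarily those used in the system. Discretizing the cosine into indicator features lets the linear classifier learn a non-monotonic mapping from similarity to entailment:

```python
# Illustrative cosine binning: one-hot indicators instead of a raw score.
import numpy as np

def bin_cosine(cos, edges=(0.2, 0.4, 0.6, 0.8)):
    """Map a cosine value to a one-hot vector over len(edges)+1 bins."""
    one_hot = np.zeros(len(edges) + 1)
    one_hot[np.searchsorted(edges, cos)] = 1.0
    return one_hot

# One block per distributional space: e.g., a BoW cosine of 0.55 and a
# syntactic (Dep) cosine of 0.72 become two concatenated one-hot blocks.
features = np.concatenate([bin_cosine(0.55), bin_cosine(0.72)])
```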

The phrasal distributional similarity features, which are based on the state-of-the-art Paperno, Pham, and Baroni (2014) compositional vector space, perform somewhat disappointingly on the task. We discuss possible reasons for this below in Section 7.1.4.

We also note that the Basic Alignment features and WordForm features (described in Tables 1 and 2) do not do particularly well on their own. This is encouraging, as it means the data set cannot be handled by simply expecting the same words to appear on the LHS and RHS. Finally, we note that the features are highly complementary, and the combination of all features gives a substantial boost to performance.

7.1.3 Evaluating the Lexical Entailment Rule Classifier. Table 4 shows the performance of the classifier on only the lexical rules, which have single words on the LHS and RHS.


Table 4
Cross-validation accuracy on entailment on lexical rules only.

Feature set                   Intrinsic   RTE Train   RTE Test

Always guess neutral               56.6        69.4       69.3
Gold standard annotations         100.0        93.2       94.6

Wordform only                      57.4        70.4       70.9
WordNet only                       79.1        83.1       84.2
Dist (Lexical) only                68.8        76.3       76.7
Asym only                          76.8        78.3       79.2

All features                       84.6        82.7       83.8

In these experiments we use the same procedure as before, but omit the phrasal rules from the data set. On the RTE tasks, we compute accuracy over only the SICK pairs that require at least one lexical rule. Note that a new ceiling score is needed, as some rules require both lexical and phrasal predictions, but we do not predict any phrasal rules.

Again we see that WordNet features have the highest contribution. Distributional rules still perform better than the baseline, but the gap between distributional features and WordNet is much more apparent. Perhaps most encouraging is the very high performance of the Asymmetric features: by themselves, they perform substantially better than just the distributional features. We investigate this further below in Section 7.1.5.

As with the entire data set, we once again see that all the features are highly complementary, and intrinsic accuracy is greatly improved by using all the features together. It may be surprising that these significant gains in intrinsic accuracy do not translate to improvements on the RTE tasks; in fact, there is a minor drop from using all features compared to only using WordNet. This most likely depends on which pairs the system gets right or wrong. For sentences involving multiple lexical rules, errors become disproportionately costly. As such, the high-precision WordNet predictions are slightly better on the RTE task.

In a qualitative analysis comparing a classifier with only cosine distributional features to a classifier with the full feature set, we found that, as expected, the distributional features miss many hypernyms and falsely classify many co-hyponyms as entailing: We manually analyzed a sample of 170 pairs that the distributional classifier falsely classifies as entailing. Of these, 67 were co-hyponyms (39%), 33 were antonyms (19%), and 32 were context-specific pairs like stir/fry. On the other hand, most (87%) cases of entailment that the distributional classifier detects but the all-features classifier misses are word pairs that have no link in WordNet. These pairs include note → paper, swimmer → racer, eat → bite, and stand → wait.

7.1.4 Evaluating the Phrasal Entailment Rule Classifier. Table 5 shows performance when looking at only the phrasal rules. As with the evaluation of lexical rules, we evaluate the RTE tasks only on sentence pairs that use phrasal rules, and do not provide any lexical inferences. As such, the ceiling score must again be recomputed.

We first notice that the phrasal subset is generally harder than the lexical subset: none of the feature sets on their own provide dramatic improvements over the baseline, or come particularly close to the ceiling score.


Table 5
Cross-validation accuracy on entailment on phrasal rules only.

Feature set                   Intrinsic   RTE Train   RTE Test

Always guess neutral               67.8        72.5       72.7
Gold standard annotations         100.0        91.9       92.8

Base only                          68.3        73.3       73.6
Wordform only                      72.5        77.1       77.1
WordNet only                       73.9        78.3       77.7
Dist (Lexical) only                72.9        77.0       76.5
Dist (Phrasal) only                71.9        75.7       75.3

All features                       77.8        79.7       78.8

On the other hand, using all features together does better than any of the feature groups by themselves, indicating again that the feature groups are highly complementary.

Distributional features perform rather close to the Wordform features, suggesting that the Distributional features may simply be proxies for the same-lemma and same-POS features. A qualitative analysis comparing the predictions of the Wordform and Distributional features shows otherwise, though: the Wordform features are best at correctly identifying non-entailing phrases (higher precision), and the distributional features are best at correctly identifying entailing phrases (higher recall).

As with the full data set, we see that the features based on Paperno, Pham, and Baroni do not perform as well as just the alignment-based distributional lexical features; in fact, they do not perform even as well as features that make predictions using only Wordform features. We qualitatively compare the Paperno, Pham, and Baroni features (or phrasal features for short) to the features based on word similarity of greedily aligned words (or alignment features). We generally find the phrasal features are much more likely to predict neutral, and the alignment-based features are much more likely to predict entailing. In particular, the phrasal vectors seem to be much better at capturing non-entailment based on differences in prepositions (walk inside building ↛ walk outside building), additional modifiers on the RHS (man ↛ old man, room ↛ darkened room), and changing semantic roles (man eats near kitten ↛ kitten eats). Surprisingly, we found the lexical distributional features were better at capturing complex paraphrases, such as teenage → in teens, ride bike → biker, or young lady → teenage girl.

7.1.5 Evaluating the Asymmetric Classifier. Levy et al. (2015) show several experiments suggesting that asymmetric classifiers do not perform better at the task of identifying hypernyms than when the RHS vectors alone are used as features. That is, they find that the asymmetric classifier and variants frequently learn to identify prototypical hypernyms rather than the hypernymy relation itself. We look at our data in the light of the Levy et al. study, particularly as none of the entailment problem sets used by Levy et al. were derived from an existing RTE data set.

In a qualitative analysis comparing the predictions of a classifier using only Asymmetric features with a classifier using only cosine similarity, we found that the Asymmetric classifier does substantially better at distinguishing hypernymy from co-hyponymy. This is what we had hoped to find, as we had previously found an Asymmetric classifier to perform well at identifying hypernymy in other data (Roller, Erk, and Boleda 2014), and cosine is known to heavily favor co-hyponymy (Baroni and Lenci 2011). However, we also find that cosine features are better at discovering synonymy, and that the Asymmetric classifier frequently mistakes antonyms as entailing. We also did a quantitative analysis comparing the predictions of a classifier using only Asymmetric features to a classifier that tries to learn typical hyponyms or hypernyms by using only the LHS vectors, only the RHS vectors, or both. Table 6 shows the results of these experiments.

Counter to the main findings of Levy et al. (2015), we find that there is at least some learning of the entailment relationship by the asymmetric classifier (particularly on the intrinsic evaluation), as opposed to the prototypical hypernym hypothesis. We believe this is because the data set is too varied to allow the classifier to learn what an entailing RHS looks like. Indeed, a qualitative analysis shows that the asymmetric features successfully predict many hypernyms that the RHS vectors miss. On the other hand, the RHS vectors do manage to capture particular semantic classes, especially on words that appear many times in the data set, like cut, slice, man, cliff, and weight.

The classifier given both the LHS and RHS vectors dramatically outperforms its components: It is given freedom to nearly memorize rules that appear commonly in the data. Still, using all three sets of features (Asym + LHS + RHS) is most powerful by a substantial margin. This feature set is able to capture the frequently occurring items, while also allowing some power to generalize to novel entailments. For example, by using all three we are able to capture some additional hypernyms (beer → drink, pistol → gun) and synonyms (couch → sofa, throw → hurl), as well as some more difficult entailments (hand → arm, young → little).
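
A sketch of how such a combined feature vector can be assembled is given below. For illustration we build the asymmetric block from the difference and squared difference of the two word vectors, which is one common instantiation and not necessarily the exact recipe used here:

```python
# Illustrative construction of the Asym + LHS + RHS feature vector.
import numpy as np

def asym_lhs_rhs_features(v_lhs, v_rhs):
    """Concatenate a direction-sensitive block with the raw word vectors."""
    diff = v_lhs - v_rhs
    asym = np.concatenate([diff, diff ** 2])      # asymmetric (Asym) block
    return np.concatenate([asym, v_lhs, v_rhs])   # Asym + LHS + RHS
```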

Still, there are many ways our lexical classifier could be improved, even using all of the features in the system. In particular, it does particularly poorly on antonyms (strike ↛ miss) and on items that require additional world knowledge (surfer → man). It also occasionally misclassifies some co-hyponyms (trumpet ↛ guitar) or gets the entailment direction wrong (toy ↛ ball).

7.2 RTE Task Evaluation

This section evaluates different components of the system, and finds a configuration of our system that achieves state-of-the-art results on the SICK RTE data set.

We evaluate the following system components. The component logic is our basic MLN-based logic system that computes two inference probabilities (Section 4.1).

Table 6
Cross-validation accuracy on entailment on lexical rules for Asym evaluation.

Feature set                   Intrinsic   RTE Train   RTE Test

Always guess neutral               56.6        69.4       69.3
Gold standard annotations         100.0        93.2       94.6

Asym only                          76.8        78.3       79.2

LHS only                           65.4        73.8       73.5
RHS only                           73.2        78.6       79.9
LHS + RHS                          76.4        79.8       80.6

Asym + LHS + RHS                   81.4        81.4       82.6


This includes the changes to the logical form to handle the domain closure assumption (Section 4.2), the inference algorithm for query formulas (Section 6.1), and the inference optimization (Section 6.2). The component cwa deals with the problem that the closed-world assumption raises for negation in the hypothesis (Section 4.3), and coref is coreference resolution to identify contradictions (Section 4.4). The component multiparse signals the use of two parsers, the top C&C parse and the top EasyCCG parse (Section 4.5).

The remaining components add entailment rules. The component eclassif adds the rules from the best performing entailment rule classifier trained in Section 7.1; this is the system with all features included. The ppdb component adds rules from the PPDB paraphrase collection (Section 5.3). The wlearn component learns a scaling factor for ppdb rules, and another scaling factor for the eclassif rules, that maps the classification confidence scores to MLN weights (Section 6.3). Without weight learning, the scaling factor for ppdb is set to 1, and all eclassif rules are used as hard rules (infinite weight). The wlearn log component is similar to wlearn but uses Equation (11), which first transforms a rule weight to its log odds. The wn component adds rules from WordNet (Section 5.3). In addition, we have a few handcoded rules (Section 5.3). Like wn, the components hyp and mem repeat information that is used as features for entailment rule classification but is not always picked up by the classifier. As the classifier sometimes misses hypernyms, hyp marks all hypernymy rules as entailing (so this component is subsumed by wn), as well as all rules where the left-hand side and the right-hand side are the same. (The latter step becomes necessary after splitting long rules derived by our modified Robinson resolution; some of the pieces may have equal left-hand and right-hand sides.) The mem component memorizes all entailing rules seen in the training set of eclassif.
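
For concreteness, the ablation switches evaluated below can be pictured as a configuration record like the following hypothetical sketch (the released system does not necessarily expose such a structure):

```python
# Hypothetical configuration record mirroring the component labels above.
config = {
    "logic": True,        # base MLN system computing the two inference probabilities
    "cwa": True,          # closed-world handling of negated hypotheses
    "coref": True,        # coreference resolution for contradiction detection
    "multiparse": True,   # use both the C&C and the EasyCCG parse
    "eclassif": True,     # rules from the entailment rule classifier
    "ppdb": False,        # rules from the PPDB paraphrase collection
    "wn": False,          # rules from WordNet
    "handcoded": True,    # a few handcoded rules
    "hyp": True,          # mark hypernymy / identity rules as entailing
    "mem": True,          # memorize entailing rules seen in training
    "wlearn": "log",      # weight learning: None, "linear", or "log"
}
```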

Sometimes inference takes a long time, so we set a 2-minute timeout for each inference run. If inference does not finish processing within the time limit, we terminate the process and return an error code. About 1% of the data set times out.
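
A minimal sketch of this timeout handling, assuming the MLN solver is launched as an external process (the command line below is hypothetical):

```python
# Illustrative 2-minute cap on a single inference run.
import subprocess

def run_inference(cmd, timeout_sec=120):
    """Run one inference problem; return None if it exceeds the time limit."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_sec)
        return result.stdout
    except subprocess.TimeoutExpired:
        return None   # caller treats this as an error code

# e.g., run_inference(["mln-solver", "problem.mln"])  # hypothetical command
```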

7.2.1 Ablation Experiment without eclassif. Because eclassif has the most impact on the system's accuracy, and when enabled suppresses the contribution of the other components, we evaluate the other components first without eclassif. In the following section, we add the eclassif rules. Table 7 summarizes the results of this experiment. The results show that each component plays a role in improving the system accuracy. Our best accuracy without eclassif is 80.4%. Without handling the problem of negated hypotheses (logic alone), P(¬H|T) is almost always 1 and this additional inference becomes useless, resulting in an inability to distinguish between neutral and contradiction. Adding cwa significantly improves accuracy because the resulting system has P(¬H|T) equal to 1 only for contradictions.

Each rule set (ppdb, wn, handcoded) improves accuracy by reducing the number of false negatives. We also note that applying weight learning (wlearn) to find a global scaling factor for PPDB rules makes them more useful. The learned scaling factor is 3.0. When the knowledge base is lacking other sources, weight learning assigns a high scaling factor to PPDB, giving it more influence throughout. When eclassif is added in the following section, weight learning assigns PPDB a low scaling factor because eclassif already includes a large set of useful rules, such that only the highest weighted PPDB rules contribute significantly to the final inference.


Table 7
Ablation experiment for the system components without eclassif.

Components Enabled                                             Train Acc.   Test Acc.

logic                                                               63.2        63.5
+ cwa                                                               72.1        71.7
+ cwa + coref                                                       73.8        73.4
+ cwa + coref + ppdb                                                75.3        74.8
+ cwa + coref + ppdb + wlearn                                       76.5        76.3
+ cwa + coref + ppdb + wlearn + wn                                  78.8        78.4
+ cwa + coref + ppdb + wlearn + wn + handcoded                      79.2        78.8
+ cwa + coref + ppdb + wlearn + wn + handcoded + multiparse         80.8        80.4

The last component tested is the use of multiple parses (multiparse). Many of the false negatives are due to misparses. Using two different parses reduces the impact of the misparses, improving the system accuracy.

7.2.2 Ablation Experiment with eclassif. In this experiment, we first use eclassif as a knowledge base, then incrementally add the other system components. Table 8 summarizes the results. First, we note that adding eclassif to the knowledge base significantly improves the accuracy from 73.4% to 83.0%. This is higher than what ppdb and wn achieved without eclassif. Adding handcoded still improves the accuracy somewhat.

Adding multiparse improves accuracy, but interestingly, not as much as in the previous experiment (without eclassif): the improvement on the test set decreases from 1.6% to 0.7%. Therefore, the rules in eclassif help reduce the impact of misparses. Here is an example to show how: T: An ogre is jumping over a wall, H: An ogre is jumping over the fence, which in logic are:

T: ∃x, y, z. ogre(x) ∧ agent(y, x) ∧ jump(y) ∧ over(y, z) ∧ wall(z)

H: ∃x, y, z. ogre(x) ∧ agent(y, x) ∧ jump(y) ∧ over(y) ∧ patient(y, z) ∧ fence(z)

Table 8
Ablation experiment for the system components with eclassif, and the best performing configuration.

Components Enabled                                              Train Acc.   Test Acc.

logic + cwa + coref                                                  73.8        73.4
logic + cwa + coref + eclassif                                       84.0        83.0
+ handcoded                                                          84.6        83.2
+ handcoded + multiparse                                             85.0        83.9
+ handcoded + multiparse + hyp                                       85.6        83.9
+ handcoded + multiparse + hyp + wlearn                              85.7        84.1
+ handcoded + multiparse + hyp + wlearn log                          85.9        84.3
+ handcoded + multiparse + hyp + wlearn log + mem                    93.4        85.1
+ handcoded + multiparse + hyp + wlearn log + mem + ppdb             93.4        84.9

current state of the art (Lai and Hockenmaier 2014)                     –        84.6


T should entail H (strictly speaking, a wall is not a fence, but this is a positive entailment example in SICK). The modified Robinson resolution yields the following rule:

F: ∀x, y. jump(x) ∧ over(x, y) ∧ wall(y) ⇒ jump(x) ∧ over(x) ∧ patient(x, y) ∧ fence(y)

Note that in T, the parser treats over as a preposition, whereas in H, jump over is treated as a particle verb. A lexical rule wall ⇒ fence is not enough to get the right inference because of this inconsistency in the parsing. The rule F reflects this parsing inconsistency. When F is translated to text for the entailment classifier, we obtain jump over wall ⇒ jump over fence, which is a simple phrase that the entailment classifier addresses without dealing with the complexities of the logic. Without the modified Robinson resolution, we would have had to resort to collecting “structural” inference rules like ∀x, y. over(x, y) ⇒ over(x) ∧ patient(x, y).

Table 8 also shows the impact of hyp and mem, two components that in principle should not add anything over eclassif but do add some accuracy because of noise in the training data of eclassif.

Weight learning results are shown in the rows wlearn and wlearn log. Both weight learning components help improve the system's accuracy. It is interesting to see that even though the SICK data set is not designed to evaluate “degree of entailment,” it is still useful to keep the rules uncertain (as opposed to using hard rules) and use probabilistic inference. Results also show that wlearn log performs slightly better than wlearn.
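
The two mappings can be sketched as follows, assuming a single learned scaling factor s and a classifier confidence p in (0, 1); Equation (11) itself is not repeated here, and the log-odds form below is our reading of its description:

```python
# Illustrative confidence-to-weight mappings for wlearn and wlearn log.
import math

def mln_weight_linear(p, s):
    """wlearn: scale the classifier confidence directly."""
    return s * p

def mln_weight_log_odds(p, s, eps=1e-6):
    """wlearn log: scale the log odds of the confidence."""
    p = min(max(p, eps), 1.0 - eps)     # keep the transform finite
    return s * math.log(p / (1.0 - p))
```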

Finally, adding ppdb does not improve the accuracy. Apparently, eclassif already captures all the useful rules that we were getting from ppdb. It is interesting to see that simple distributional information can subsume a large paraphrase database like PPDB. Adding wn (not shown in the table) leads to a slight decrease in accuracy.

The system comprising logic, cwa, coref, multiparse, eclassif, handcoded, hyp, wlearn log, and mem achieves a state-of-the-art accuracy score of 85.1% on the SICK test set. The entailment rule classifier eclassif plays a vital role in this result.

8. Future Work

One area to explore is contextualization. The evaluation of the entailment rule classifier showed that some of the entailments are context-specific, like put/pour (which are entailing only for liquids) or push/knock (which are entailing in the context of “pushing a toddler into a puddle”). Cosine-based distributional features were able to identify some of these cases when all other features did not. We would like to explore whether contextualized distributional word representations, which take the sentence context into account (Erk and Pado 2008; Thater, Furstenau, and Pinkal 2010; Dinu, Thater, and Laue 2012), can identify such context-specific lexical entailments more reliably.

We would also like to explore new ways of measuring lexical entailment. It is well known that cosine similarity gives high ratings to co-hyponyms (Baroni and Lenci 2011), and our evaluation confirmed that this is a problem for lexical entailment judgments, as co-hyponyms are usually not entailing. However, co-hyponymy judgments can be used to position unknown terms in the WordNet hierarchy (Snow, Jurafsky, and Ng 2006). This could be a new way of using distributional information in lexical entailment: using cosine similarity to position a term in an existing hierarchy, and then using the relations in the hierarchy for lexical entailment. Although distributional similarity is usually used only on individual word pairs, this technique would use distributional similarity to learn the meaning of unknown terms, given that many other terms are already known.

This article has focused on the RTE task, but we are interested in applying our system to other tasks, in particular question answering. This task is interesting because it may offer a wider variety of tasks to the distributional subsystem. Existing logic-based systems are usually applied to limited domains, such as querying a specific database (Berant et al. 2013; Kwiatkowski et al. 2013), but with our system, we have the potential to query a large corpus because we are using Boxer for wide-coverage semantic analysis. The general system architecture discussed in this article can be applied to the question answering task with some modifications. For knowledge base construction, the general idea of using theorem proving to infer rules still applies, but the details of the technique would differ considerably from the modified Robinson resolution in Section 5.1. For the probabilistic logic inference, scaling becomes a major challenge.

Another important extension of this work is to support generalized quantifiers in probabilistic logic. Some determiners, such as “few” and “most,” cannot be represented in standard first-order logic and are usually addressed using higher-order logics. It may be possible, however, to represent them using the probabilistic aspect of probabilistic logic, sidestepping the need for higher-order logic.

9. Conclusion

Being able to effectively represent natural language semantics is important and has many applications. We have introduced an approach that uses probabilistic logic to combine the expressivity and automated inference provided by logical representations with the ability of distributional semantics to capture the graded aspects of natural language. We evaluated this semantic representation on the RTE task, which requires deep semantic understanding. Our system maps natural-language sentences to logical formulas, uses them to build probabilistic logic inference problems, builds a knowledge base from precompiled resources and on-the-fly distributional resources, and then performs inference using Markov logic. Experiments demonstrated state-of-the-art performance on the recently introduced SICK RTE task.

Acknowledgments
This research was supported by the DARPA DEFT program under AFRL grant FA8750-13-2-0026, by NSF CAREER grant IIS 0845925, and by NSF grant IIS 1523637. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect the view of DARPA, DoD, or the US government. Some experiments were run on the Mastodon Cluster, supported by NSF grant EIA-0303609, and the Texas Advanced Computing Center (TACC) at The University of Texas at Austin.

References

Alshawi, Hiyan. 1992. The Core Language Engine. MIT Press.
Andrews, Mark, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3):463–498.
Bach, Stephen H., Bert Huang, Ben London, and Lise Getoor. 2013. Hinge-loss Markov random fields: Convex inference for structured prediction. In UAI 2013, pages 32–41, Bellevue, WA.
Baroni, Marco, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In EACL 2012, pages 23–32, Avignon.
Baroni, Marco, Raffaella Bernardi, and Roberto Zamparelli. 2014. Frege in space: A program for compositional distributional semantics. Linguistic Issues in Language Technology, 9(6):5–110.
Baroni, Marco and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.
Baroni, Marco and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10, Edinburgh.
Baroni, Marco and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In EMNLP 2010, pages 1183–1193, Cambridge, MA.
Beltagy, I., Cuong Chau, Gemma Boleda, Dan Garrette, Katrin Erk, and Raymond Mooney. 2013. Montague meets Markov: Deep semantics with probabilistic logical form. In *SEM 2013, pages 11–21, Atlanta, GA.
Beltagy, I. and Katrin Erk. 2015. On the proper treatment of quantifiers in probabilistic logic semantics. In IWCS 2015, pages 140–150, London.
Beltagy, I., Katrin Erk, and Raymond Mooney. 2014. Probabilistic soft logic for semantic textual similarity. In ACL 2014, pages 1210–1219, Baltimore, MD.
Beltagy, I. and Raymond J. Mooney. 2014. Efficient Markov logic inference for natural language semantics. In StarAI 2014, pages 9–14, Quebec City.
Berant, Jonathan, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP 2013, pages 1533–1544, Seattle, WA.
Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc.
Bjerva, Johannes, Johan Bos, Rob van der Goot, and Malvina Nissim. 2014. The meaning factory: Formal semantics for recognizing textual entailment and determining semantic similarity. In SemEval 2014, pages 642–646, Dublin.
Blackburn, Patrick and Johan Bos. 2005. Representation and Inference for Natural Language: A First Course in Computational Semantics. CSLI Publications, Stanford, CA.
Bos, Johan. 2008. Wide-coverage semantic analysis with Boxer. In Proceedings of Semantics in Text Processing (STEP-2008), pages 277–286, Venice.
Bos, Johan. 2009. Applying automated deduction to natural language understanding. Journal of Applied Logic, 7:100–112.
Broecheler, Matthias, Lilyana Mihalkova, and Lise Getoor. 2010. Probabilistic Similarity Logic. In UAI 2010, pages 73–82, Catalina Island, CA.
Brown, Susan Windisch. 2008. Choosing sense distinctions for WSD: Psycholinguistic evidence. In ACL 2008, pages 249–252, Columbus, OH.
Bruni, Elia, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In ACL 2012, pages 136–145, Jeju Island.
Bunescu, Razvan and Ray Mooney. 2007. Multiple instance learning for sparse positive bags. In ICML 2007, pages 105–112, Corvallis, OR.
Chang, Chih-Chung and Chih-Jen Lin. 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Clark, Stephen. 2012. Vector space models of lexical meaning. In J. van Bentham, M. Moortgat, and W. Buszkowski, editors, Handbook of Contemporary Semantics, 2nd edition. Wiley-Blackwell, pages 345–384.
Clark, Stephen and James R. Curran. 2004. Parsing the WSJ using CCG and log-linear models. In ACL 2004, pages 103–110, Barcelona.
Coecke, Bob, Mehrnoosh Sadrzadeh, and Stephen Clark. 2011. Mathematical foundations for a compositional distributed model of meaning. Linguistic Analysis, 36(1-4):345–384.
Cooper, Robin, Simon Dobnik, Shalom Lappin, and Staffan Larsson. 2015. Probabilistic type theory and natural language semantics. Linguistic Issues in Language Technology, 10(4):1–43.
Copestake, Ann and Dan Flickinger. 2000. An open-source grammar development environment and broad-coverage English grammar using HPSG. In LREC 2000, Athens.
Dagan, Ido, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.
Dechter, Rina, Kalev Kask, and Robert Mateescu. 2002. Iterative join-graph propagation. In UAI 2002, pages 128–136, Edmonton.
Dietterich, Thomas G., Richard H. Lathrop, and Tomas Lozano-Perez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71.
Dinu, Georgiana, Stefan Thater, and Soren Laue. 2012. A comparison of models of word meaning in context. In HLT-NAACL 2012, pages 611–615, Montreal.


Dowty, David R., Robert E. Wall, and Stanley Peters. 1981. Introduction to Montague Semantics. D. Reidel, Dordrecht, Holland.
Edmonds, Philip and Graeme Hirst. 2000. Reconciling fine-grained lexical knowledge and coarse-grained ontologies in the representation of near-synonyms. In Proceedings of the Workshop on Semantic Approximation, Granularity, and Vagueness, Breckenridge, CO.
Erk, Katrin and Sebastian Pado. 2008. A structured vector space model for word meaning in context. In EMNLP 2008, pages 897–906, Honolulu, HI.
Fu, Ruiji, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In ACL 2014, pages 1199–1209, Baltimore, MD.
Ganitkevitch, Juri, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In NAACL-HLT 2013, pages 758–764, Atlanta, GA.
Garrette, Dan, Katrin Erk, and Raymond Mooney. 2011. Integrating logical representations with probabilistic information using Markov logic. In IWCS 2011, pages 105–114, Oxford.
Geffet, Maayan and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In ACL 2005, pages 107–114, Ann Arbor, MI.
Genesereth, M. R. and N. J. Nilsson. 1987. Logical Foundations of Artificial Intelligence. Morgan Kaufman, San Mateo, CA.
Getoor, L. and B. Taskar. 2007. Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA.
Geurts, Bart. 2007. Existential import. In I. Comorovski and K. van Heusinger, editors, Existence: Syntax and Semantics. Springer, Dordrecht, pages 253–271.
Gogate, Vibhav. 2014. IJGP-sampling and SampleSearch. Software available at http://www.hlt.utdallas.edu/~vgogate/ijgp-samplesearch.html.
Gogate, Vibhav and Rina Dechter. 2011. SampleSearch: Importance sampling in presence of determinism. Artificial Intelligence, 175(2):694–729.
Gogate, Vibhav and Pedro Domingos. 2011. Probabilistic theorem proving. In UAI 2011, pages 256–265, Barcelona.
Grefenstette, Edward. 2013. Towards a formal distributional semantics: Simulating logical calculi with tensors. In *SEM 2013, pages 11–21, Atlanta, GA.
Grefenstette, Edward and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In EMNLP 2011, pages 1394–1404, Edinburgh, Scotland.
Herbelot, Aurelie and Eva Maria Vecchi. 2015. Building a shared world: Mapping distributional to model-theoretic semantic spaces. In EMNLP 2015, pages 22–32, Lisbon.
Hobbs, Jerry R., Mark E. Stickel, Paul Martin, and Douglas Edwards. 1988. Interpretation as abduction. In ACL 1988, pages 95–103, Buffalo, NY.
Kamp, Hans and Uwe Reyle. 1993. From Discourse to Logic. Kluwer.
Kok, Stanley, Parag Singla, Matthew Richardson, and Pedro Domingos. 2005. The Alchemy system for statistical relational AI. http://www.cs.washington.edu/ai/alchemy.
Kotlerman, Lili, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.
Kruszewski, German, Denis Paperno, and Marco Baroni. 2015. Deriving Boolean structures from distributional vectors. TACL, 3:375–388.
Kwiatkowski, Tom, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP 2013, pages 1545–1556, Seattle, WA.
Lai, Alice and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In SemEval 2014, pages 329–334, Dublin.
Landauer, Thomas and Susan Dumais. 1997. A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.
Lenci, Alessandro and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In *SEM 2012, pages 75–79, Montreal.
Levy, Omer, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In NAACL HLT 2015, pages 970–976, Denver, CO.
Lewis, Mike and Mark Steedman. 2013. Combined distributional and logical semantics. TACL, 1:179–192.
Lewis, Mike and Mark Steedman. 2014. A* CCG parsing with a supertag-factored model. In EMNLP 2014, pages 990–1000, Doha.
Liang, Percy, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL-HLT 2011, pages 590–599, Portland, OR.

Lund, Kevin and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28(2):203–208.
MacCartney, Bill and Christopher D. Manning. 2009. An extended model of natural logic. In IWCS 2009, pages 140–156, Tilburg.
Marelli, Marco, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014a. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In SemEval 2014, pages 1–8, Dublin.
Marelli, Marco, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014b. A SICK cure for the evaluation of compositional distributional semantic models. In LREC 2014, pages 216–223, Reykjavik.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR 2013, Scottsdale, AZ.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In NAACL-HLT 2013, pages 746–751, Atlanta, GA.
Mitchell, Jeff and Mirella Lapata. 2008. Vector-based models of semantic composition. In ACL 2008, pages 236–244, Columbus, OH.
Mitchell, Jeff and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(3):1388–1429.
Montague, Richard. 1970. Universal grammar. Theoria, 36:373–398.
Ng, Dominick and James R. Curran. 2012. Dependency hashing for n-best CCG parsing. In ACL 2012, pages 497–505, Jeju Island.
Paperno, Denis, Nghia The Pham, and Marco Baroni. 2014. A practical and linguistically-motivated approach to compositional distributional semantics. In ACL 2014, pages 90–99, Baltimore, MD.
Parsons, Terry. 1990. Events in the Semantics of English. MIT Press, Cambridge, MA.
Pasupat, Panupong and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In ACL 2015, pages 1470–1480, Beijing.
Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Poon, Hoifung and Pedro Domingos. 2008. Joint unsupervised coreference resolution with Markov logic. In EMNLP 2008, pages 650–659, Honolulu, HI.
Princeton University. 2010. About WordNet. http://wordnet.princeton.edu.
Richardson, Matthew and Pedro Domingos. 2006. Markov logic networks. Machine Learning, 62:107–136.
Riedel, Sebastian, Hong-Woo Chun, Toshihisa Takagi, and Jun'ichi Tsujii. 2009. A Markov logic approach to bio-molecular event extraction. In BioNLP 2009, pages 41–49, Boulder, CO.
Riedel, Sebastian and Ivan Meza-Ruiz. 2008. Collective semantic role labelling with Markov logic. In CoNLL 2008, pages 193–197, Manchester.
Robinson, J. A. 1965. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12(1):23–41.
Roller, Stephen, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In COLING 2014, pages 1025–1036, Dublin.
Sadrzadeh, Mehrnoosh, Stephen Clark, and Bob Coecke. 2013. The Frobenius anatomy of word meanings I: Subject and object relative pronouns. Journal of Logic and Computation, 23(6):1293–1317.
Santus, Enrico. 2013. SLQS: An entropy measure. Master's thesis, University of Pisa.
Schutze, Hinrich. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.
Silberer, Carina and Mirella Lapata. 2012. Grounded models of semantic representation. In EMNLP-CoNLL 2012, pages 1423–1433, Jeju Island.
Skolem, Thoralf. 1920. Logisch-kombinatorische Untersuchungen uber die Erfullbarkeit oder Beweisbarkeit mathematischer Satze. Skrifter utgit av Videnskapselskapet i Kristiania, 4:4–36.
Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In COLING/ACL 2006, pages 801–808, Sydney.
Strawson, P. F. 1950. On referring. Mind, 59:320–344.
Thater, Stefan, Hagen Furstenau, and Manfred Pinkal. 2010. Contextualizing semantic representations using syntactically enriched vector models. In ACL 2010, pages 948–957, Uppsala.


Tian, Ran, Yusuke Miyao, and Matsuzaki Takuya. 2014. Logical inference on dependency-based compositional semantics. In ACL 2014, pages 78–89, Baltimore, MD.
Turney, Peter and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.
van Eijck, Jan and Shalom Lappin. 2012. Probabilistic semantics for natural language. In Logic and Interactive Rationality (LIRA) Yearbook. Amsterdam Dynamics Group, pages 17–35.
van Eijck, Jan and Christina Unger. 2010. Computational Semantics with Functional Programming. Cambridge University Press.
Weeds, Julie, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In COLING 2014, pages 2249–2259, Dublin.
Weeds, Julie, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In COLING 2004, pages 1015–1021, Geneva.
Zelle, John M. and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI 1996, pages 1050–1055, Portland, OR.
Zirn, Cacilia, Mathias Niepert, Heiner Stuckenschmidt, and Michael Strube. 2011. Fine-grained sentiment analysis with structural features. In IJCNLP 2011, pages 336–344, Chiang Mai.
