
Deriving Boolean structures from distributional vectors

German Kruszewski, Denis Paperno, Marco Baroni
Center for Mind/Brain Sciences, University of Trento
[email protected]

Abstract

Corpus-based distributional semantic models capture degrees of semantic relatedness among the words of very large vocabularies, but have problems with logical phenomena such as entailment, that are instead elegantly handled by model-theoretic approaches, which, in turn, do not scale up. We combine the advantages of the two views by inducing a mapping from distributional vectors of words (or sentences) into a Boolean structure of the kind in which natural language terms are assumed to denote. We evaluate this Boolean Distributional Semantic Model (BDSM) on recognizing entailment between words and sentences. The method achieves results comparable to a state-of-the-art SVM, degrades more gracefully when less training data are available and displays interesting qualitative properties.

1 Introduction

Different aspects of natural language semantics have been studied from different perspectives. Distributional semantic models (Turney and Pantel, 2010) induce large-scale vector-based lexical semantic representations from statistical patterns of word usage. These models have proven successful in tasks relying on meaning relatedness, such as synonymy detection (Landauer and Dumais, 1997), word sense discrimination (Schutze, 1997), or even measuring phrase plausibility (Vecchi et al., 2011). On the other hand, logical relations and operations, such as entailment, contradiction, conjunction and negation, receive an elegant treatment in formal semantic models. The latter lack, however, general procedures to learn from data, and consequently have problems scaling up to real-life problems.

Formal semantics captures fundamental aspects of meaning in set-theoretic terms: Entailment, for example, is captured as the inclusion relation between the sets (of the relevant type) denoted by words or other linguistic expressions, e.g., sets of possible worlds that two propositions hold of (Chierchia and McConnell-Ginet, 2000, 299). In finite models, a mathematically convenient way to represent these denotations is to encode them in Boolean vectors, i.e., vectors of 0s and 1s (Sasao, 1999, 21). Given all elements ei in the domain in which linguistic expressions of a certain type denote, the Boolean vector associated to a linguistic expression of that type has 1 in position i if ei ∈ S for S the set denoted by the expression, 0 otherwise. An expression a entailing b will have a Boolean vector included in the one of b, in the sense that all positions occupied by 1s in the a vector are also set to 1 in the b vector. Very general expressions (entailed by nearly everything else) will have very dense vectors, whereas very specific expressions will have very sparse vectors. The negation of an expression a will denote a "flipped" version of the a Boolean vector. Vice versa, two expressions with at least partially compatible meanings will have some overlap of the 1s in their vectors; conjunction and disjunction are carried out with the obvious bit-wise operations, etc.
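
To make this encoding concrete, here is a minimal sketch in Python/NumPy; the toy domain and the word denotations are invented for illustration, and entailment, negation, conjunction and disjunction reduce to element-wise operations exactly as described above.

```python
import numpy as np

# Toy domain of 6 entities; each word denotes a subset, encoded as a 0/1 vector.
dog    = np.array([1, 1, 0, 0, 0, 0])
animal = np.array([1, 1, 1, 1, 0, 0])
stone  = np.array([0, 0, 0, 0, 1, 0])

def entails(a, b):
    # a entails b iff every position set to 1 in a is also set to 1 in b (set inclusion)
    return bool(np.all(b[a == 1] == 1))

not_animal = 1 - animal        # negation: "flipped" vector
dog_and_animal = dog & animal  # conjunction: bit-wise intersection
dog_or_stone = dog | stone     # disjunction: bit-wise union

print(entails(dog, animal))    # True: the denotation of dog is included in that of animal
print(entails(animal, dog))    # False
```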

To narrow the gap between the large-scale inductive properties of distributional semantic models and the logical power of Boolean semantics, we create Boolean meaning representations that build on the wealth of information inherent in distributional vectors of words (and sentences). More precisely, we use word (or sentence) pairs labeled as entailing or not entailing to train a mapping from their distributional representations to Boolean vectors, enforcing feature inclusion in Boolean space for the entailing pairs. By focusing on inducing Boolean representations that respect the inclusion relation, our method is radically different from recent supervised approaches that learn an entailment classifier directly on distributional vectors, without enforcing inclusion or other representational constraints. We show, experimentally, that the method is competitive against state-of-the-art techniques in lexical entailment, improving on them in sentential entailment, while learning more effectively from less training data. This is crucial for practical applications that involve bigger and more diverse data than the focused test sets we used for testing. Moreover, extensive qualitative analysis reveals several interesting properties of the Boolean vectors we induce, suggesting that they are representations of greater generality beyond entailment, that might be exploited in further work for other logic-related semantic tasks.

Transactions of the Association for Computational Linguistics, vol. 3, pp. 375–388, 2015. Action Editor: Alexander Koller. Submission batch: 4/2015; Published 6/2015. © 2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 license.

2 Related work

Entailment in distributional semantics Due to the lack of methods to induce the relevant representations on the large scale needed for practical tasks, the Boolean structure defined by the entailment relation is typically not considered in efforts to automatically recognize entailment between words or sentences (Dagan et al., 2009). On the other hand, some researchers relying on distributional representations of meaning have attempted to apply various versions of the notion of feature inclusion to entailment detection. This is based on the intuitive idea (the so-called distributional inclusion hypothesis) that the features (vector dimensions) of a hypernym and a hyponym should be in a superset-subset relation, analogously to what we are trying to achieve in the Boolean space we induce, but directly applied to distributional vectors (Geffet and Dagan, 2005; Kotlerman et al., 2010; Lenci and Benotto, 2012; Weeds et al., 2004). It has been noticed that distributional context inclusion defines a Boolean structure on vectors just as entailment defines a Boolean structure on formal semantic representations (Clarke, 2012). However, the match between context inclusion and entailment is far from perfect.

First, distributional vectors are real-valued and contain far more nuanced information than simply inclusion or exclusion of certain features. Second, and more fundamentally, the information encoded in distributional vectors is simply not of the right kind, since "feature inclusion" for distributional vectors boils down to contextual inclusion, and there is no reason to think that a hypernym should occur in all the contexts in which its hyponyms appear. For example, bark can be a typical context for dog, but we don't expect to find it a significant number of times with mammal even in a very large corpus. In practice, distributional inclusion turns out to be a weak tool for recognizing the entailment relation (Erk, 2009; Santus et al., 2014) because denotational and distributional inclusion are independent properties.

More recently, several authors have explored supervised methods. In particular, Baroni et al. (2012), Roller et al. (2014) and Weeds et al. (2014) show that a Support Vector Machine trained on the distributional vectors of entailing or non-entailing pairs outperforms the distributional inclusion measures. In our experiments, we will use this method as the main comparison point. The similarly supervised approach of Turney and Mohammad (2014) assumes the representational framework of Turney (2012), and we do not attempt to re-implement it here.

Very recently, other properties of distributional vectors, such as entropy (Santus et al., 2014) and topical coherence (Rimell, 2014), have been proposed as entailment cues. Since they are not based on feature inclusion, we see them as complementary, rather than alternative, to our proposal.

Formal and distributional semantic models We try to derive a structured representation inspired by formal semantic theories from data-driven distributional semantic models. Combining the two approaches has proven a hard task. Some systems adopt logic-based representations but use distributional evidence for predicate disambiguation (Lewis and Steedman, 2013) or to weight probabilistic inference rules (Garrette et al., 2013; Beltagy et al., 2013). Other authors propose ways to encode aspects of logic-based representations such as logical connectives and truth values (Grefenstette, 2013) or predicate-argument structure (Clark and Pulman, 2007) in a vector-based framework. These studies are, however, entirely theoretical. Rocktaschel et al. (2015) expand on the first, allowing for some generalization to unseen knowledge, by introducing some degree of fuzziness into the representations of predicates and terms. Still, this work does not attempt to map concepts to a logic-based representation, nor does it try to exploit the wealth of information contained in distributional vectors.

Socher et al. (2013), Bordes et al. (2012) and Jenatton et al. (2012) try to discover unseen facts from a knowledge base, which can be seen as a form of inference based on a restricted predicate logic. To do so, they build vector representations for entities, while relations are represented through classifiers. Only Socher et al. (2013) harness distributional vectors, and just as initialization values. The others, unlike us, do not build on independently-motivated word representations. Moreover, since the representations are learned from entities present in their knowledge base, one cannot infer the properties of unseen concepts.

In the spirit of inducing a variety of logical relations and operators (including entailment), Bowman (2013) applies a softmax classifier to the combined distributional representation of two given statements, which are in turn learned compositionally in a supervised fashion, in order to guess the relation between them. The paper, however, only evaluates the model on a small restricted dataset, and it is unclear whether the method would scale to real-world challenges.

None of the papers with concrete implementations reviewed above tries, like us, to learn a Boolean structure where entailment corresponds to inclusion. A paper that does attempt to exploit a similar idea is Young et al. (2014), which also uses the notion of model from Formal Semantics to recognize entailment based on denotations of words and phrases. However, since the denotations in their approach are ultimately derived from human-generated captions of images, the method does not generalize to concepts that are not exemplified in the training database.

Finally, a number of studies, both theoretical (Baroni et al., 2014a; Coecke et al., 2010) and empirical (Paperno et al., 2014; Polajnar et al., 2014), adapt compositional methods from formal semantics to distributional vectors, in order to derive representations of phrases and sentences. This line of research applies formal operations to distributional representations, whereas we derive formal-semantics-like representations from distributional ones. Below, we apply our method to input sentence vectors constructed with the composition algorithm of Paperno et al. (2014).

3 The Boolean Distributional Semantic Model

We build the Boolean Distributional Semantic Model (BDSM) by mapping real-valued vectors from a distributional semantic model into Boolean-valued vectors, so that feature inclusion in Boolean space corresponds to entailment between words (or sentences). That is, we optimize the mapping function so that, if two words (or sentences) entail each other, then the more specific one will get a Boolean vector included in the Boolean vector of the more general one. This is illustrated in Figure 1.

Our model differs crucially from a neural network with a softmax objective in imposing a strong bias on the hypothesis space that it explores. In contrast to the latter, it only learns the weights corresponding to the mapping, while all other operations in the network (in particular, the inference step) are fixed in advance. The goal of such a bias is to improve learning efficiency and generalization using prior knowledge of the relation that the model must capture.

We will now discuss how the model is formalized in an incremental manner. The goal of the model is to find a function MΘ (with parameters Θ) that maps the distributional representations into the Boolean vector space. To facilitate optimization, we relax the image of this mapping to be the full [0, 1] interval, thus defining MΘ : R^N → [0, 1]^H. This mapping has to respect the following condition as closely as possible: For two given words (or other linguistic expressions) p and q, and their distributional vectors vp and vq, all the active features (i.e., those having value close to 1) of MΘ(vp) must also be active in MΘ(vq) if and only if p ⇒ q.

Figure 1: The BDSM architecture. a) Input distributional space. b) Training of a mapping M where each output dimension Mi can be seen as a (linear) cut in the original distributional space. c) Output representations after mapping. d) Fragment of the Boolean structure with example output representations.

To find such a mapping, we assume training data in the form of a sequence [(pk, qk), yk] (k = 1, ..., m) containing both positive (pk ⇒ qk and yk = 1) and negative pairs (pk ⇏ qk and yk = 0). We learn the mapping by minimizing the difference between the model's entailment predictions (given by a function hΘ) and the training targets, as measured by the MSE:

J(Θ) = (1/2) Σ_{k=1}^{m} (hΘ(pk, qk) − yk)²   (1)

Here, we define MΘ as a sigmoid function applied to a linear transformation: MΘ(x) = g(Wx + b), with g(x) = 1/(1 + e^(−x/t)), where t stands for an extra "temperature" parameter. We represent the W ∈ R^(H×N), b ∈ R^H parameters succinctly by Θ = [W, b].
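
As a concrete reference, here is a minimal NumPy sketch of this mapping; the dimensionalities and random initialization are illustrative placeholders, not the settings tuned in the paper.

```python
import numpy as np

def make_mapping(W, b, t=1.0):
    """Return MΘ : R^N -> [0,1]^H, a sigmoid applied to a linear transformation.

    W is an H x N weight matrix, b an H-dimensional bias, t the temperature.
    """
    def M(x):
        z = W @ x + b
        return 1.0 / (1.0 + np.exp(-z / t))  # soft Boolean features in [0, 1]
    return M

# Illustrative sizes: N = 300 input dimensions (e.g., SVD-reduced count vectors),
# H = 100 Boolean units.
rng = np.random.default_rng(0)
M = make_mapping(W=rng.normal(scale=0.1, size=(100, 300)), b=np.zeros(100))
r = M(rng.normal(size=300))  # soft Boolean vector for one (random) input vector
```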

The calculation of hΘ(p, q) involves a series of steps that can be construed as the architecture of a neural network, schematically represented in Figure 2. Recall that the output value of this function represents the model's prediction of the truth value for pk ⇒ qk. Here is an outline of how it is calculated. For each pair of words (or sentences) (p, q) in the training set, we map them onto their (soft) Boolean correlates (r, s) by applying MΘ to their corresponding distributional vectors. Next, we measure whether features that are active in r are also active in s (analogously to how Boolean implication works), obtaining a soft Boolean vector w. Finally, the output of h can be close to 1 only if all values in w are also close to 1. Thus, we compute the output value of h as the conjunction across all dimensions in w.

Figure 2: Schematic view of the entailment hypothesis function hΘ. Solid links represent calculations that are fixed during learning, while dashed links represent the parameters Θ, which are being learned. The p and q input distributional vectors corresponding to each data point are fixed; r and s are their respective mapped Boolean representations. The w layer is a feature-inclusion detector and h is the final entailment judgment produced by the network.

Concretely, hΘ(p, q) is obtained as follows. The passage from the first to the second layer is computed as rΘ = MΘ(vp) and sΘ = MΘ(vq). Next, we compute whether the features that are active in rΘ are also active in sΘ. Given that we are working in the [0, 1] range, we approximate this operation as wΘi = max(1 − rΘi, sΘi).¹ It is easy to see that if rΘi = 0, then wΘi = 1. Otherwise, sΘi must also be equal to 1 for wΘi to be 1. Finally, we compute hΘ = min_i wΘi. This is a way to compute the conjunction over the whole previous layer, thus checking whether all the features of rΘ are included in those of sΘ.²

Finally, to allow for better generalization, the cost function is extended with two more components. The first one is an L2 regularization term weighted by a parameter λ. The second one is a term that enforces sparsity of the resulting representations based on some desired level ρ.

¹ In practice, we use a differentiable approximation given by max(x, y) ≈ log(e^(Lx) + e^(Ly))/L, where L is a sufficiently large number. We set L = 100, which yields results accurate enough for our purposes.

² Analogously, we use the differentiable approximation given by min(wΘ) = −log(Σ_i e^(−L·wΘi))/L.
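
For reference, the forward computation and the basic training objective can be sketched as follows in NumPy, using the smooth max/min approximations from the footnotes; the regularization and sparsity terms are only indicated schematically, since their exact form is not spelled out above.

```python
import numpy as np

L = 100.0  # sharpness constant for the smooth max/min approximations

def soft_max2(x, y):
    # elementwise max(x, y) ~= log(exp(L*x) + exp(L*y)) / L
    return np.logaddexp(L * x, L * y) / L

def soft_min(w):
    # min_i w_i ~= -log(sum_i exp(-L * w_i)) / L
    return -np.log(np.sum(np.exp(-L * w))) / L

def h(M, vp, vq):
    """Soft truth value of "p entails q" given the mapping M."""
    r, s = M(vp), M(vq)        # soft Boolean vectors for antecedent and consequent
    w = soft_max2(1.0 - r, s)  # per-dimension soft implication r_i -> s_i
    return soft_min(w)         # soft conjunction over all dimensions

def cost(M, pairs, targets, reg_term=0.0, sparsity_term=0.0):
    # MSE between predictions and gold labels, cf. equation (1); reg_term and
    # sparsity_term stand in for the L2 and sparsity components of the full objective.
    preds = np.array([h(M, vp, vq) for vp, vq in pairs])
    return 0.5 * np.sum((preds - np.array(targets)) ** 2) + reg_term + sparsity_term
```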

3.1 Assessing entailment with BDSM

During training, positive pairs p ⇒ q are required to satisfy full feature inclusion in their mapped representations (all the active features of MΘ(vp) must also be in MΘ(vq)). At test time, we relax this condition to grant the model some flexibility. Concretely, entailment is quantified by the BI ("Boolean Inclusion") function, counting the proportion of features in the antecedent that are also present in the consequent after binarizing the outputs:

BI(u, v) = [Σ_i rnd(MΘ(u)_i) · rnd(MΘ(v)_i)] / [Σ_i rnd(MΘ(u)_i)]

where rnd(x) = 1[x > 0.5]. The 0.5 threshold comes from construing each of the features in the output of M as probabilities. Of course, other formulas could be used to quantify entailment through BDSM, but we leave this to further research.

Since BI returns continuous values, we use development data to calculate a threshold e above which an entailment response is returned.
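
A direct implementation of BI and of the thresholded entailment decision might look as follows; the threshold value shown is a placeholder to be tuned on development data, as just described.

```python
import numpy as np

def BI(M, u, v):
    """Proportion of binarized Boolean features of u that are also active in v."""
    bu = (M(u) > 0.5).astype(int)  # rnd(.): binarize soft features at 0.5
    bv = (M(v) > 0.5).astype(int)
    if bu.sum() == 0:
        return 0.0                 # no active antecedent features (edge case not discussed above)
    return float((bu * bv).sum()) / float(bu.sum())

def bdsm_entails(M, u, v, e=0.9):  # e: decision threshold tuned on development data
    return BI(M, u, v) >= e
```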

4 Evaluation setup

4.1 Distributional semantic spaces

Our approach is agnostic to the kind of distributional representation used, since it doesn't modify the input vectors, but builds on top of them. Still, it is interesting to test whether specific kinds of distributional vectors are better suited to act as input to BDSM. For our experiments, we use both the count and predict distributional semantic vectors of Baroni et al. (2014b).³ These vectors were shown by their creators to reach the best average performance (among comparable alternatives) on a variety of semantic relatedness/similarity tasks, such as synonymy detection, concept categorization and analogy solving. If the same vectors turn out to also serve as good inputs for constructing Boolean representations, we are thus getting the best of both worlds: distributional vectors with proven high performance on relatedness/similarity tasks which can be mapped into a Boolean space to tackle logic-related tasks. We also experiment with the pre-trained vectors from TypeDM (Baroni and Lenci, 2010),⁴ which are built by exploiting syntactic information, and should have different qualitative properties from the window-based approaches.

³ http://clic.cimec.unitn.it/composes/semantic-vectors.html

The count vectors of Baroni and colleagues are built from a 2-word-window co-occurrence matrix of 300k lower-cased words extracted from a 2.8 billion token corpus. The matrix is weighted using positive Pointwise Mutual Information (Church and Hanks, 1990). We use the full 300k×300k positive PMI matrix to compute the asymmetric similarity measures discussed in the next section, since the latter are designed for non-negative, sparse, full-rank representations. Due to efficiency constraints, for BDSM and SVM (also presented next), the matrix is reduced to 300 dimensions by Singular Value Decomposition (Schutze, 1997). The experiments of Baroni et al. (2014b) with these very same vectors suggest that SVD is lowering performance somewhat. So we are, if anything, giving an advantage to the simple asymmetric measures.

The predict vectors are built with the word2vec tool (Mikolov et al., 2013) on the same corpus and for the same vocabulary as the count vectors, using the CBOW method. They are constructed by associating 400-dimensional vectors to each word in the vocabulary and optimizing a single-layer neural network that, while traversing the training corpus, tries to predict the word in the center of a 5-word window from the vectors of those surrounding it. The word2vec subsampling parameter (that downweights the impact of frequent words) is set to 1e−5.

Finally, TypeDM vectors were induced from the same corpus by taking into account the dependency links of a word with its sentential collocates. See Baroni and Lenci (2010) for details.

Composition methods For sentence entailment (Section 6), we need vectors for sentences, rather than words. We derive them from the count vectors compositionally in two different ways.⁵ First, we use the additive model (add), under which we sum the vectors of the words they contain to obtain sentence representations (Mitchell and Lapata, 2010). This approach, however, does not take into account word order, which is of obvious relevance to determining entailment between phrases. For example, a dog chases a cat does not entail a cat chases a dog, whereas each sentence entails itself. Therefore, we also used sentence vectors derived with the linguistically-motivated "practical lexical function" model (plf), that takes syntactic structure and word order into account (Paperno et al., 2014). In short, words acting as argument-taking functions (such as verbs) are not only associated to vectors, but also to one matrix for each argument they take (e.g., each transitive verb comes with a subject and an object matrix). Vector representations of arguments are recursively multiplied by function matrices, following the syntactic structure of a sentence. The final sentence representation is obtained by summing all the resulting vectors. We used pre-trained vector and matrix representations provided by Paperno and colleagues. Their setup is very comparable to the one of our count vectors: same source corpus, similar window size (3-word window), positive PMI, and SVD reduction to 300 dimensions. The only notable differences are a vocabulary cut-off to the top 30K most frequent words in the corpus, and the use of content words only as windows.

⁴ http://clic.cimec.unitn.it/dm

⁵ A reviewer notes that composition rules could also be induced directly on the entailment task. This is an interesting possibility, but note that it would probably require a larger training set than we have available. Moreover, from a theoretical perspective, we are interested in testing general methods of composition that are also good for other tasks (e.g., modeling sentence similarity), rather than developing ad-hoc composition rules specifically for entailment.
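
For the additive model, a sentence vector is just the sum of its word vectors; the sketch below assumes a dictionary mapping words to distributional vectors and simply skips out-of-vocabulary words. The plf model is not sketched here, since it additionally requires the pre-trained argument matrices.

```python
import numpy as np

def additive_sentence_vector(sentence, lexicon, dim=300):
    """Compose a sentence vector by summing the vectors of its words (add model).

    lexicon: dict mapping lower-cased words to distributional vectors of size dim;
    out-of-vocabulary words are skipped in this sketch.
    """
    vecs = [lexicon[w] for w in sentence.lower().split() if w in lexicon]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)
```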

4.2 Alternative entailment measures

As reviewed in Section 2, the literature on entailment with distributional methods has been dominated by the idea of feature inclusion. We thus compare BDSM to a variety of state-of-the-art asymmetric similarity measures based on the distributional inclusion hypothesis (the dimensions of hyponym/antecedent vectors are included in those of their hypernyms/consequents). We consider the measures described in Lenci and Benotto (2012) (clarkeDE, weedsPrec, cosWeeds, and invCL), as well as balAPinc, which was shown to achieve optimal performance by Kotlerman et al. (2010). All these measures provide a score that is higher when a significant part of the candidate antecedent features (= dimensions) are included in those of the consequent. The measures are only meaningful when computed on a non-negative sparse space. Therefore, we evaluate them using the full count space. As an example, weedsPrec is computed as follows:

weedsPrec(u, v) = [Σ_i 1[v_i > 0] · u_i] / [Σ_i u_i]

where u is the distributional vector of the antecedent, v that of the consequent.⁶
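
In code, weedsPrec amounts to a couple of lines over the non-negative count vectors; the implementation below is a straightforward transcription of the formula.

```python
import numpy as np

def weeds_prec(u, v):
    """Share of the antecedent's feature mass falling on dimensions active in the consequent.

    u: non-negative vector of the candidate antecedent (e.g., the hyponym),
    v: non-negative vector of the candidate consequent (e.g., the hypernym).
    """
    total = u.sum()
    if total == 0:
        return 0.0
    return float(u[v > 0].sum()) / float(total)
```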

Finally, we implement a full-fledged supervised machine learning approach directly operating on distributional representations. Following the recent literature reviewed in Section 2 above, we train a Support Vector Machine (SVM) (Cristianini and Shawe-Taylor, 2000) on the concatenated distributional vectors of the training pairs, and judge the presence of entailment for a test pair based on the same concatenated representation (the results of Weeds et al. (2014) and Roller et al. (2014) suggest that concatenation is the most reliable way to construct SVM input representations that take both the antecedent and the consequent into account).
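
A minimal version of this baseline using scikit-learn could look as follows; the kernel and hyperparameter values shown are placeholders, since the actual settings are tuned as described in Section 4.4.

```python
import numpy as np
from sklearn.svm import SVC

def train_entailment_svm(pairs, labels, kernel="poly", degree=2, C=1.0):
    """Train an SVM on concatenated antecedent/consequent distributional vectors."""
    X = np.array([np.concatenate([vp, vq]) for vp, vq in pairs])
    clf = SVC(kernel=kernel, degree=degree, C=C)
    return clf.fit(X, np.array(labels))

def svm_entails(clf, vp, vq):
    """Predict entailment for a test pair from the same concatenated representation."""
    return bool(clf.predict(np.concatenate([vp, vq]).reshape(1, -1))[0])
```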

4.3 Data sets

Lexical entailment We test the models on benchmarks derived from two existing resources. We used the Lexical Entailment Data Set (LEDS) from Baroni et al. (2012) that contains both entailing (obtained by extracting hyponym-hypernym links from WordNet) and non-entailing pairs of words (constructed by reversing a third of the pairs and randomly shuffling the rest). We edited this resource by removing dubious data from the entailing pairs (e.g., logo/signal, mankind/mammal, geek/performer) and adding more negative cases (non-entailing pairs), obtained by shuffling words in the positive examples. We derived two balanced subsets: a development set (LEDS-dev) with 236 pairs in each class and a core set with 911 pairs in each class (LEDS-core), such that there is no lexical overlap between the positive classes of each set, and negative class overlap is minimized. Since a fair amount of negative cases were obtained by randomly shuffling words from the positive examples, leading to many unrelated couples, just pair similarity might be a very strong baseline here. We thus explore a more challenging setup, LEDS-dir, where we replace the negative examples of LEDS-core by positive pairs in reverse order, thus focusing on entailment direction.

⁶ BI is equivalent to weedsPrec in Boolean space.

             Positive               Negative
LEDS         elephant → animal      ape ↛ book
LEDS-dir                            animal ↛ elephant
BLESS-coord  elephant → herbivore   elephant ↛ hippo
BLESS-mero                          elephant ↛ trunk

Table 1: Lexical entailment examples.

We derive two more benchmarks from BLESS (Baroni and Lenci, 2011). BLESS lists pairs of concepts linked by one of 5 possible relations: coordinates, hypernymy, meronymy, attributes and events. We employed this resource to construct BLESS-coord, which (unlike LEDS, where entailing pairs have to be distinguished from pairs of words that mostly bear no relation) is composed of 1,236 super-subordinate pairs (which we treat as positive examples) to be distinguished from 3,526 coordinate pairs. BLESS-mero has the same positive examples, but 2,943 holo-meronym pairs as negatives. Examples of all lexical benchmarks are given in Table 1.

Sentence entailment To evaluate the models on recognizing entailment between sentences, we use a benchmark derived from SICK (Marelli et al., 2014b). The original data set contains pairs of sentences in entailment, contradiction and neutral relations. We focus on recognizing entailment, treating both contradictory and neutral pairs as negative examples (as in the classic RTE shared tasks up to 2008).⁷ Data are divided into a development set (SICK-dev) with 500 sentence pairs (144 positive, 356 negative), a training set (SICK-train) with 4,500 pairs (1,299 positive, 3,201 negative) and a test set (SICK-test) with 4,927 pairs (1,414 positive, 3,513 negative). Examples from SICK are given in Table 2.

⁷ This prevents a direct comparison with the results of the SICK shared task at SemEval (Marelli et al., 2014a). However, all competitive SemEval systems were highly engineered for the task, and made extensive use of a variety of pre-processing tools, features and external resources (cf. Table 8 of Marelli et al. (2014a)), so that a fair comparison with our simpler methods would not be possible in any case.

Positive: A man is slowly trekking in the woods → The man is hiking in the woods
Negative: A group of scouts are camping in the grass ↛ A group of scouts are hiking through the grass

Table 2: SICK sentence entailment examples.

4.4 Training regime

We tune once and for all the hyperparameters of the models by maximizing accuracy on the small LEDS-dev set. For SVM, we tune the kernel type, picking a 2nd degree polynomial kernel for the count and TypeDM spaces, and a linear one for the predict space (alternatives: RBF and 1st, 2nd or 3rd degree polynomials). The choice for the count space is consistent with Turney and Mohammad (2014). For BDSM, we tune H (dimensionality of Boolean vectors), setting it to 100 for count, 1,000 for predict and 500 for TypeDM (alternatives: 10, 100, 500, 1,000 and 1,500), and the sparsity parameter ρ, picking 0.5 for count, 0.75 for predict, and 0.25 for TypeDM (alternatives: 0.01, 0.05, 0.1, 0.25, 0.5, 0.75). For BDSM and the asymmetric similarity measures, we also tune the e threshold above which a pair is treated as entailing for each dataset.

The γ (RBF kernel radius) and C (margin slackness) parameters of SVM and the λ, β and t parameters of BDSM (see Section 3) are set by maximizing accuracy on LEDS-dev for all lexical entailment experiments. For sentence entailment, we tune the same parameters on SICK-dev. In this case, given the imbalance between positive and negative pairs, we maximize weighted accuracy (that is, we count each true negative as (|pos| + |neg|)/2|neg|, and each true positive as (|pos| + |neg|)/2|pos|, where |class| is the cardinality of the relevant class in the tuning data).
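
The weighted accuracy used for tuning can be written out directly from the per-class weights above, for instance as follows.

```python
def weighted_accuracy(gold, pred):
    """Accuracy in which each correct decision is weighted inversely to its class size.

    gold, pred: sequences of 0/1 labels (1 = entailment).
    """
    pos = sum(1 for g in gold if g == 1)
    neg = sum(1 for g in gold if g == 0)
    w_pos = (pos + neg) / (2.0 * pos)  # weight of each true positive
    w_neg = (pos + neg) / (2.0 * neg)  # weight of each true negative
    score = sum(w_pos if g == 1 else w_neg
                for g, p in zip(gold, pred) if g == p)
    return score / (pos + neg)
```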

Finally, for lexical entailment, we train the SVM and BDSM weights by maximizing accuracy on LEDS-core. For LEDS-core and LEDS-dir evaluation, we use 10-fold validation. When evaluating on the BLESS benchmarks, we train on full LEDS-core, excluding any pairs also present in BLESS. For sentential entailment, the models are trained by maximizing weighted accuracy on SICK-train.

model            LEDS          BLESS
                 core   dir    coord   mero
count
 clarkeDE         77     63     27      36
 weedsPrec        79     75     27      33
 cosWeeds         79     63     26      35
 invCL            77     63     27      36
 balAPinc         79     66     26      36
 SVM (count)      84     90     55      57
 BDSM (count)     83     87     53      55
predict
 SVM (predict)    71     85     70      55
 BDSM (predict)   80     79     76      68
TypeDM
 SVM (TypeDM)     78     83     56      60
 BDSM (TypeDM)    83     71     31      59

Table 3: Percentage accuracy (LEDS) and F1 (BLESS) on the lexical entailment benchmarks.

5 Lexical entailment

Table 3 reports lexical entailment results (percentage accuracies for the LEDS benchmarks, F1 scores for the unbalanced BLESS sets). We observe, first of all, that SVM and BDSM clearly outperform the asymmetric similarity measures in all tasks. In only one case does the lowest performance attained by a supervised model drop below the level of the best asymmetric measure performance (BDSM using TypeDM on LEDS-dir).⁸ The performance of the unsupervised measures, which rely most directly on the original distributional space, confirms that the latter is more suited to capture similarity than entailment. This is shown by the drop in performance from LEDS-core (where many negative examples are semantically unrelated) to LEDS-dir (where items in positive and negative pairs are equally similar), as well as by the increase from BLESS-coord to BLESS-mero (as coordinate negative examples are more tightly related than holo-meronym pairs).

⁸ We also inspected ROC curves for BDSM (count) and the asymmetric measures, to check that the better performance of BDSM was not due to a brittle e (entailment threshold). The curves confirmed that, for all tasks, BDSM clearly dominates all asymmetric measures across the whole e range.

In the count input space, SVM and BDSM perform similarly across all 4 tasks, with SVM having a small edge. In the next sections, we will thus focus on count vectors, for the fairest comparison between the two models. BDSM reaches the most consistent results with predict vectors, where it performs particularly well on BLESS, and not dramatically worse than with count vectors on LEDS. On the other hand, predict vectors have a negative overall impact on SVM in 3 out of 4 tasks. Concerning the interaction of input representations and tasks, we observe that count vectors work best with LEDS, whereas for BLESS predict vectors are the best choice, regardless of the supervised method employed.

Confirming the results of Baroni et al. (2014b), the TypeDM vectors are not a particularly good choice for either model. BDSM is specifically negatively affected by this choice in the LEDS-dir and BLESS-coord tasks. The tight taxonomic information captured by a dependency-based model such as TypeDM might actually be detrimental in tasks that require distinguishing between closely related forms, such as coordinates and hypernyms in BLESS-coord.

In terms of relative performance of the supervised entailment models, if one was to weigh each task equally, the best average performance would be reached by BDSM trained on predict vectors, with an average score of 75.75, followed by SVM on count vectors, with an average score of 71.5. We assess the significance of the difference between supervised models trained on the input vectors that give the best performance for each task by means of paired t-tests on LEDS and McNemar tests on BLESS. SVM with count vectors is better than BDSM on LEDS-core (not significant) and LEDS-dir (p<0.05). On the other hand, BDSM with predict vectors is better than SVM on BLESS-coord (p<0.001) and BLESS-mero (p<0.001). We conclude that, overall, the two models perform similarly on lexical entailment tasks.

5.1 Learning efficiency

We just observed that SVM and BDSM have similar lexical entailment performance, especially in count space. However, the two models are radically different in their structure. SVM fits a 2nd order polynomial separating entailing from non-entailing pairs in a space formed by the concatenation of their distributional representations. BDSM, on the other hand, finds a linear transformation into a space where features of the antecedent are included in those of the consequent. We conjecture that the latter has a much larger bias, imposed by this strict subsective constraint.⁹ We expect this bias to help learning, by limiting the search space and allowing the algorithm to harness training data in a more efficient way. Thus, BDSM should be better at learning with less data, where SVM will be prone to overfitting. To test this claim, we measured the cross-validated LEDS-core accuracy obtained from using vectors in count space when reducing the training items in steps of 182 pairs. The results can be seen in Figure 3. As expected, BDSM scales down much more gracefully, with accuracy well above 70% with as little as 182 training pairs.

⁹ Mitchell (1980) defines bias as any basis for choosing one generalization over another, other than strict consistency with the observed training instances.

Figure 3: Average LEDS-core accuracy using count vectors as a function of training set size.

6 Sentence entailment

Having shown in the previous experiments that the asymmetric measures are not competitive, we focus here on SVM and BDSM. As mentioned above in Section 5, we use count vectors for a fair comparison between the two models, based on their similar performance on the lexical benchmarks.

Recall that for sentence entailment we use the same hyperparameters as for the lexical tasks, that the model constants were tuned on SICK-dev, and the model weights on SICK-train (details in Section 4.4 above). Sentence representations are derived either with the plf approach, that returns sentence vectors built according to syntactic structure, or the additive (add) method, where constituent word vectors are simply summed to derive a sentence vector (see Section 4.1 above).

We compare SVM and BDSM to the Sycophantic baseline classifying all pairs as entailing and to a Majority baseline classifying everything as non-entailing. The Word Overlap method (WO) calculates the number of words in common between two sentences and classifies them as entailing whenever the ratio is above a certain threshold (calibrated on SICK-train).
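
The WO baseline reduces to a few lines; since the exact normalization of the overlap ratio is not specified above, the sketch below uses overlap over the union of word types as one plausible choice, and the threshold shown is a placeholder to be calibrated on SICK-train.

```python
def word_overlap(s1, s2):
    """Ratio of shared word types to all word types of the two sentences (one possible normalization)."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

def wo_entails(s1, s2, threshold=0.5):  # threshold calibrated on training data
    return word_overlap(s1, s2) >= threshold
```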

Results are given in Table 4. Because of class imbalance, F1 is more informative than accuracy (the Majority baseline reaches the best accuracy with 0 precision and recall), so we focus on the former for analysis. We observe first that sentence vectors obtained with the additive model consistently outperform the more sophisticated plf approach. This confirms the results of Blacoe and Lapata (2012) on the effectiveness of simple composition methods. We leave it to further studies to determine to what extent this can be attributed to specific characteristics of SICK that make word order information redundant, and to what extent it indicates that plf is not exploiting syntactic information adequately (note that Paperno et al. (2014) report minimal performance differences between additive and plf for their msrvid benchmark, which is the closest to SICK).

Coming now to the crucial comparison of BDSM against SVM (focusing on the results obtained with the additive method), BDSM emerges as the best classifier when evaluated alone, improving over SVM, although the difference is not significant. Since the Word Overlap method is performing quite well (better than SVM) and the surface information used by WO should be complementary to the semantic cues exploited by the vector-based models, we built combined classifiers by training SVMs (on SICK-dev) with linear kernels and WO value plus each method's score (BI for BDSM and distance to the margin for SVM) as features. The combinations improve performance for both models and BDSM+WO attains the best overall F1 score, being statistically superior to both SVM+WO (p<0.001) and WO alone (p<0.001) (statistical significance values obtained through McNemar tests).

model             P    R    F1   A
Sycophantic       29   100  45   29
Majority          0    0    0    71
WO                40   86   55   60
SVM (add)         47   54   51   70
BDSM (add)        48   74   58   69
SVM (plf)         39   45   42   64
BDSM (plf)        44   71   55   66
SVM (add) + WO    44   82   58   65
BDSM (add) + WO   48   80   60   69
SVM (plf) + WO    42   76   54   63
BDSM (plf) + WO   42   77   54   63

Table 4: SICK results (percentages): precision (P), recall (R), F1 and accuracy (A).

We repeated the training data reduction experiment from Section 5.1 by measuring cross-validated F1 scores for SICK (with additive composition). We confirmed that BDSM is robust to decreasing the amount of training data, maintaining an F1 score of 56 with only 942 training items, whereas, with the same amount of training data, SVM drops to an F1 of 42.

7 Understanding Boolean vectors

BDSM produces representations that are meant to respect inclusion and be interpretable. We turn now to an extended analysis of the learned representations (focusing on those derived from count vectors), showing first how BDSM activation correlates with generality and abstractness, and then how similarity in BDSM space points in the direction of an extensional interpretation of Boolean units.

7.1 Boolean dimensions and generality

The BDSM layer is trained to assign more activation to a hypernym than its hyponyms (the hypernym units should include the hyponyms' ones), so the more general (that is, higher on the hypernymy scale) a concept is, the higher the proportion of activated units in its BDSM vector. The words that activate all nodes should be implied by all other terms. Indeed, very general words such as thing(s), everything, and anything have Boolean vectors with all 1s. But there are also other words (a total of 768) mapping to the top element of the Boolean algebra (a vector of all 1s), including reduction, excluded, results, benefit, global, extent, achieve. The collapsing of these latter terms must be due to a combination of two factors: low dimensionality of Boolean space,¹⁰ and the fact that the model was trained on a limited vocabulary, mostly consisting of concrete nouns, so there was simply no training evidence to characterize abstract words such as benefit in a more nuanced way.

Still, we predict that the proportion of Boolean dimensions that a word activates (i.e., dimensions with value 1) should correspond, as a trend, to its degree of semantic generality. More general concepts also tend to be more abstract, so we also expect a correlation between Boolean activation and the word rating on the concrete-abstract scale.¹¹ To evaluate these claims quantitatively, we rely on WordNet (Fellbaum, 1998), which provides an is-a hierarchy of word senses ('synsets') that can be used to measure semantic generality. We compute the average length of a path from the root of the hierarchy to the WordNet synsets of a word (shortest is most general, so that a higher depth score corresponds to a more specific concept). We further use the Ghent database (Brysbaert et al., 2013), that contains 40K English words rated on a 1-5 scale from least to most concrete (as expected, depth and concreteness are correlated, ρ = .54).
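
Here is a sketch of how such a generality score and the corresponding correlation can be computed with the NLTK WordNet interface and SciPy; the restriction to noun synsets, the use of min_depth as the path length, and the placeholder word/activation lists are assumptions made for illustration.

```python
from nltk.corpus import wordnet as wn
from scipy.stats import spearmanr

def avg_depth(word):
    """Average root-to-synset depth over a word's noun synsets (higher = more specific)."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return None
    return sum(s.min_depth() for s in synsets) / len(synsets)

# Correlating Boolean activation with depth (placeholders for the actual data):
# words = [...]          # vocabulary covered by WordNet and the model
# activations = [...]    # proportion of 1s in each word's Boolean vector
# depths = [avg_depth(w) for w in words]
# rho, p = spearmanr(activations, depths)
```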

Boolean vector activation significantly correlates with both variables (ρ = −.18 with depth, ρ = −.30 with concreteness; these and all correlations below significant at p < 0.005). Moreover, the BDSM activations are much higher than those achieved by distributional vector L1 norm (which, surprisingly, has positive correlations: ρ = .13 with depth, ρ = .21 with concreteness) and word frequency (ρ = −.2 with depth, ρ = .4 with concreteness).

We visualize how Boolean activation correlates with generality in Figure 4. We plot the two example words car and newspaper together with their 30 nearest nominal neighbours in distributional space,¹² sorting them from most to least activated. More general words do indeed cluster towards the top, while more specific words are pushed to the bottom. Interestingly, while vehicle and organization were present in the training data, that was not the case for media or press. Moreover, the training data did not contain any specific type of car (like volvo or suv) or newspaper (tribune or tabloid).

Figure 4: Boolean activation (percentage of positive dimensions) of the 30 nearest distributional neighbours of car and newspaper.

¹⁰ With count input representations, our tuning favoured relatively dense 100-dimensional vectors (see Section 4.4).

¹¹ Automatically determining the degree of abstractness of concepts is a lively topic of research (Kiela et al., 2014; Turney et al., 2011).

7.2 Similarity in Boolean space

From the model-theoretical point of view, word-to-BDSM mapping provides an interpretation function in the logical sense, mapping linguistic expressions to elements of the model domain (Boolean dimensions). If distributional vectors relate to concepts in a (hyper)intensional construal (Erk, 2013), Boolean vectors could encode their (possible) extensions along the lines suggested by Grefenstette (2013), with vector dimensions corresponding to entities in the domain of discourse.¹³ Under the extensional interpretation, the Boolean vector of a word encodes the set of objects in the word extension. But of course, given that our BDSM implementation operates with only 100 dimensions, one cannot expect such an extensional interpretation of the model to be realistic. Still, the extensional interpretation of the Boolean model, while being highly idealized, makes some testable predictions. Under this view, synonyms should have identical Boolean vectors, antonyms should have disjoint vectors. Compatible terms (including hyponym-hypernym pairs) should overlap in their 1s. Cohyponyms, while high on the relatedness scale, should have low "extensional similarity"; singer and drummer are very related notions but the intersection of their extensions is small, and that between alligator and crocodile is empty (in real life, no entity is simultaneously a crocodile and an alligator).

As expected, the straightforward interpretation of dimensions as individuals in a possible world close to ours is contradicted by many counterexamples in the present BDSM implementation. For example, the nouns man and woman have a considerable overlap in activated Boolean dimensions, while in any plausible world hermaphrodite humans are rare. Still, compared to distributional space, BDSM goes in the direction of an extensional model as discussed above. To quantify this difference, we compared the similarity scores (cosines) produced by the two models. Specifically, we first created a list of pairs of semantically related words using the following procedure. We took the 10K most frequent words paired with their 10 closest neighbors in the count distributional space. We then filtered them to be of "medium frequency" (both words must lie within the 60K-90K frequency range in our 2.8B token corpus). One of the authors annotated the resulting 624 pairs as belonging to one of the following types: co-hyponyms (137, e.g., AIDS vs. diabetes); derivationally related words (10, e.g., depend vs. dependent); hypernym-hyponym pairs (37, e.g., arena vs. theater); personal names (97, e.g., Adams vs. Harris); synonyms (including contextual ones; 49, e.g., abilities vs. skill); or "other" (294, e.g., actress vs. starring), if the pair does not fit any of the above types (some relations of interest, such as antonymy, were excluded from further analysis as they were instantiated by very few pairs). Since cosines have different distributions in distributional (DS) and Boolean space (BS), we z-normalized them before comparing those of pairs of the same type across the two spaces.

¹² Due to tagging errors, the neighbors also include some verbs like parked or adjectives like weekly.

¹³ In fact everything we say here applies equally well to certain intensional interpretations of Boolean vectors. For example, the atoms of the Boolean algebra could correspond not to entities in the actual world but to classes of individuals across possible worlds. Alternatively, one can think of the atoms as "typical cases" rather than actual individuals, or even as typical properties of the relevant individuals.

Under the extensional interpretation, we expect co-hyponyms to go apart after Boolean mapping, as they should in general have little extensional overlap. Indeed they have significantly lower cosines in BS than DS (p < 0.001; paired t-test). As expected under the extensional interpretation, personal names are very significantly less similar in BS than DS (p < 0.001). Synonyms and hypo/hypernyms have significant denotational overlap, and they move closer to each other after mapping. Specifically, synonyms significantly gain in similarity between BS and DS (p<0.01), whereas hyponym-hypernym pairs, while not differing significantly in average similarity across the spaces, change from being weakly significantly lower in cosine than all other pairs in DS (p < 0.05) to being indistinguishable from the other pairs in BS. Derivationally related words gain in similarity (p<0.01), collapsing to almost identical vectors after Boolean mapping. This deserves a special comment. Although words in these pairs typically belong to different parts of speech and are not synonyms in the usual sense, one could interpret them as denotational synonyms in the sense that they get reference in the same situations. Taking two word pairs from our data as examples, the existence of experiments entails the presence of something experimental, anything Islamic entails the presence of Islam in the situation, etc. If so, the fact that derivationally related words collapse under Boolean mapping makes perfect sense from the viewpoint of denotational overlap.

8 Conclusion

We introduced BDSM, a method that extracts representations encoding semantic properties relevant for making inferences from distributional semantic vectors. When applied to the task of detecting entailment between words or sentences, BDSM is competitive against a state-of-the-art SVM classifier, and needs less learning data to generalize. In contrast to SVM, BDSM is transparent: we are able not only to classify a pair of words (or sentences) with respect to entailment, but we also produce a compact Boolean vector for each word, that can be used alone for recognizing its entailment relations. Besides the analogy with the structures postulated in formal semantics, this can be important for practical applications that involve entailment recognition, where Boolean vectors can reduce memory and computing power requirements.

The Boolean vectors also allow for a certain degree of interpretability, with the number of active dimensions correlating with semantic generality and abstractness. Qualitative analysis suggests that Boolean mapping moves the semantic space from one organized around word relatedness towards a different criterion, where vectors of two words are closer to each other whenever their denotations have greater overlap. This is, however, just a tendency. Ideally, the overlap between dimensions of two vectors should be a measure of compatibility of concepts. In future research, we would like to explore to what extent one can reach this ideal, explicitly teaching the network to also capture other types of relations (e.g., no overlap between cohyponym representations), and using alternative learning methods.

We also want to look within the same framework at other phenomena, such as negation and conjunction, that have an elegant treatment in formal semantics but are currently largely outside the scope of distributional approaches.

Acknowledgements

We thank Yoav Goldberg, Omer Levy, Ivan Titov, Tim Rocktaschel, the members of the COMPOSES team (especially Angeliki Lazaridou) and the FLOSS reading group. We also thank Alexander Koller and the anonymous reviewers for their insightful comments. We acknowledge ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES).

References

Marco Baroni and Alessandro Lenci. 2010. Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.



Marco Baroni and Alessandro Lenci. 2011. How weBLESSed distributional semantic evaluation. In Pro-ceedings of the EMNLP GEMS Workshop, pages 1–10,Edinburgh, UK.

Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, andChung-Chieh Shan. 2012. Entailment above the wordlevel in distributional semantics. In Proceedings ofEACL, pages 23–32, Avignon, France.

Marco Baroni, Raffaella Bernardi, and Roberto Zampar-elli. 2014a. Frege in space: A program for compo-sitional distributional semantics. Linguistic Issues inLanguage Technology, 9(6):5–110.

Marco Baroni, Georgiana Dinu, and German Kruszewski.2014b. Don’t count, predict! a systematic compari-son of context-counting vs. context-predicting seman-tic vectors. In Proceedings of ACL, pages 238–247,Baltimore, MD.

Islam Beltagy, Cuong Chau, Gemma Boleda, Dan Gar-rette, Katrin Erk, and Raymond Mooney. 2013. Mon-tague meets Markov: Deep semantics with probabilis-tic logical form. In Proceedings of *SEM, pages 11–21, Atlanta, GA.

William Blacoe and Mirella Lapata. 2012. A comparisonof vector-based representations for semantic composi-tion. In Proceedings of EMNLP, pages 546–556, JejuIsland, Korea.

Antoine Bordes, Xavier Glorot, Jason Weston, andYoshua Bengio. 2012. Joint learning of words andmeaning representations for open-text semantic pars-ing. In Proceedings of AISTATS, pages 127–135, LaPalma, Canary Islands.

Samuel R. Bowman. 2013. Can recursive neuraltensor networks learn logical reasoning? CoRR,abs/1312.6192.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2013. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, pages 1–8.

Gennaro Chierchia and Sally McConnell-Ginet. 2000. Meaning and Grammar: An Introduction to Semantics. MIT Press, Cambridge, MA.

Kenneth Church and Peter Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Stephen Clark and Stephen Pulman. 2007. Combining symbolic and distributional models of meaning. In Proceedings of the First Symposium on Quantum Interaction, pages 52–55, Stanford, CA.

Daoud Clarke. 2012. A context-theoretic framework for compositionality in distributional semantics. Computational Linguistics, 38(1):41–71.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2010. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis, 36:345–384.

Nello Cristianini and John Shawe-Taylor. 2000. An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge, UK.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2009. Recognizing textual entailment: rationale, evaluation and approaches. Natural Language Engineering, 15:459–476.

Katrin Erk. 2009. Supporting inferences in semantic space: Representing words as regions. In Proceedings of IWCS, pages 104–115, Tilburg, Netherlands.

Katrin Erk. 2013. Towards a semantics for distributional representations. In Proceedings of IWCS, pages 95–106, Potsdam, Germany.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Dan Garrette, Katrin Erk, and Ray Mooney. 2013. A formal approach to linking logical form and vector-space lexical semantics. In H. Bunt, J. Bos, and S. Pulman, editors, Computing Meaning, Vol. 4, pages 27–48. Springer, Berlin.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of ACL, pages 107–114, Ann Arbor, MI.

Edward Grefenstette. 2013. Towards a formal distributional semantics: Simulating logical calculi with tensors. In Proceedings of *SEM, pages 1–10.

Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, and Guillaume Obozinski. 2012. A latent factor model for highly multi-relational data. In Proceedings of NIPS, pages 3176–3184, Lake Tahoe, NV.

Douwe Kiela, Felix Hill, Anna Korhonen, and Stephen Clark. 2014. Improving multi-modal representations using image dispersion: Why less is sometimes more. In Proceedings of ACL, pages 835–841, Baltimore, MD.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.

Thomas Landauer and Susan Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In Proceedings of *SEM, pages 75–79, Montreal, Canada.

Mike Lewis and Mark Steedman. 2013. Combined distributional and logical semantics. Transactions of the Association for Computational Linguistics, 1:179–192.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014a. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of SemEval, pages 1–8, Dublin, Ireland.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014b. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC, pages 216–223, Reykjavik, Iceland.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. http://arxiv.org/abs/1301.3781/.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Tom M. Mitchell. 1980. The need for biases in learning generalizations. Technical report, New Brunswick, NJ.

Denis Paperno, Nghia The Pham, and Marco Baroni. 2014. A practical and linguistically-motivated approach to compositional distributional semantics. In Proceedings of ACL, pages 90–99, Baltimore, MD.

Tamara Polajnar, Luana Fagarasan, and Stephen Clark. 2014. Reducing dimensions of tensors in type-driven distributional semantics. In Proceedings of EMNLP, pages 1036–1046, Doha, Qatar.

Laura Rimell. 2014. Distributional lexical entailment by topic coherence. In Proceedings of EACL, pages 511–519, Gothenburg, Sweden.

Tim Rocktaschel, Sameer Singh, and Sebastian Riedel. 2015. Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of HLT-NAACL, pages 1119–1129, Denver, CO.

Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING, pages 1025–1036, Dublin, Ireland.

Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In Proceedings of EACL, pages 38–42, Gothenburg, Sweden.

T. Sasao. 1999. Switching Theory for Logic Synthesis. Springer, London.

Hinrich Schütze. 1997. Ambiguity Resolution in Natural Language Learning. CSLI, Stanford, CA.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of NIPS, pages 926–934, Lake Tahoe, NV.

Peter Turney and Saif Mohammad. 2014. Experiments with three approaches to recognizing lexical entailment. Natural Language Engineering. In press.

Peter Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Peter Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of EMNLP, pages 680–690, Edinburgh, UK.

Peter Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533–585.

Eva Maria Vecchi, Marco Baroni, and Roberto Zamparelli. 2011. (Linear) maps of the impossible: Capturing semantic anomalies in distributional space. In Proceedings of the ACL Workshop on Distributional Semantics and Compositionality, pages 1–9, Portland, OR.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of COLING, pages 1015–1021, Geneva, Switzerland.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING, pages 2249–2259, Dublin, Ireland.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
