
Neural Compositional Denotational Semantics for Question Answering

Nitish Gupta∗
University of Pennsylvania
Philadelphia, PA
[email protected]

Mike Lewis
Facebook AI Research
Seattle, WA
[email protected]

Abstract

Answering compositional questions requiring multi-step reasoning is challenging. We introduce an end-to-end differentiable model for interpreting questions about a knowledge graph (KG), which is inspired by formal approaches to semantics. Each span of text is represented by a denotation in a KG and a vector that captures ungrounded aspects of meaning. Learned composition modules recursively combine constituent spans, culminating in a grounding for the complete sentence which answers the question. For example, to interpret “not green”, the model represents “green” as a set of KG entities and “not” as a trainable ungrounded vector—and then uses this vector to parameterize a composition function that performs a complement operation. For each sentence, we build a parse chart subsuming all possible parses, allowing the model to jointly learn both the composition operators and output structure by gradient descent from end-task supervision. The model learns a variety of challenging semantic operators, such as quantifiers, disjunctions and composed relations, and infers latent syntactic structure. It also generalizes well to longer questions than seen in its training data, in contrast to RNN encoders, their tree-based variants, and semantic parsing baselines.

1 Introduction

Compositionality is a mechanism by which the meanings of complex expressions are systematically determined from the meanings of their parts, and has been widely assumed in the study of both artificial and natural languages (Montague, 1973) as a means for allowing speakers to generalize to understanding an infinite number of sentences. Popular neural network approaches to question answering use a restricted form of compositionality, typically encoding a sentence word-by-word, and then

∗Work done while interning with Facebook AI Research.

executing the complete sentence encoding against a knowledge source (Perez et al., 2017). Such models can fail to generalize from training data in surprising ways. Inspired by linguistic theories of compositional semantics, we instead build a latent tree of interpretable expressions over a sentence, recursively combining constituents using a small set of neural modules. Our model outperforms RNN encoders, particularly when test questions are longer than training questions.

Our approach resembles Montague semantics, in which a tree of interpretable expressions is built over the sentence, with nodes combined by a small set of composition functions. However, both the structure of the sentence and the composition functions are learned by end-to-end gradient descent. To achieve this, we define the parametric form of a small set of composition modules, and then build a parse chart over each sentence subsuming all possible trees. Each node in the chart represents a span of text with a distribution over groundings (in terms of booleans and knowledge base nodes and edges), as well as a vector representing aspects of the meaning that have not yet been grounded. The representation for a node is built by taking a weighted sum over different ways of building the node (similar to Maillard et al. (2017)). The trees induced by our model are linguistically plausible, in contrast to prior work on structure learning from semantic objectives (Williams et al., 2018).

Typical neural approaches to grounded question answering first encode a question with a recurrent neural network (RNN), and then evaluate the encoding against an encoding of the knowledge source (for example, a knowledge graph or image) (Santoro et al., 2017). In contrast to classical approaches to compositionality, constituents of complex expressions are not given explicit interpretations in isolation. For example, in Which cubes are large or green?, an RNN encoder will not explicitly


[Figure 1 image omitted: a parse tree for the question "what is left of a red thing or not cylindrical", with nodes labelled by semantic types (E, V, EV, R, φ), shown alongside a knowledge graph whose edges are labelled left and above.]

Figure 1: A correct parse for a question given the knowledge graph on the right, using our model. We show the type for each node, and its denotation in terms of the knowledge graph. The words or and not are represented by vectors, which parameterize composition modules. The denotation for the complete question represents the answer to the question. Nodes here have types E for sets of entities, R for relations, V for ungrounded vectors, EV for a combination of entities and a vector, and φ for semantically vacuous nodes. While we show only one parse tree here, our model builds a parse chart subsuming all trees.

build an interpretation for the phrase large or green. We show that such approaches can generalize poorly when tested on more complex sentences than they were trained on. Our approach instead imposes independence assumptions that give a linguistically motivated inductive bias. In particular, it enforces that phrases are interpreted independently of surrounding words, allowing the model to generalize naturally to interpreting phrases in different contexts. In our model, large or green will be represented as a particular set of entities in a knowledge graph, and be intersected with the set of entities represented by the cubes node.

Another perspective on our work is as a method for learning layouts of Neural Module Networks (NMNs) (Andreas et al., 2016b). Work on NMNs has focused on construction of the structure of the network, variously using rules, parsers and reinforcement learning (Andreas et al., 2016a; Hu et al., 2017). Our end-to-end differentiable model jointly learns structures and modules by gradient descent.

Our model is a new combination of classical and neural methods, which maintains the interpretability and generalization behaviour of semantic parsing, while being end-to-end differentiable.

2 Model Overview

Our task is to answer a question q = w_{1..|q|}, with respect to a Knowledge Graph (KG) consisting of nodes E (representing entities) and labelled directed edges R (representing relationships between entities). In our task, answers are either booleans, or specific subsets of nodes from the KG.

Our model builds a parse for the sentence, in which phrases are grounded in the KG, and a small set of composition modules are used to combine phrases, resulting in a grounding for the complete question sentence that answers it. For example, in Figure 1, the phrases not and cylindrical are interpreted as a function word and an entity set, respectively, and then not cylindrical is interpreted by computing the complement of the entity set. The node at the root of the parse tree is the answer to the question. Our model answers questions by:

(a) Grounding individual tokens in a KG. Tokens can either be grounded as particular sets of entities and relations in the KG, represented as ungrounded vectors, or marked as being semantically vacuous. For each word, we learn parameters that are used to compute a distribution over semantic types and corresponding denotations in a KG (§ 3.1).

(b) Combining representations for adjacent phrases into representations for larger phrases, using trainable neural composition modules (§ 3.2). This produces a denotation for the phrase.

(c) Assigning a binary-tree structure to the question sentence, which determines how words are grounded, and which phrases are combined using which modules. We build a parse chart subsuming all possible structures, and train a parsing model to increase the likelihood of structures leading to the correct answer to questions. Different parses leading to a denotation for a phrase of type t are merged into an expected denotation, allowing dynamic programming (§ 4).

(d) Answering the question, with the most likely grounding of the phrase spanning the sentence.


3 Compositional Semantics

3.1 Semantic Types

Our model classifies spans of text into different semantic types to represent their meaning as explicit denotations, or ungrounded vectors. All phrases are assigned a distribution over semantic types. The semantic type determines how a phrase is grounded, and which composition modules can be used to combine it with other phrases. A phrase spanning w_{i..j} has a denotation ⟦w_{i..j}⟧^t_{KG} for each semantic type t. For example, in Figure 1, red corresponds to a set of entities, left corresponds to a set of relations, and not is treated as an ungrounded vector.

The semantic types we define can be classified into three broad categories; a code sketch of these representations is given at the end of this subsection.

Grounded Semantic Types: Spans of text that can be fully grounded in the KG.

1. Entity (E): Spans of text that can be grounded to a set of entities in the KG, for example, red sphere or large cube. E-type span grounding is represented as an attention value for each entity, [p_{e_1}, ..., p_{e_{|E|}}], where p_{e_i} ∈ [0, 1]. This can be viewed as a soft version of a logical set-valued denotation, which we refer to as a soft entity set.

2. Relation (R): Spans of text that can be grounded to a set of relations in the KG, for example, left of or not right of or above. R-type span grounding is represented by a soft adjacency matrix A ∈ R^{|E|×|E|} where A_{ij} = 1 denotes a directed edge from e_i → e_j.

3. Truth (T): Spans of text that can be grounded with a Boolean denotation, for example, Is anything red?, Is one ball green and are no cubes red?. T-type span grounding is represented using a real value p_true ∈ [0, 1] that denotes the probability of the span being true.

Ungrounded Semantic Types: Spans of text whose meaning cannot be grounded in the KG.

1. Vector (V): This type is used for spans representing functions that cannot yet be grounded in the KG (e.g. words such as and or every). These spans are represented using four different real-valued vectors v_1 ∈ R^2, v_2 ∈ R^3, v_3 ∈ R^4, and v_4 ∈ R^5, which are used to parameterize the composition modules described in § 3.2.

2. Vacuous (φ): Spans that are considered semantically vacuous but are necessary syntactically, e.g. of in left of a cube. During composition, these nodes act as identity functions.

Partially-Grounded Semantic Types: Spans of text that can only be partially grounded in the knowledge graph, such as and red or are four spheres. Here, we represent the span by a combination of a grounding and vectors, representing grounded and ungrounded aspects of meaning respectively. The grounded component of the representation will typically combine with another fully grounded representation, and the ungrounded vectors will parameterize the composition module. We define three semantic types of this kind: EV, RV and TV, corresponding to the combination of entities, relations and boolean groundings respectively with an ungrounded vector. Here, the word represented by the vectors can be viewed as a binary function, one of whose arguments has been supplied.
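To make the type representations above concrete, the following is a minimal sketch (not the authors' implementation) of how grounded, ungrounded, and partially grounded denotations could be stored. The container class, field names, and the toy five-entity KG are all hypothetical.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional, List

num_entities = 5  # hypothetical KG with 5 entities

@dataclass
class SpanDenotation:
    """Grounded and ungrounded parts of one span's representation."""
    # E-type: soft entity set, one attention value in [0, 1] per entity.
    entity_attention: Optional[np.ndarray] = None   # shape (num_entities,)
    # R-type: soft adjacency matrix, A[i, j] ~ directed edge e_i -> e_j.
    adjacency: Optional[np.ndarray] = None           # shape (num_entities, num_entities)
    # T-type: probability that the span denotes "true".
    p_true: Optional[float] = None
    # V-type (ungrounded): the four parameter vectors v_1..v_4 of sizes 2..5.
    vectors: Optional[List[np.ndarray]] = None

# "red": an E-type span attending to the red entities.
red = SpanDenotation(entity_attention=np.array([0.9, 0.1, 0.8, 0.0, 0.1]))

# "left": an R-type span, a soft adjacency matrix over entity pairs.
left = SpanDenotation(adjacency=np.random.rand(num_entities, num_entities))

# "or": a V-type span, represented only by its trainable vectors v_1..v_4.
or_word = SpanDenotation(vectors=[np.zeros(k) for k in (2, 3, 4, 5)])

# A partially grounded EV-type span ("or purple") keeps both parts.
or_purple = SpanDenotation(entity_attention=np.array([0.0, 0.7, 0.1, 0.9, 0.0]),
                           vectors=or_word.vectors)
```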

3.2 Composition Modules

Next, we describe how we compose phrase representations (from § 3.1) to represent larger phrases. We define a small set of composition modules that take as input two constituents of text with their corresponding semantic representations (grounded representations and ungrounded vectors), and output the semantic type and corresponding representation of the larger constituent. The composition modules are parameterized by the trainable word vectors. These can be divided into several categories:

Composition modules resulting in fully grounded denotations: Described in Figure 2.

Composition with φ-typed nodes: Phrases with type φ are treated as being semantically transparent identity functions. Phrases of any other type can be combined with these nodes, with no change to their type or representation.

Composition modules resulting in partially grounded denotations: We define several modules that combine fully grounded phrases with ungrounded phrases, by deterministically taking the union of the representations, giving phrases with partially grounded representations (§ 3.1). These modules are useful when words act as binary functions; here they combine with their first argument. For example, in Fig. 1, or and not cylindrical combine to make a phrase containing both the vectors for or and the entity set for not cylindrical.


E + E → E (example: "large red"):
p_{e_i} = σ([w_1, w_2, b] · [p^L_{e_i}, p^R_{e_i}, 1])
This module performs a function on a pair of soft entity sets, parameterized by the model's global parameter vector [w_1, w_2, b], to produce a new soft entity set. The composition function for a single entity's resulting attention value is shown. Such a composition module can be used to interpret compound nouns and entity appositions. For example, the composition module shown above learns to output the intersection of two entity sets.

V + E → E (example: "not cylindrical"):
p_{e_i} = σ(v_1 · [p^R_{e_i}, 1])
This module performs a function on a soft entity set, parameterized by a word vector, to produce a new soft entity set. For example, the word not learns to take the complement of a set of entities. The entity attention representation of the resulting span is computed using the indicated function, which takes the v_1 ∈ R^2 vector of the V constituent as a parameter argument and the entity attention vector of the E constituent as a function argument.

EV + E → E (example: "small or purple"):
p_{e_i} = σ(v_2 · [p^L_{e_i}, p^R_{e_i}, 1])
This module combines two soft entity sets into a third set, parameterized by the v_2 word vector. This composition function is similar to a linear threshold unit and is capable of modeling various mathematical operations such as logical conjunctions, disjunctions, differences etc. for different values of v_2. For example, the word or learns to model set union.

R + E → E (example: "left of a red cube"):
p_{e_i} = max_{e_j} A_{ji} · p^R_{e_j}
This module composes a set of relations (represented as a single soft adjacency matrix) and a soft entity set to produce an output soft entity set. The composition function uses the adjacency matrix representation of the R-span and the soft entity set representation of the E-span.

V + E → T (example: "is anything cylindrical"):
p_true = σ(v_3^1 [Σ_{e_i} σ([v_3^3, v_3^4] · [p^R_{e_i}, 1])] + v_3^2)
This module maps a soft entity set onto a soft boolean, parameterized by the word vector v_3. The module counts whether a sufficient number of elements are in (or out) of the set. For example, the word any should test if a set is non-empty.

EV + E → T (example: "is every cylinder blue"):
p_true = σ(v_4^1 [Σ_{e_i} σ([v_4^3, v_4^4, v_4^5] · [p^L_{e_i}, p^R_{e_i}, 1])] + v_4^2)
This module combines two soft entity sets into a soft boolean, which is useful for modelling generalized quantifiers. For example, in is every cylinder blue, the module can use the inner sigmoid to test if an element e_i is in the set of cylinders (p^L_{e_i} ≈ 1) but not in the set of blue things (p^R_{e_i} ≈ 0), and then use the outer sigmoid to return a value close to 1 if the sum of elements matching this property is close to 0.

TV + T → T (example: "are 2 balls red and is every cube blue"):
p_true = σ(v_2 · [p^L_true, p^R_true, 1])
This module maps a pair of soft booleans into a soft boolean using the v_2 word vector to parameterize the composition function. Similar to EV + E → E, this module facilitates modeling a range of boolean set operations. Using the same functional form for different composition functions allows our model to use the same ungrounded word vector (v_2) for compositions that are semantically analogous.

RV + R → R (example: "left of or above"):
A_{ij} = σ(v_2 · [A^L_{ij}, A^R_{ij}, 1])
This module composes a pair of soft sets of relations to produce an output soft set of relations. For example, the relations left and above are composed by the word or to produce a set of relations such that entities e_i and e_j are related if either of the two relations exists between them. The functional form for this composition is similar to the EV + E → E and TV + T → T modules.

Figure 2: Composition modules that compose two constituent span representations into the representation for the combined larger span, using the indicated equations.
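The equations in Figure 2 can be illustrated with a small sketch. The following is an assumed re-implementation of a few modules in NumPy, not the released code; parameter values in the example are placeholders chosen only to show the intended behaviour.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compose_E_E(pL, pR, w_global):
    """E + E -> E: p_ei = sigma([w1, w2, b] . [pL_ei, pR_ei, 1])."""
    w1, w2, b = w_global
    return sigmoid(w1 * pL + w2 * pR + b)

def compose_V_E(v1, pR):
    """V + E -> E: p_ei = sigma(v1 . [pR_ei, 1]), e.g. "not" as complement."""
    return sigmoid(v1[0] * pR + v1[1])

def compose_EV_E(v2, pL, pR):
    """EV + E -> E: p_ei = sigma(v2 . [pL_ei, pR_ei, 1]), e.g. "or" as union."""
    return sigmoid(v2[0] * pL + v2[1] * pR + v2[2])

def compose_R_E(A, pR):
    """R + E -> E: p_ei = max_j A[j, i] * pR_ej."""
    return (A * pR[:, None]).max(axis=0)

def compose_V_E_to_T(v3, pR):
    """V + E -> T: soft count of entities in the set, squashed to a boolean."""
    inner = sigmoid(v3[2] * pR + v3[3])          # per-entity membership test
    return sigmoid(v3[0] * inner.sum() + v3[1])  # aggregate to a truth value

# Example: "not cylindrical" with a hand-set v1 that behaves like complement.
p_cylindrical = np.array([0.95, 0.05, 0.9, 0.1])
v1_not = np.array([-10.0, 5.0])               # large negative weight flips the set
print(compose_V_E(v1_not, p_cylindrical))     # high values for non-cylindrical entities
```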


4 Parsing Model

Here, we describe how our model classifies question tokens into semantic types and computes their representations (§ 4.1), and recursively uses the composition modules defined above to parse the question into a soft latent tree that provides the answer (§ 4.2). The model is trained end-to-end using only question-answer supervision (§ 4.3).

4.1 Lexical Representation Assignment

Each token in the question sentence is assigned a distribution over the semantic types, and a grounded representation for each type. Tokens can only be assigned the E, R, V, and φ types. For example, the token cylindrical in the question in Fig. 1 is assigned a distribution over the 4 semantic types (one shown) and, for the E type, its representation is the set of cylindrical entities.

Semantic Type Distribution for Tokens: To compute the semantic type distribution, our model represents each word w and each semantic type t using an embedding vector, v_w, v_t ∈ R^d. The semantic type distribution is assigned with a softmax:

p(t | w_i) ∝ exp(v_t · v_{w_i})

Grounding for Tokens: For each semantic type, we need to compute the tokens' representations (a code sketch follows the list):

1. E-Type Representation: Each entity e ∈ E is represented using an embedding vector v_e ∈ R^d based on the concatenation of vectors for its properties. For each token w, we use its word vector to find the probability of each entity being part of the E-type grounding:

p^w_{e_i} = σ(v_{e_i} · v_w)  ∀ e_i ∈ E

For example, in Fig. 1, the word red will be grounded as all the red entities.

2. R-Type Representation: Each relation r ∈ R is represented using an embedding vector v_r ∈ R^d. For each token w_i, we compute a distribution over relations, and then use this to compute the expected adjacency matrix that forms the R-type representation for this token:

p(r | w_i) ∝ exp(v_r · v_{w_i})

A^{w_i} = Σ_{r ∈ R} p(r | w_i) · A_r

For example, the word left in Fig. 1 is grounded as the subset of edges with label 'left'.

3. V-Type Representation: For each word w ∈ V, we learn four vectors v_1 ∈ R^2, v_2 ∈ R^3, v_3 ∈ R^4, v_4 ∈ R^5, and use these as the representation for words with the V-type.

4. φ-Type Representation: Semantically vacuous words that do not require a representation.
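A minimal sketch of this lexical assignment step follows, assuming randomly initialised embeddings purely to show the shapes involved; variable names such as v_word and A_r are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes and randomly initialised embeddings.
d, num_entities, num_relations, num_types = 64, 5, 4, 4   # types: E, R, V, phi
rng = np.random.default_rng(0)
v_word = rng.normal(size=d)                        # embedding of one token w_i
v_types = rng.normal(size=(num_types, d))          # one embedding per semantic type
v_entities = rng.normal(size=(num_entities, d))    # entity embeddings (from attributes)
v_relations = rng.normal(size=(num_relations, d))  # relation embeddings
A_r = rng.random(size=(num_relations, num_entities, num_entities))  # per-relation adjacency

# Semantic type distribution: p(t | w_i) proportional to exp(v_t . v_{w_i}).
p_type = softmax(v_types @ v_word)

# E-type grounding: p_{e_i} = sigma(v_{e_i} . v_w) for every entity.
p_entity = sigmoid(v_entities @ v_word)

# R-type grounding: expected adjacency matrix under p(r | w_i).
p_rel = softmax(v_relations @ v_word)
A_word = np.tensordot(p_rel, A_r, axes=1)          # shape (num_entities, num_entities)
```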

4.2 Parsing Questions

To learn the correct structure for applying composition modules, we use a simple parsing model. We build a parse chart over the question encompassing all possible trees by applying all composition modules, similar to a standard CRF-based PCFG parser using the CKY algorithm. Each node in the parse chart, for each span w_{i..j} of the question, is represented as a distribution over different semantic types with their corresponding representations.

Phrase Semantic Type Potential (ψ^t_{i,j}): The model assigns a score, ψ^t_{i,j}, to each span w_{i..j}, for each semantic type t. This score is computed from all possible ways of forming the span w_{i..j} with type t. For a particular composition of span w_{i..k} of type t1 and w_{k+1..j} of type t2, using the t1 + t2 → t module, the composition score is:

ψ^{t1+t2→t}_{i,k,j} = ψ^{t1}_{i,k} · ψ^{t2}_{k+1,j} · e^{θ · f^{t1+t2→t}(i,j,k|q)}

where θ is a trainable vector and f^{t1+t2→t}(i, j, k|q) is a simple feature function. Features consist of a conjunction of the composition module type and: the words before (w_{i−1}) and after (w_{j+1}) the span, the first (w_i) and last word (w_k) in the left constituent, and the first (w_{k+1}) and last (w_j) word in the right constituent. The final t-type potential of w_{i..j} is computed by summing scores over all possible compositions:

ψ^t_{i,j} = Σ_{k=i}^{j−1} Σ_{(t1+t2→t) ∈ Modules} ψ^{t1+t2→t}_{i,k,j}

Combining Phrase Representations (⟦w_{i..j}⟧^t_{KG}): To compute w_{i..j}'s t-type denotation, ⟦w_{i..j}⟧^t_{KG}, we compute an expected output representation from all possible compositions that result in type t:

⟦w_{i..j}⟧^t_{KG} = (1 / ψ^t_{i,j}) Σ_{k=i}^{j−1} ⟦w_{i..k..j}⟧^t_{KG}

⟦w_{i..k..j}⟧^t_{KG} = Σ_{(t1+t2→t) ∈ Modules} ψ^{t1+t2→t}_{i,k,j} · ⟦w_{i..k..j}⟧^{t1+t2→t}_{KG}


where ⟦w_{i..j}⟧^t_{KG} is the t-type representation of the span w_{i..j} and ⟦w_{i..k..j}⟧^{t1+t2→t}_{KG} is the representation resulting from the composition of w_{i..k} with w_{k+1..j} using the t1+t2 → t composition module.
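The chart computation can be sketched as follows. This is a simplified, assumed implementation: denotations are plain vectors, only two toy modules are included, and the feature-based score e^{θ·f} is omitted (fixed to 1), so it illustrates the dynamic program rather than the full model.

```python
import numpy as np
from collections import defaultdict

# Each toy module maps (left_type, right_type) -> output_type with a combine
# function over plain vectors; the real model uses the typed modules of Sec. 3.2.
MODULES = {
    ("E", "E", "E"): lambda l, r: np.minimum(l, r),   # stand-in for intersection
    ("V", "E", "E"): lambda l, r: 1.0 - r,            # stand-in for complement
}

def parse_chart(leaf_scores, leaf_denotations, n):
    """leaf_scores[i][t] and leaf_denotations[i][t] hold per-token potentials and
    representations; returns chart[(i, j)][t] = (potential, expected denotation)."""
    chart = defaultdict(dict)
    for i in range(n):
        for t, s in leaf_scores[i].items():
            chart[(i, i)][t] = (s, leaf_denotations[i][t])

    for length in range(2, n + 1):                  # span length
        for i in range(n - length + 1):
            j = i + length - 1
            scores, weighted = defaultdict(float), defaultdict(float)
            for k in range(i, j):                   # split point
                for (t1, t2, t), fn in MODULES.items():
                    if t1 in chart[(i, k)] and t2 in chart[(k + 1, j)]:
                        s1, d1 = chart[(i, k)][t1]
                        s2, d2 = chart[(k + 1, j)][t2]
                        score = s1 * s2             # e^{theta.f} omitted in this sketch
                        scores[t] += score
                        weighted[t] = weighted[t] + score * fn(d1, d2)
            for t in scores:                        # expected denotation per type
                chart[(i, j)][t] = (scores[t], weighted[t] / scores[t])
    return chart

# Toy 2-token question "not red" over 3 entities.
leaves = [{"V": 1.0}, {"E": 1.0}]
denots = [{"V": None}, {"E": np.array([0.9, 0.1, 0.8])}]
chart = parse_chart(leaves, denots, n=2)
print(chart[(0, 1)]["E"])   # potential and expected entity set for the full span
```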

Answer Grounding: By recursively computing the phrase semantic-type potentials and representations, we can infer the semantic type distribution of the complete question and the resulting grounding for different semantic types t, ⟦w_{1..|q|}⟧^t_{KG}:

p(t|q) ∝ ψ^t_{1,|q|}    (1)

The answer type (boolean or subset of entities) for the question is computed using:

t* = argmax_{t ∈ {T, E}} p(t|q)    (2)

The corresponding grounding is ⟦w_{1..|q|}⟧^{t*}_{KG}, which answers the question.

4.3 Training Objective

Given a dataset D of (question, answer, knowledge-graph) tuples, {q_i, a_i, KG_i}_{i=1}^{|D|}, we train our model to maximize the log-likelihood of the correct answers. We maximize the following objective:

L = Σ_i log p(a_i | q_i, KG_i)    (3)

Further details regarding the training objective are given in Appendix A.

5 Dataset

We experiment with two datasets: 1) questions generated based on the CLEVR dataset (Johnson et al., 2017), and 2) the Referring Expression Generation (GenX) dataset (FitzGerald et al., 2013), both of which feature complex compositional queries.

CLEVRGEN: We generate a dataset of questions and answers based on the CLEVR dataset (Johnson et al., 2017), which contains knowledge graphs containing attribute information of objects and relations between them.

We generate a new set of questions as existing questions contain some biases that can be exploited by models.¹ We generate 75K questions for training and 37.5K for validation. Our questions test various challenging semantic operators. These include

¹ Johnson et al. (2017) found that many spatial relation questions can be answered only using absolute spatial information, and many long questions can be answered correctly without performing all steps of reasoning. We employ some simple tests to remove trivial biases from our dataset.

conjunctions (e.g. Is anything red and large?), negations (e.g. What is not spherical?), counts (e.g. Are five spheres green?), quantifiers (e.g. Is every red thing cylindrical?), and relations (e.g. What is left of and above a cube?). We create two test sets:

1. Short Questions: Drawn from the same distribution as the training data (37.5K).

2. Complex Questions: Longer questions than the training data (22.5K). This test set contains the same words and constructions, but chained into longer questions. For example, it contains questions such as What is a cube that is right of a metallic thing that is beneath a blue sphere? and Are two red cylinders that are above a sphere metallic? Solving these questions requires more multi-step reasoning.

REFERRING EXPRESSIONS (GENX) (FitzGerald et al., 2013): This dataset contains human-generated queries, which identify a subset of objects from a larger set (e.g. all of the red items except for the rectangle). It tests the ability of models to precisely understand human-generated language, which contains a far greater diversity of syntactic and semantic structures. This dataset does not contain relations between entities, and instead only focuses on entity-set operations. The dataset contains 3,920 questions for training, 600 for development and 940 for testing. Our modules and parsing model were designed independently of this dataset, and we re-use hyperparameters from CLEVRGEN.

6 Experiments

Our experiments investigate the ability of our model to understand complex synthetic and natural language queries, learn interpretable structure, and generalize compositionally. We also isolate the effect of learning the syntactic structure and representing sub-phrases using explicit denotations.

6.1 Experimentation Setting

We describe training details and the baselines.

Training Details: Training the model is challenging since it needs to learn both good syntactic structures and the complex semantics of neural modules—so we use Curriculum Learning (Bengio et al., 2009) to pre-train the model on an easier subset of questions. Appendix B contains the details of curriculum learning and other training details.


Model                    Boolean Questions   Entity Set Questions   Relation Questions   Overall
LSTM (NO KG)             50.7                14.4                   17.5                 27.2
LSTM                     88.5                99.9                   15.7                 84.9
BI-LSTM                  85.3                99.6                   14.9                 83.6
TREE-LSTM                82.2                97.0                   15.7                 81.2
TREE-LSTM (UNSUP.)       85.4                99.4                   16.1                 83.6
RELATION NETWORK         85.6                89.7                   97.6                 89.4
Our Model (Pre-parsed)   94.8                93.4                   70.5                 90.8
Our Model                99.9                100                    100                  99.9

Table 1: Results for Short Questions (CLEVRGEN): Performance of our model compared to baseline models on the Short Questions test set. The LSTM (NO KG) has accuracy close to chance, showing that the questions lack trivial biases. Our model almost perfectly solves all questions, showing its ability to learn challenging semantic operators and parse questions using only weak end-to-end supervision.

Baseline Models: We compare to the following baselines. (a) Models that assume a linear structure of language, and encode the question using linear RNNs—LSTM (NO KG), LSTM, BI-LSTM, and a RELATION-NETWORK (Santoro et al., 2017) augmented model.² (b) Models that assume a tree-like structure of language. We compare two variants of Tree-structured LSTMs (Zhu et al., 2015; Tai et al., 2015)—TREE-LSTM, which operates on pre-parsed questions, and TREE-LSTM (UNSUP.), an unsupervised Tree-LSTM model (Maillard et al., 2017) that learns to jointly parse and represent the sentence. For GENX, we also use an end-to-end semantic parsing model from Pasupat and Liang (2015). Finally, to isolate the contribution of the proposed denotational-semantics model, we train our model on pre-parsed questions. Note that all LSTM-based models only have access to the entities of the KG but not the relationship information between them. See Appendix C for details.

6.2 Experiments

Short Questions Performance: Table 1 showsthat our model perfectly answers all test questions,demonstrating that it can learn challenging seman-tic operators and induce parse trees from end tasksupervision. Performance drops when using ex-ternal parser, showing that our model learns aneffective syntactic model for this domain. TheRELATION NETWORK also achieves good perfor-mance, particularly on questions involving rela-tions. LSTM baselines work well on questions notinvolving relations.3

² We use this baseline only for CLEVRGEN since GENX does not contain relations.

³ Relation questions are out of scope for these models.

Model                    Non-relation Questions   Relation Questions   Overall
LSTM (NO KG)             46.0                     39.6                 41.4
LSTM                     62.2                     49.2                 52.2
BI-LSTM                  55.3                     47.5                 49.2
TREE-LSTM                53.5                     46.1                 47.8
TREE-LSTM (UNSUP.)       64.5                     42.6                 53.6
RELATION NETWORK         51.1                     38.9                 41.5
Our Model (Pre-parsed)   94.7                     74.2                 78.8
Our Model                81.8                     85.4                 84.6

Table 2: Results for Complex Questions (CLEVRGEN): All baseline models fail to generalize well to questions requiring longer chains of reasoning than those seen during training. Our model substantially outperforms the baselines, showing its ability to perform complex multi-hop reasoning and generalize from its training data. Analysis suggests that most errors from our model are due to assigning incorrect structures, rather than mistakes by the composition modules.

Complex Questions Performance: Table 2 shows results on complex questions, which are constructed by combining components of shorter questions. These require complex multi-hop reasoning, and the ability to generalize robustly to new types of questions. We use the same models as in Table 1, which were trained on short questions. All baselines achieve close to random performance, despite high accuracy for shorter questions. This shows the challenges in generalizing RNN encoders beyond their training data. In contrast, the strong inductive bias from our model structure allows it to generalize well to complex questions. Our model outperforms TREE-LSTM (UNSUP.) and the version of our model that uses pre-parsed questions, showing the effectiveness of explicit denotations and learning the syntax, respectively.


Model                    Accuracy
LSTM (NO KG)             0.0
LSTM                     64.9
BI-LSTM                  64.6
TREE-LSTM                43.5
TREE-LSTM (UNSUP.)       67.7
SEMPRE                   48.1
Our Model (Pre-parsed)   67.1
Our Model                73.7

Table 3: Results for Human Queries (GENX): Our model outperforms LSTM and semantic parsing models on complex human-generated queries, showing it is robust to natural language. Better performance than TREE-LSTM (UNSUP.) shows the efficacy of representing sub-phrases using explicit denotations. Our model also performs better without an external parser, showing the advantages of latent syntax.

Performance on Human-generated Language: Table 3 shows the performance of our model on complex human-generated queries in GENX. Our approach outperforms strong LSTM and semantic parsing baselines, despite the semantic parser's use of hard-coded operators. These results suggest that our method represents an attractive middle ground between minimally structured and highly structured approaches to interpretation. Our model learns to interpret operators such as except that were not considered during development. This shows that our model can learn to parse human language, which contains greater lexical and structural diversity than synthetic questions. Trees induced by the model are linguistically plausible (see Appendix D).

Error Analysis: We find that most model errors are due to incorrect assignments of structure, rather than semantic errors from the modules. For example, in the question Are four red spheres beneath a metallic thing small?, our model's parse composes metallic thing small into a constituent instead of composing red spheres beneath a metallic thing into a single node. Future work should explore more sophisticated parsing models.

Discussion: While our model shows promising results, there is significant potential for future work. Performing exact inference over large KGs is likely to be intractable, so approximations such as KNN search, beam search, feature hashing or parallelization may be necessary. To model the large number of entities in KGs such as Freebase, techniques proposed by recent work (Verga et al., 2017; Gupta et al., 2017) that explore representing entities as a composition of their properties, such as types and descriptions, could be used. The modules in this work were designed in a way that provides a good inductive bias for the kind of composition we expected them to model. For example, EV + E → E is modeled as a linear composition function, making it easy to represent words such as and and or. These modules can be exchanged with any other function with the same 'type signature', with different trade-offs—for example, more general feed-forward networks with greater representation capacity would be needed to represent a linguistic expression equivalent to xor. Similarly, more module types would be required to handle certain constructions—for example, a multiword relation such as much larger than needs a V + V → V module. This is an exciting direction for future research.

7 Related Work

Many approaches have been proposed to perform question answering against structured knowledge sources. Semantic parsing models have learned structures over pre-defined discrete operators, to produce logical forms that can be executed to answer the question. Early work trained using gold-standard logical forms (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2010), whereas later efforts have only used answers to questions (Clarke et al., 2010; Liang et al., 2011; Krishnamurthy and Kollar, 2013). A key difference is that our model must learn semantic operators from data, which may be necessary to model the fuzzy meanings of function words like many or few.

Another similar line of work is neural program induction models, such as Neural Programmer (Neelakantan et al., 2017) and Neural Symbolic Machines (Liang et al., 2017). These models learn to produce programs composed of predefined operators using weak supervision to answer questions against semi-structured tables.

Neural module networks have been proposed for learning semantic operators (Andreas et al., 2016b) for question answering. This model assumes that the structure of the semantic parse is given, and must only learn a set of operators. Dynamic Neural Module Networks (D-NMN) extend this approach by selecting from a small set of candidate module structures (Andreas et al., 2016a). We instead learn a model over all possible structures.


Our work is most similar to the N2NMN model (Hu et al., 2017), which learns both semantic operators and the layout in which to compose them. However, optimizing the layouts requires reinforcement learning, which is challenging due to the high variance of policy gradients, whereas our chart-based approach is end-to-end differentiable.

8 Conclusion

We have introduced a model for answering questions requiring compositional reasoning that combines ideas from compositional semantics with end-to-end learning of composition operators and structure. We demonstrated that the model is able to learn a number of complex composition operators from end-task supervision, and showed that the linguistically motivated inductive bias imposed by the structure of the model allows it to generalize well beyond its training data. Future work should explore scaling the model to other question answering tasks, using more general composition modules, and introducing additional module types.

Acknowledgement

We would like to thank Shyam Upadhyay and the anonymous reviewers for their helpful suggestions.

References

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In NAACL-HLT.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In CVPR, pages 39–48.

Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In CoNLL.

Nicholas FitzGerald, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Learning distributions over logical forms for referring expression generation. In EMNLP.

Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encoding of types, descriptions, and context. In EMNLP.

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In ICCV.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL.

Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL, 1:193–206.

Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In EMNLP.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In ACL.

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL.

Jean Maillard, Stephen Clark, and Dani Yogatama. 2017. Jointly learning sentence embeddings and syntax with unsupervised Tree-LSTMs. CoRR, abs/1705.09189.

Richard Montague. 1973. The proper treatment of quantification in ordinary English. In K. J. J. Hintikka, J. Moravcsik, and P. Suppes, editors, Approaches to Natural Language, pages 221–242. Reidel, Dordrecht.

Arvind Neelakantan, Quoc V. Le, Martín Abadi, Andrew McCallum, and Dario Amodei. 2017. Learning a natural language interface with Neural Programmer. In ICLR.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In ACL.

Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, and Aaron C. Courville. 2017. Learning visual reasoning without strong priors. CoRR, abs/1707.03017.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Timothy P. Lillicrap. 2017. A simple neural network module for relational reasoning. In NIPS.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL.

Patrick Verga, Arvind Neelakantan, and Andrew McCallum. 2017. Generalizing to unseen entities and entity pairs with row-less universal schema. In EACL.

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? TACL, 6:253–267.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In ICML.

A Training Objective

Given a dataset D of (question, answer, knowledge-graph) tuples, {q_i, a_i, KG_i}_{i=1}^{|D|}, we train our model to maximize the log-likelihood of the correct answers. Answers are either booleans (a ∈ {0, 1}), or specific subsets of entities (a = {e_j}) from the KG. We denote the semantic type of the answer as a_t. The model's answer is found by taking the complete question representation, containing a distribution over types and the representation for each type. We maximize the following objective:

L = Σ_i log p(a_i | q_i, KG_i)    (4)
  = Σ_i (L^i_b + L^i_e)    (5)

where L^i_b and L^i_e are respectively the objective functions for questions with boolean answers and entity-set answers:

L^i_b = 1_{a^i_t = T} · log( p_true^{a^i} · (1 − p_true)^{(1 − a^i)} )    (6)

L^i_e = (1_{a^i_t = E} / |E^i|) · log( Π_{e^i_j ∈ a^i} p_{e^i_j} · Π_{e^i_j ∉ a^i} (1 − p_{e^i_j}) )    (7)

We also add L2-regularization for the scalar parsing features introduced in § 4.2.
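A small sketch of the two likelihood terms (Eqs. 6 and 7), computed directly in log space, is given below; the function names and toy inputs are hypothetical.

```python
import numpy as np

def boolean_loss(p_true, answer):
    """Log-likelihood term for a boolean answer a in {0, 1} (Eq. 6)."""
    return answer * np.log(p_true) + (1 - answer) * np.log(1 - p_true)

def entity_set_loss(p_entities, answer_set, num_entities):
    """Log-likelihood term for an entity-set answer, averaged over entities (Eq. 7)."""
    log_lik = 0.0
    for i, p in enumerate(p_entities):
        log_lik += np.log(p) if i in answer_set else np.log(1 - p)
    return log_lik / num_entities

# Toy example: a boolean question answered "true" with p_true = 0.9, and an
# entity-set question whose gold answer is entities {0, 2}.
print(boolean_loss(0.9, 1))
print(entity_set_loss(np.array([0.8, 0.2, 0.7]), {0, 2}, num_entities=3))
```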

B Training Details

Representing Entities: Each entity in the CLEVRGEN and GENX datasets consists of 4 attributes. For each attribute-value, we learn an embedding vector and concatenate these vectors to form the representation for the entity.

Training details: For curriculum learning, for the CLEVRGEN dataset we use a 2-step schedule where we first train our model on simple attribute match (What is a red sphere?), attribute existence (Is anything blue?) and boolean composition (Is anything green and is anything purple?) questions, and in the second step on all questions. For GENX we use a 5-step, question-length based schedule, where we first train on shorter questions and eventually on all questions.

We tune hyper-parameters using validation accuracy on the CLEVRGEN dataset, and use the same hyper-parameters for both datasets. We train using SGD with a learning rate of 0.5, a mini-batch size of 4, and a regularization constant of 0.3. When assigning the semantic type distribution to the words at the leaves, we add a small positive bias of +1 for the φ-type and a small negative bias of −1 for the E-type score before the softmax. Our trainable parameters are: question word embeddings (64-dimensional), relation embeddings (64-dimensional), entity attribute-value embeddings (16-dimensional), four vectors per word for V-type representations, the parameter vector θ for the parsing model that contains six scalar feature scores per module per word, and the global parameter vector for the E+E→E module.

C Baseline Models

C.1 LSTM (No KG)

We use an LSTM network to encode the question as a vector q. We also define three other parameter vectors, t, e and b, that are used to predict the answer type P(a = T) = σ(q · t), the entity attention values p_{e_i} = σ(q · e), and the probability of the answer being True, p_true = σ(q · b).

C.2 LSTM

Similar to LSTM (NO KG), the question is encoded using an LSTM network as a vector q. Similar to our model, we learn entity attribute-value embeddings and represent each entity as the concatenation of the 4 attribute-value embeddings, v_{e_i}. Similar to LSTM (NO KG), we also define the t parameter vector to predict the answer type. The entity-attention values are predicted as p_{e_i} = σ(v_{e_i} · q). To predict the probability of the boolean-type answer being true, we first add the entity representations to form b = Σ_{e_i} v_{e_i}, then make the prediction as p_true = σ(q · b).
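The prediction heads shared by the LSTM baselines (C.1 and C.2) can be sketched as follows, assuming a question encoding q has already been produced by an LSTM; all tensors here are randomly initialised stand-ins.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, num_entities = 64, 5
rng = np.random.default_rng(1)
q = rng.normal(size=d)                           # question encoding from an LSTM (assumed)
t = rng.normal(size=d)                           # answer-type parameter vector
v_entities = rng.normal(size=(num_entities, d))  # entity representations v_{e_i}

p_answer_is_boolean = sigmoid(q @ t)             # P(a = T) = sigma(q . t)
p_entity = sigmoid(v_entities @ q)               # p_{e_i} = sigma(v_{e_i} . q)
b = v_entities.sum(axis=0)                       # summed entity representations
p_true = sigmoid(q @ b)                          # p_true = sigma(q . b)
```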


C.3 Tree-LSTM

Training the Tree-LSTM model requires pre-parsed sentences, for which we use a binary-constituency-tree-generating PCFG parser (Klein and Manning, 2003). We input the pre-parsed question to the Tree-LSTM to get the question embedding q. The rest of the model is the same as the LSTM model above.

C.4 Relation Network Augmented Model

The original formulation of the relation network module is as follows:

RN(q, KG) = f_φ( Σ_{i,j} g_θ(e_i, e_j, q) )    (8)

where e_i, e_j are the representations of the entities and q is the question representation from an LSTM network. The output of the Relation Network module is a scalar score value for the elements in the answer vocabulary. Since our dataset contains entity-set valued answers, we modified the module in the following manner.

We concatenate the entity-pair representations with the representations of the pair of relations between them⁴. We use the RN module to produce an output representation for each entity as:

RN_{e_i} = f_φ( Σ_j g_θ(e_i, e_j, r^1_{ij}, r^2_{ij}, q) )    (9)

Similar to the LSTM baselines, we define a parameter vector t to predict the answer type, and compute the vector b to compute the probability of the boolean-type answer being true.

To predict the entity-attention values, we use a separate attribute-embedding matrix to first generate the output representation for each entity, e^out_i, then predict the output attention values as follows:

p_{e_i} = σ( RN_{e_i} · e^out_i )    (10)

We tried other architectures as well, but this modification provided the best performance on the validation set. We also tuned the hyper-parameters and found the setting from Santoro et al. (2017) to work the best based on validation accuracy. We used a different 2-step curriculum to train the RELATION NETWORK module, in which we replace the boolean questions with the relation questions in the first schedule and jointly train on all questions in the subsequent schedule.

⁴ In the CLEVR dataset, between any pair of entities, only 2 directed relations, left or right, and above or beneath, are present.
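A rough sketch of the modified Relation Network of Eqs. 9 and 10 follows, with f_φ and g_θ reduced to single-layer stand-ins; the dimensions, weight matrices, and relation tensors are hypothetical placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, num_entities = 16, 4
rng = np.random.default_rng(2)
E = rng.normal(size=(num_entities, d))                   # entity representations e_i
R1 = rng.normal(size=(num_entities, num_entities, d))    # first relation embedding r1_ij
R2 = rng.normal(size=(num_entities, num_entities, d))    # second relation embedding r2_ij
q = rng.normal(size=d)                                   # LSTM question encoding (assumed)
Wg = rng.normal(size=(d, 5 * d)) * 0.1                   # stand-in for g_theta
Wf = rng.normal(size=(d, d)) * 0.1                       # stand-in for f_phi
E_out = rng.normal(size=(num_entities, d))               # per-entity output representations

def g_theta(ei, ej, r1, r2, q_vec):
    return np.tanh(Wg @ np.concatenate([ei, ej, r1, r2, q_vec]))

# Eq. 9: RN_ei = f_phi( sum_j g_theta(e_i, e_j, r1_ij, r2_ij, q) )
RN = np.stack([
    np.tanh(Wf @ sum(g_theta(E[i], E[j], R1[i, j], R2[i, j], q)
                     for j in range(num_entities)))
    for i in range(num_entities)
])

# Eq. 10: entity attention p_ei = sigma(RN_ei . e_out_i)
p_entity = sigmoid((RN * E_out).sum(axis=1))
```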

C.5 SEMPRE

The semantic parsing model from Pasupat and Liang (2015) answers natural language queries against semi-structured tables. The answer is a denotation as a list of cells in the table. To use the SEMPRE framework, we convert the KGs in GENX to tables as follows:

1. Each table has the first row (header) as: | ObjId | P1 | P2 | P3 | P4 |

2. Each row contains an object id, and the 4 property-attribute values in cells.

3. The answer denotation, i.e. the objects selected by the human annotators, is now represented as a list of object ids.

After converting the KGs to tables, the SEMPRE framework can be trivially used to train and test on the GENX dataset. We tune the number of epochs to train for based on validation accuracy and find 8 epochs over the training data to work the best. We use the default settings for the other hyper-parameters.
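The conversion described above can be sketched as follows; the entity dictionary format and object ids are hypothetical.

```python
# Convert a GenX-style KG (here a hypothetical dict of entity attributes) into
# the table format described above: a header row followed by one row per object.
kg = {
    "obj1": {"P1": "green", "P2": "triangle", "P3": "small", "P4": "solid"},
    "obj2": {"P1": "red", "P2": "circle", "P3": "large", "P4": "striped"},
}
answer_entities = ["obj1"]   # objects selected by the annotators

header = ["ObjId", "P1", "P2", "P3", "P4"]
rows = [[obj_id] + [attrs[p] for p in header[1:]] for obj_id, attrs in kg.items()]
denotation = list(answer_entities)   # answer as a list of object ids

print("| " + " | ".join(header) + " |")
for row in rows:
    print("| " + " | ".join(row) + " |")
```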

D Example Output Parses

In Figure 3, we show example queries from the GENX dataset and their highest scoring output structures from our learned model.


[Parse tree image omitted: the query "every thing not a cylinder or ramp", parsed with node types E, EV, V, and φ.]

(a) An example output from our learned model showing that it learns to correctly parse the question sentence and model the relevant semantic operator (or as a set union operation) to generate the correct answer denotation. It also learns to cope with lexical variability in human language: a triangle being referred to as a ramp.

[Parse tree image omitted: the query "all of the green objects except the triangles", parsed with node types E, EV, V, and φ.]

(b) An example output from our learned model showing that it can learn to correctly parse human-generated language into relatively complex structures and model semantic operators, such as except, that were not encountered during model development.

Figure 3: Example output structures from our learned model: Examples of queries from the GENX dataset and the corresponding highest scoring tree structures from our learned model. The examples show that our model is able to correctly parse human-generated language and jointly learn to model semantic operators, such as set unions, negations, etc.