arXiv:1901.03735v2 [cs.CL] 27 Oct 2019


EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference

Abhilasha Ravichander∗, Aakanksha Naik∗, Carolyn Rose, Eduard Hovy

Language Technologies Institute, Carnegie Mellon University
{aravicha, anaik, cprose, hovy}@cs.cmu.edu

Abstract

Quantitative reasoning is a higher-order reasoning skill that any intelligent natural language understanding system can reasonably be expected to handle. We present EQUATE¹ (Evaluating Quantitative Understanding Aptitude in Textual Entailment), a new framework for quantitative reasoning in textual entailment. We benchmark the performance of 9 published NLI models on EQUATE, and find that on average, state-of-the-art methods do not achieve an absolute improvement over a majority-class baseline, suggesting that they do not implicitly learn to reason with quantities. We establish a new baseline, Q-REAS, that manipulates quantities symbolically. In comparison to the best performing NLI model, it achieves success on numerical reasoning tests (+24.2%), but has limited verbal reasoning capabilities (-8.1%). We hope our evaluation framework will support the development of models of quantitative reasoning in language understanding.

1 Introduction

Numbers play a vital role in our lives. We reason with numbers in day-to-day tasks ranging from handling currency to reading news articles to understanding sports results, elections and stock markets. As numbers are used to communicate information accurately, reasoning with them is an essential core competence in understanding natural language (Levinson, 2001; Frank et al., 2008; Dehaene, 2011). A benchmark task in natural language understanding is natural language inference (NLI) (or recognizing textual entailment (RTE)) (Cooper et al., 1996; Condoravdi et al., 2003; Bos and Markert, 2005; Dagan et al., 2006), wherein a model determines if a natural language hypothesis

∗ The first two authors contributed equally to this work.
¹ Code and data available at https://github.com/AbhilashaRavichander/EQUATE.

RTE-QUANT
P: After the deal closes, Teva will generate sales of about $7 billion a year, the company said.
H: Teva earns $7 billion a year.

AWP-NLI
P: Each of farmer Cunningham's 6048 lambs is either black or white and there are 193 white ones.
H: 5855 of Farmer Cunningham's lambs are black.

NEWSNLI
P: Emmanuel Miller, 16, and Zachary Watson, 17, are charged as adults, police said.
H: Two teen suspects charged as adults.

REDDITNLI
P: Oxfam says richest one percent to own more than rest by 2016.
H: Richest 1% To Own More Than Half Worlds Wealth By 2016 Oxfam.

Table 1: Examples from evaluation sets in EQUATE

can be justifiably inferred from a given premise². Making such inferences often necessitates reasoning about quantities.

Consider the following example from Table 1:

P: With 99.6% of precincts counted, Dewhurst held 48% of the vote to 30% for Cruz.
H: Lt. Gov. David Dewhurst fails to get 50% of primary vote.

To conclude the hypothesis is inferable, a model must reason that since 99.6% of precincts are counted, even if all remaining precincts vote for Dewhurst, he would fail to get 50% of the primary vote. Scant attention has been paid to building datasets to evaluate this reasoning ability. To address this gap, we present EQUATE (Evaluating Quantity Understanding Aptitude in Textual Entailment) (§3), which consists of five evaluation sets, each

² Often, this is posed as a three-way decision where the hypothesis can be inferred to be true (entailment), false (contradiction) or cannot be determined (neutral).



featuring different facets of quantitative reasoning in textual entailment (Table 1), including verbal reasoning with quantities, basic arithmetic computation, dealing with approximations, and range comparisons.

We evaluate the ability of existing state-of-the-art NLI models to perform quantitative reasoning (§4.1) by benchmarking 9 published models on EQUATE. Our results show that most models are incapable of quantitative reasoning, instead relying on lexical cues for prediction. Additionally, we build Q-REAS, a shallow semantic reasoning baseline for quantitative reasoning in NLI (§4.2). Q-REAS is effective on synthetic test sets, which contain more quantity-based inference, but shows limited success on natural test sets, which require deeper linguistic reasoning. However, the hardest cases require a complex interplay between linguistic and numerical reasoning. The EQUATE evaluation framework makes it clear where this new challenge area for textual entailment stands.

2 Related Work

NLI has attracted community-wide interest as a stringent test for natural language understanding (Cooper et al., 1996; Fyodorov; Glickman et al., 2005; Haghighi et al., 2005; Harabagiu and Hickl, 2006; Romano et al., 2006; Dagan et al., 2006; Giampiccolo et al., 2007; Zanzotto et al., 2006; Malakasiotis and Androutsopoulos, 2007; MacCartney, 2009; De Marneffe et al., 2009; Dagan et al., 2010; Angeli and Manning, 2014; Marelli et al., 2014). Recently, the creation of large-scale datasets (Bowman et al., 2015; Williams et al., 2017; Khot et al., 2018) spurred the development of many neural models (Parikh et al., 2016; Nie and Bansal, 2017; Conneau et al., 2017; Balazs et al., 2017; Chen et al., 2017a; Radford et al., 2018; Devlin et al., 2018).

However, state-of-the-art models for NLI treat the task like a matching problem, which appears to work in many cases but breaks down in others. As the field moves past current models of the matching variety to ones that embody more of the reasoning we know is part of the task, we need benchmarks that will enable us to mark progress in the field. Prior work on challenge tasks has already made headway in defining tasks for subproblems such as lexical inference with hypernymy, co-hyponymy, and antonymy (Glockner et al., 2018; Naik et al., 2018). In this work, we specifically probe into quantitative reasoning.

De Marneffe et al. (2008) find that in a corpus of real-life contradiction pairs collected from Wikipedia and Google News, 29% of contradictions arise from numeric discrepancies, and in the RTE-3 (Recognizing Textual Entailment) development set, numeric contradictions make up 8.8% of contradictory pairs. Naik et al. (2018) find that model inability to do numerical reasoning causes 4% of errors made by state-of-the-art models. Sammons et al. (2010) and Clark (2018) argue for a systematic knowledge-oriented approach in NLI by evaluating specific semantic analysis tasks, identifying quantitative reasoning in particular as a focus area. Bentivogli et al. (2010) propose creating specialized datasets, but feature only 6 examples with quantitative reasoning. Our work bridges this gap by providing a more comprehensive examination of quantitative reasoning in NLI.

While, to the best of our knowledge, prior work has not studied quantitative reasoning in NLI, Roy (2017) proposes a model for a related subtask called quantity entailment, which aims to determine if a given quantity can be inferred from a sentence. In contrast, our work is concerned with general-purpose textual entailment, which considers if a given sentence can be inferred from another. Our work also relates to solving arithmetic word problems (Hosseini et al., 2014; Mitra and Baral, 2016; Zhou et al., 2015; Upadhyay et al., 2016; Huang et al., 2017; Kushman et al., 2014a; Koncel-Kedziorski et al., 2015; ?; Roy, 2017; Ling et al., 2017a). A key difference is that word problems focus on arithmetic reasoning, while the requirement for linguistic reasoning and world knowledge is limited, as the text is concise, straightforward, and self-contained (Hosseini et al., 2014; Kushman et al., 2014b). Our work provides a testbed that evaluates basic arithmetic reasoning while incorporating the complexity of natural language.

Recently, ? also recognize the importance of quantitative reasoning for text understanding. They propose DROP, a reading comprehension dataset focused on a limited set of discrete operations such as counting, comparison, sorting and arithmetic. In contrast, EQUATE features diverse phenomena that occur naturally in text, including reasoning with approximation, ordinals, implicit quantities and quantifiers, requiring NLI models to reason comprehensively about the interplay between quantities and language. Additionally, through EQUATE we suggest the inclusion of controlled synthetic tests in evaluation benchmarks. Controlled tests act as basic validation of model behaviour, isolating model ability to reason about a property of interest.

3 Quantitative Reasoning in NLI

Our interpretation of “quantitative reasoning” draws from cognitive testing and education (Stafford, 1972; Ekstrom et al., 1976), which considers it “verbal problem-solving ability”. While inextricably linked to mathematics, it is an inclusive skill involving everyday language rather than a specialized lexicon. To excel at quantitative reasoning, one must interpret quantities expressed in language, perform basic calculations, judge their accuracy, and justify quantitative claims using verbal and numeric reasoning. These requirements show a reciprocity: NLI lends itself as a test bed for quantitative reasoning, which, conversely, is important for NLI (Sammons et al., 2010; Clark, 2018). Motivated by this, we present the EQUATE (Evaluating Quantity Understanding Aptitude in Textual Entailment) framework.

3.1 The EQUATE Dataset

EQUATE consists of five NLI test sets featuring quantities. Three of these tests for quantitative reasoning feature language from real-world sources such as news articles and social media (§3.2; §3.3; §3.4). We focus on sentences containing quantities with numerical values, and consider an entailment pair to feature quantitative reasoning if it is at least one component of the reasoning required to determine the entailment label (but not necessarily the only reasoning component). Quantitative reasoning features quantity matching, quantity comparison, quantity conversion, arithmetic, qualitative processes, ordinality and quantifiers, quantity noun and adverb resolution³, as well as verbal reasoning with the quantity's textual context⁴. Appendix B gives some examples of these quantitative phenomena. We further filter sentence pairs which require only temporal reasoning, since specialized knowledge is needed to reason about time. These three test sets contain pairs which conflate multiple lexical and quantitative reasoning phenomena. In order to study aspects of quantitative reasoning in isolation, EQUATE further features two controlled synthetic tests (§3.5; §3.6), evaluating model ability to reason with quantifiers and perform simple arithmetic. We intend that models reporting performance on any NLI dataset additionally evaluate on the EQUATE benchmark, to demonstrate competence at quantitative reasoning.

³ Such as the quantities represented in dozen, twice, teenagers.
⁴ For example, ⟨Obama cuts tax rate to 28%, Obama wants to cut tax rate to 28% as part of overhaul⟩.

3.2 RTE-Quant

This test set is constructed from the RTE sub-corpus for quantity entailment (Roy, 2017), originally drawn from the RTE2-RTE4 datasets (Dagan et al., 2006). The original sub-corpus conflates temporal and quantitative reasoning. We discarded pairs requiring temporal reasoning, obtaining a set of 166 entailment pairs.

3.3 NewsNLI

This test set is created from the CNN corpus (Hermann et al., 2015) of news articles with abstractive summaries. We identify summary points with quantities, filtering out temporal expressions. For a summary point, the two most similar sentences⁵ from the article are chosen, flipping pairs where the premise begins with a first-person pronoun (e.g., ⟨“He had nine pears”, “Bob had nine pears”⟩ becomes ⟨“Bob had nine pears”, “He had nine pears”⟩). The top 50% of similar pairs are retained to avoid lexical overlap bias. We crowdsource annotations for a subset of this data from Amazon Mechanical Turk. Crowdworkers⁶ are shown two sentences and asked to determine whether the second sentence is definitely true, definitely false, or not inferable given the first. We collect 5 annotations per pair, and consider pairs with lowest token overlap between premise and hypothesis and least difference in premise-hypothesis lengths when stratified by entailment label. The top 1000 samples meeting these criteria form our final set. To validate crowdsourced labels, experts are asked to annotate 100 pairs. Crowdsourced gold labels match expert gold labels in 85% of cases, while individual crowdworker labels match expert gold labels in 75.8% of cases. Disagreements are manually resolved by experts, and examples not featuring quantitative reasoning are filtered, leaving a set of 968 samples.

⁵ According to Jaccard similarity.
⁶ We require crowdworkers to have an approval rate of 95% on at least 100 tasks and pass a qualification test.
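The sentence-similarity measure used above is token-level Jaccard similarity. A minimal sketch (whitespace tokenization and lowercasing are our simplifying assumptions; the paper does not specify its tokenizer):

```python
def jaccard_similarity(sent_a: str, sent_b: str) -> float:
    """Token-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a = set(sent_a.lower().split())
    b = set(sent_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# The flipped-pair example from above shares 3 of 5 token types.
score = jaccard_similarity("He had nine pears", "Bob had nine pears")
```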

Page 4: arXiv:1901.03735v1 [cs.CL] 11 Jan 20193 Quantitative Reasoning in NLI Our interpretation of “quantitative reasoning” draws from cognitive testing and education (Staf-ford ,1972;Ekstrom

Source    | Test Set    | Size | Classes | Data Source              | Annotation Source | Quantitative Phenomena
----------|-------------|------|---------|--------------------------|-------------------|-----------------------
Natural   | RTE-Quant   | 166  | 2       | RTE2-RTE4                | Experts           | Arithmetic, Ranges, Quantifiers
Natural   | NewsNLI     | 968  | 2       | CNN                      | Crowdworkers      | Ordinals, Quantifiers, Arithmetic, Approximation, Magnitude, Ratios
Natural   | RedditNLI   | 250  | 3       | Reddit                   | Experts           | Range, Arithmetic, Approximation, Verbal
Synthetic | Stress Test | 7500 | 3       | AQuA-RAT                 | Automatic         | Quantifiers
Synthetic | AwpNLI      | 722  | 2       | Arithmetic Word Problems | Automatic         | Arithmetic

Table 2: An overview of test sets included in EQUATE. RedditNLI and Stress Test are framed as 3-class (entailment, neutral, contradiction) while RTE-Quant, NewsNLI and AwpNLI are 2-class (entails=yes/no). RTE 2-4 formulate entailment as a 2-way decision. We find that few news article headlines are contradictory, thus NewsNLI is similarly framed as a 2-way decision. For algebra word problems, substituting the wrong answer in the hypothesis necessarily creates a contradiction under the event coreference assumption (De Marneffe et al., 2008), thus it is framed as a 2-way decision as well.

3.4 RedditNLI

This test set is sourced from the popular social forum Reddit⁷. Since reasoning about quantities is important in domains like finance and economics, we scrape all headlines from the posts on r/economics, considering titles that contain quantities and do not have meta-forum information. Titles appearing within three days of each other are clustered by Jaccard similarity, and the top 300 pairs are extracted. After filtering out nonsensical titles, such as concatenated stock prices, we are left with 250 sentence pairs. Similar to RTE, two expert annotators label these pairs, achieving a Cohen's kappa of 0.82. Disagreements are discussed to resolve final labels.
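Inter-annotator agreement above is reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. A self-contained sketch (the two-annotator, single-pass formulation; label names are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```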

3.5 Stress Test

We include the numerical reasoning stress test from Naik et al. (2018) as a synthetic sanity check. The stress test consists of 7500 entailment pairs constructed from sentences in algebra word problems (Ling et al., 2017b). Focusing on quantifiers, it requires models to compare entities from the hypothesis to the premise while incorporating quantifiers, but does not require them to perform the computation from the original algebra word problem (e.g., ⟨“NHAI employs 100 men to build a highway of 2 km in 50 days working 8 hours a day”, “NHAI employs less than 700 men to build a highway of 2 km in 50 days working 8 hours a day”⟩).

⁷ According to the Reddit User Agreement, users grant Reddit the right to make their content available to other organizations or individuals.

3.6 AwpNLI

To evaluate the arithmetic ability of NLI models, we repurpose data from arithmetic word problems (?). They have the following characteristic structure. First, they establish a world and optionally update its state. Then, a question is posed about the world. This structure forms the basis of our pair creation procedure. World building and update statements form the premise. A hypothesis template is generated by identifying modal/auxiliary verbs in the question, and subsequent verbs, which we call secondary verbs. We identify the agent and conjugate the secondary verb in present tense, followed by the identified unit, to form the final template (for example, the algebra word problem ‘Gary had 73.0 dollars. He spent 55.0 dollars on a pet snake. How many dollars did Gary have left?’ would generate the hypothesis template ‘Agent(Gary) Verb(Has) Answer(18.0) Unit(dollars) left’). For every template, the correct guess is used to create an entailed hypothesis. Contradictory hypotheses are created by randomly sampling a wrong guess (x ∈ Z+ if the correct guess is an integer, and x ∈ R+ if it is a real number)⁸. We check for grammaticality, finding only 2% ungrammatical hypotheses, which are manually corrected, leaving a set of 722 pairs.

⁸ From a uniform distribution over an interval of 10 around the correct guess (or 5 for numbers less than 5), to identify plausible wrong guesses.
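The wrong-guess sampling in footnote 8 can be sketched as follows. This is an illustrative reading, not the authors' code: the interval widths come from the footnote, while the positivity clamp and the retry loop are our assumptions about how ties and non-positive draws are handled.

```python
import random

def sample_wrong_guess(correct: float, seed: int = 0) -> float:
    """Sample a plausible wrong answer near the correct one: a uniform
    draw from an interval of width 10 around the answer (width 5 when
    the answer is below 5); integer answers yield integer guesses."""
    rng = random.Random(seed)
    half = 2.5 if correct < 5 else 5.0
    lo = max(0.0, correct - half)  # keep guesses strictly positive
    while True:
        guess = rng.uniform(lo, correct + half)
        if correct == int(correct):
            guess = float(int(round(guess)))  # stay in Z+ for integers
        if guess > 0 and guess != correct:
            return guess
```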


Figure 1: Overview of Q-REAS baseline.

4 Models

We describe the 9 NLI models⁹ used in this study, as well as our new baseline. The interested reader is invited to refer to the corresponding publications for further details.

4.1 NLI Models

1) Majority Class (MAJ): Simple baseline that always predicts the majority class in the test set.
2) Hypothesis-Only (HYP): FastText classifier (Mikolov et al., 2018) trained on only hypotheses (Gururangan et al., 2018).
3) ALIGN: A bag-of-words alignment model inspired by MacCartney (2009).¹⁰
4) CBOW: A simple bag-of-embeddings sentence representation model (Williams et al., 2017).
5) BiLSTM: The simple BiLSTM model described by Williams et al. (2017).
6) Chen (CH): Stacked BiLSTM-RNNs with shortcut connections and character word embeddings (Chen et al., 2017b).
7) InferSent: A single-layer BiLSTM-RNN model with max-pooling (Conneau et al., 2017).
8) SSEN: Stacked BiLSTM-RNNs with shortcut connections (Nie and Bansal, 2017).
9) ESIM: Sequential inference model proposed by Chen et al. (2017a), which uses BiLSTMs with an attention mechanism.
10) OpenAI GPT: Transformer-based language model (Vaswani et al., 2017), with finetuning on NLI (Radford et al., 2018).
11) BERT: Transformer-based language model

⁹ Accuracy of all models on MultiNLI closely matches original publications (numbers in Appendix A).
¹⁰ Model accuracy on the RTE-3 test set is 61.12%, comparable to the reported average model performance of 62.4% in the RTE competition.

INPUT
  Pc — set of “compatible” single-valued premise quantities
  Pr — set of “compatible” range-valued premise quantities
  H  — hypothesis quantity
  O  — operator set {+, −, ∗, /, =, ∩, ∪, \, ⊆}
  L  — length of equation to be generated
  SL — symbol list (Pc ∪ Pr ∪ H ∪ O)
  TL — type list (set of types from Pc, Pr, H)
  N  — length of symbol list
  K  — index of first range quantity in symbol list
  M  — index of first operator in symbol list
OUTPUT
  ei — index of symbol assigned to the i-th position in the postfix equation
VARIABLES
  xi — main ILP variable for position i
  ci — indicator variable: is ei a single value?
  ri — indicator variable: is ei a range?
  oi — indicator variable: is ei an operator?
  di — stack depth of ei
  ti — type index for ei

Table 3: Input, output and variable definitions for the Integer Linear Programming (ILP) framework used for quantity composition

(Vaswani et al., 2017), with a cloze-style and next-sentence prediction objective, and finetuning onNLI (Devlin et al., 2018).

4.2 Q-REAS Baseline System

Figure 1 describes the Q-REAS baseline for quantitative reasoning in NLI. The model manipulates quantity representations symbolically to make entailment decisions, and is intended to serve as a strong heuristic baseline for numerical reasoning on the EQUATE benchmark. The model has four stages: quantity mentions are extracted and parsed into semantic representations called NUMSETS (§4.2.1, §4.2.2); compatible NUMSETS are extracted (§4.2.3) and composed (§4.2.4) to form justifications; justifications are analyzed to determine entailment labels (§4.2.5).

4.2.1 Quantity Segmenter

We follow Barwise and Cooper (1981) in defining quantities as having a number, a unit, and an optional approximator. Quantity mentions are identified as the least-ancestor noun phrases containing cardinal numbers, taken from the constituency parse of the sentence.
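The segmenter above relies on a constituency parser. As a rough, parser-free stand-in, one can locate cardinal numbers (digits or number words) and take a small window of surrounding tokens; the regex, window size, and word list here are our simplifications, not the paper's method:

```python
import re

# Digits (optionally with commas/decimals) or common number words.
CARDINAL = re.compile(
    r"\b(?:\d[\d,]*(?:\.\d+)?|one|two|three|four|five|six|seven|"
    r"eight|nine|ten|hundred|thousand|million|billion)\b",
    re.IGNORECASE,
)

def quantity_mentions(sentence: str, window: int = 2):
    """Return a phrase-like token span around each cardinal number."""
    tokens = sentence.split()
    spans = []
    for i, tok in enumerate(tokens):
        if CARDINAL.search(tok):
            spans.append(" ".join(tokens[max(0, i - window): i + window + 1]))
    return spans
```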

4.2.2 Quantity Parser

The quantity parser constructs a grounded representation for each quantity mention in the premise


Definitional Constraints
  Range restriction: xi < K or xi = M − 1, for i ∈ [0, L−1] if ci = 1
                     xi ≥ K and xi < M, for i ∈ [0, L−1] if ri = 1
                     xi ≥ M, for i ∈ [0, L−1] if oi = 1
  Uniqueness: ci + ri + oi = 1 for i ∈ [0, L−1]
  Stack definition: d0 = 0 (stack depth initialization)
                    di = di−1 − 2oi + 1 for i ∈ [0, L−1] (stack depth update)
Syntactic Constraints
  First two operands: c0 + r0 = 1 and c1 + r1 = 1
  Last operator: xL−1 ≥ N − 1 (last operator should be one of {=, ⊆})
  Last operand: xL−2 = M − 1 (last operand should be the hypothesis quantity)
  Other operators: xi ≤ N − 2, for i ∈ [0, L−3] if oi = 1
  Other operands: xi < K, for i ∈ [0, L−3] if ci = 1
                  xi < M, for i ∈ [0, L−3] if ri = 1
  Empty stack: dL−1 = 0 (a non-empty stack indicates an invalid postfix expression)
  Premise usage: xi ≠ xj for i, j ∈ [0, L−1] if oi ≠ 1, oj ≠ 1
Operand Access
  Right operand: op2(xi) = xi−1 for i ∈ [0, L−1] such that oi = 1
  Left operand: op1(xi) = xl for i, l ∈ [0, L−1] where oi = 1 and l is the largest index such that l ≤ (i − 2) and dl = di

Table 4: Mathematical validity constraint definitions for the ILP framework. Functions op1() and op2() return the left and right operands for an operator respectively. Variables defined in Table 3.
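The stack-depth constraints in Table 4 can be illustrated with a quick postfix validity check. This is a simplified reading (binary operators only; we require the expression to reduce to exactly one value, which matches the table's update rule under a standard 1-based depth convention), not the ILP itself:

```python
def valid_postfix(is_operator):
    """Stack-depth check for a postfix expression over binary operators.
    Each operand pushes one value; each operator pops two and pushes one
    (the net effect of d_i = d_{i-1} - 2*o_i + 1). A well-formed
    expression never underflows and ends with a single value on the stack."""
    depth = 0
    for op in is_operator:
        depth += 1 - 2 * int(op)
        if depth < 1:  # an operator lacked two operands
            return False
    return depth == 1

# <P1, P2, -, H1, => from the composition example: operand, operand,
# operator, operand, operator.
ok = valid_postfix([0, 0, 1, 0, 1])
```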

Type Consistency Constraints
  Type assignment: ti = TL[k] for i ∈ [0, L−1] if ci + ri = 1 and type(SLi) = k
  Two type match: ti = ta = tb for i ∈ [0, L−1] such that oi = 1, xi ∈ {+, −, ∗, /, =, ∩, ∪, \, ⊆}, a = op1(xi), b = op2(xi)
  One type match: ti ∈ {ta, tb}, ta ≠ tb for i ∈ [0, L−1] such that oi = 1, xi = ∗, a = op1(xi), b = op2(xi)
                  ti = ta ≠ tb for i ∈ [0, L−1] such that oi = 1, xi = /, a = op1(xi), b = op2(xi)
Operator Consistency Constraints
  Arithmetic operators: ca = cb = 1 for i ∈ [0, L−1] such that oi = 1, xi ∈ {+, −, ∗, /, =}, a = op1(xi), b = op2(xi)
  Range operators: ra = rb = 1 for i ∈ [0, L−1] such that oi = 1, xi ∈ {∩, ∪, \}, a = op1(xi), b = op2(xi)
                   rb = 1 for i ∈ [0, L−1] such that oi = 1, xi = ⊆, b = op2(xi)

Table 5: Linguistic consistency constraint definitions for the ILP framework. Functions op1() and op2() return the left and right operands for an operator respectively. Variables defined in Table 3.

or hypothesis, henceforth known as a NUMSET¹¹. A NUMSET is a tuple (val, unit, ent, adj, loc, verb, freq, flux)¹² with:

1. val ∈ [R, R]: quantity value represented as a range
2. unit ∈ S: unit noun associated with the quantity
3. ent ∈ Sφ: entity noun associated with the unit (e.g., donations worth 100$)
4. adj ∈ Sφ: adjective associated with the unit, if any¹³
5. loc ⊆ Sφ: location of the unit (e.g., ‘in the bag’)¹⁴
6. verb ∈ Sφ: action verb associated with the quantity¹⁵
7. freq ⊆ Sφ: if the quantity recurs¹⁶ (e.g., ‘per hour’)
8. flux ∈ {increase to, increase from, decrease to, decrease from}φ: if the quantity is in a state of flux¹⁷

¹¹ A NUMSET may be a composition of other NUMSETS.
¹² As in Koncel-Kedziorski et al. (2015), S denotes all possible spans in the sentence, φ represents the empty span, and Sφ = S ∪ φ.
¹³ Extracted as governing verb linked to entity by an amod relation.
¹⁴ Extracted as a prepositional phrase attached to the quantity and containing a noun phrase.

To extract values for a quantity, we extract cardinal numbers, recording contiguity. We normalize the number¹⁸. We also handle simple ratios such as quarter, half, etc., and extract bounds (e.g., fewer than 10 apples is parsed to [−∞, 10] apples).
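The written-number normalization in footnote 18 can be sketched with the standard accumulate-and-multiply scheme below. This is an illustration under our own simplifications: the word lists are partial, and concatenative readings like "two fifty eight" → 258 would need extra handling beyond this sketch.

```python
WORD_VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "twenty": 20,
    "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}
MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000,
               "billion": 1_000_000_000}

def words_to_number(phrase):
    """Parse a written-out number, e.g. 'hundred fifty eight thousand'
    -> 158000: small words accumulate, 'hundred' scales the running
    value, and large multipliers flush it into the total."""
    total, current = 0, 0
    for word in phrase.lower().replace(",", "").split():
        if word in WORD_VALUES:
            current += WORD_VALUES[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in MULTIPLIERS:
            total += max(current, 1) * MULTIPLIERS[word]
            current = 0
    return total + current
```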

To extract units, we examine tokens adjacent

¹⁵ Extracted as the governing verb linked to the entity by a dobj or nsubj relation.

¹⁶ Extracted using the keywords per and every.
¹⁷ Using a gazetteer: increasing, rising, rose, decreasing, falling, fell, drop.
¹⁸ We remove “,”s, convert written numbers to floats, and decide the numerical value; for example, hundred fifty eight thousand is 158000, two fifty eight is 258, 3.74m is 3740000, etc. If cardinal numbers are non-adjacent, we look for an explicitly mentioned range marker such as ‘to’ or ‘between’.


Model     | RTE-Q (∆)    | NewsNLI (∆)  | RedditNLI (∆) | NR ST (∆)    | AWPNLI (∆)   | Nat.Avg. ∆ | Synth.Avg. ∆ | All Avg. ∆
----------|--------------|--------------|---------------|--------------|--------------|------------|--------------|-----------
MAJ       | 57.8 (0.0)   | 50.7 (0.0)   | 58.4 (0.0)    | 33.3 (0.0)   | 50.0 (0.0)   | +0.0       | +0.0         | +0.0
HYP       | 49.4 (-8.4)  | 52.5 (+1.8)  | 40.8 (-17.6)  | 31.2 (-2.1)  | 50.1 (+0.1)  | -8.1       | -1.0         | -5.2
ALIGN     | 62.1 (+4.3)  | 56.0 (+5.3)  | 34.8 (-23.6)  | 22.6 (-10.7) | 47.2 (-2.8)  | -4.7       | -6.8         | -5.5
CBOW      | 47.0 (-10.8) | 61.8 (+11.1) | 42.4 (-16.0)  | 30.2 (-3.1)  | 50.7 (+0.7)  | -5.2       | -1.2         | -3.6
BiLSTM    | 51.2 (-6.6)  | 63.3 (+12.6) | 50.8 (-7.6)   | 31.2 (-2.1)  | 50.7 (+0.7)  | -0.5       | -0.7         | -0.6
CH        | 54.2 (-3.6)  | 64.0 (+13.3) | 55.2 (-3.2)   | 30.3 (-3.0)  | 50.7 (+0.7)  | +2.2       | -1.2         | +0.9
InferSent | 66.3 (+8.5)  | 65.3 (+14.6) | 29.6 (-28.8)  | 28.8 (-4.5)  | 50.7 (+0.7)  | -1.9       | -1.9         | -1.9
SSEN      | 58.4 (+0.6)  | 65.1 (+14.4) | 49.2 (-9.2)   | 28.4 (-4.9)  | 50.7 (+0.7)  | +1.9       | -2.1         | +0.3
ESIM      | 54.8 (-3.0)  | 62.0 (+11.3) | 45.6 (-12.8)  | 21.8 (-11.5) | 50.1 (+0.1)  | -1.5       | -5.7         | -3.2
GPT       | 68.1 (+10.3) | 72.2 (+21.5) | 52.4 (-6.0)   | 36.4 (+3.1)  | 50.0 (+0.0)  | +8.6       | +1.6         | +5.8
BERT      | 57.2 (-0.6)  | 72.8 (+22.1) | 49.6 (-8.8)   | 36.9 (+3.6)  | 42.2 (-7.8)  | +4.2       | -2.1         | +1.7
Q-REAS    | 56.6 (-1.2)  | 61.1 (+10.4) | 50.8 (-7.6)   | 63.3 (+30.0) | 71.5 (+21.5) | +0.5       | +25.8        | +10.6

Table 6: Accuracies (%) of 9 NLI models and Q-REAS on five tests for quantitative reasoning in entailment. ∆ captures improvement over the majority-class baseline for a dataset. Column Nat.Avg. reports the average accuracy (%) of each model across the 3 evaluation sets constructed from natural sources (RTE-Quant, NewsNLI, RedditNLI), whereas Synth.Avg. reports the average accuracy (%) on the 2 synthetic evaluation sets (Stress Test, AwpNLI). Column All Avg. represents the average accuracy (%) of each model across all 5 evaluation sets in EQUATE.

to cardinal numbers in the quantity mention and identify known units. If no known units are found, we assign as unit the token in a numerical-modifier relationship with the cardinal number, or else the nearest noun to the cardinal number. A quantity is determined to be approximate if the word in an adverbial-modifier relation with the cardinal number appears in a gazetteer¹⁹. If approximate, the range is extended to ±2% of the current value.
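The approximator handling above widens a point value into a range. A minimal sketch (the gazetteer here is a subset of footnote 19; the single-modifier interface is our simplification):

```python
APPROXIMATORS = {"roughly", "approximately", "about", "nearly",
                 "around", "circa", "almost"}  # subset of the gazetteer

def value_range(value, modifier=None):
    """Return the [lo, hi] range for a parsed value, widened to ±2%
    when an approximator modifies the cardinal number."""
    if modifier is not None and modifier.lower() in APPROXIMATORS:
        return [value * 0.98, value * 1.02]
    return [value, value]

# "about $7 billion" becomes roughly [6.86e9, 7.14e9].
rng = value_range(7_000_000_000, "about")
```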

4.2.3 Quantity Pruner

The pruner constructs “compatible” premise-hypothesis NUMSET pairs. Consider the pair “Insurgents killed 7 U.S. soldiers, set off a car bomb that killed four Iraqi policemen” and “7 US soldiers were killed, and at least 10 Iraqis died”. Our parser extracts NUMSETS corresponding to “four Iraqi policemen” and “7 US soldiers” from the premise and hypothesis respectively. But these NUMSETS should not be compared, as they involve different units. The pruner discards such incompatible pairs. Heuristics to identify unit-compatible NUMSET pairs cover three cases: 1) direct string match, 2) synonymy/hypernymy relations from WordNet, 3) one unit is a nationality/job²⁰ and the other unit is synonymous with person (Roy, 2017).

[19] roughly, approximately, about, nearly, roundabout, around, circa, almost, approaching, pushing, more or less, in the neighborhood of, in the region of, on the order of, something like, give or take (a few), near to, close to, in the ballpark of

[20] Lists of jobs and nationalities scraped from Wikipedia.
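The three compatibility heuristics can be sketched as below. The WordNet synonymy/hypernymy lookup and the scraped job/nationality lists are replaced with tiny hand-written stubs (SYNONYM_PAIRS and PERSON_LIKE are illustrative, not the resources the framework uses):

```python
# Stub for WordNet synonymy/hypernymy (heuristic 2); illustrative pairs.
SYNONYM_PAIRS = {frozenset({"car", "automobile"}), frozenset({"boys", "lads"})}
# Stub for the scraped job/nationality lists (heuristic 3).
PERSON_LIKE = {"policemen", "soldiers", "iraqis", "teens", "suspects"}

def units_compatible(u1, u2):
    u1, u2 = u1.lower(), u2.lower()
    if u1 == u2:                              # 1) direct string match
        return True
    if frozenset({u1, u2}) in SYNONYM_PAIRS:  # 2) WordNet stand-in
        return True
    # 3) one unit is a job/nationality, the other is synonymous with person;
    # note that two *different* person-like units remain incompatible.
    if (u1 in PERSON_LIKE and u2 == "people") or \
       (u2 in PERSON_LIKE and u1 == "people"):
        return True
    return False
```

So "policemen" vs. "soldiers" is pruned, while "policemen" vs. "people" survives.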

4.2.4 Quantity Composition

The composition module detects whether a hypothesis NUMSET is justified by composing "compatible" premise NUMSETS. For example, consider the pair "I had 3 apples but gave one to my brother" and "I have two apples". Here, the premise NUMSETS P1 ("3 apples") and P2 ("one apple") must be composed to deduce that the hypothesis NUMSET H1 ("2 apples") is justified. Our framework accomplishes this by generating postfix arithmetic equations [21] from premise NUMSETS that justify the hypothesis NUMSET [22]. In this example, the expression ⟨P1, P2, −, H1, =⟩ will be generated.
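The postfix justification check itself is a standard stack evaluation. In this sketch, each NUMSET is assumed to reduce to a single numeric value (ranges and units are ignored for brevity), and the trailing "=" compares the composed premise value against the hypothesis value:

```python
# Binary operators allowed in composed equations.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def equation_justifies(postfix, values):
    """Evaluate a postfix equation such as [P1, P2, -, H1, =] over a
    mapping from NUMSET ids to numeric values."""
    stack = []
    for tok in postfix:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
        elif tok == "=":
            b, a = stack.pop(), stack.pop()
            return a == b
        else:
            stack.append(values[tok])
    raise ValueError("postfix equation must end with '='")
```

For the apples example, equation_justifies(["P1", "P2", "-", "H1", "="], {"P1": 3, "P2": 1, "H1": 2}) returns True.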

The set of possible equations is exponential in the number of NUMSETS, making exhaustive generation intractable. But a large number of equations are invalid, as they violate constraints such as unit consistency. Thus, our framework uses integer linear programming (ILP) to constrain the equation space. It is inspired by prior work on algebra word problems (Koncel-Kedziorski et al., 2015), with some key differences:

1. Arithmetic equations: We focus on arithmetic equations instead of algebraic ones.

2. Range arithmetic: Quantitative reasoning involves ranges, which are handled by representing them as endpoint-inclusive intervals and adding the four range operators (∪, ∩, \, ⊆).

3. Hypothesis quantity-driven: We optimize an ILP model for each hypothesis NUMSET, because a sentence pair is marked "entailment" iff every hypothesis quantity is justified.

[21] Note that arithmetic equations differ from algebraic equations in that they do not contain unknown variables.

[22] Direct comparisons are incorporated by adding "=" as an operator.
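Assuming ranges are endpoint-inclusive (lo, hi) tuples, the four range operators of item 2 might be sketched as below. This is a simplification that sidesteps the awkward cases (disjoint unions, a difference that splits an interval in two), which a full implementation would have to represent:

```python
def union(a, b):
    # Assumes overlapping or adjacent intervals.
    return (min(a[0], b[0]), max(a[1], b[1]))

def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def subset(a, b):
    """a ⊆ b for endpoint-inclusive intervals."""
    return b[0] <= a[0] and a[1] <= b[1]

def difference(a, b):
    """a \\ b when b overlaps at most one end of a; None if a ⊆ b."""
    inter = intersect(a, b)
    if inter is None:
        return a
    if subset(a, b):
        return None
    # Keep whichever side of `a` survives (endpoint handling is simplified).
    return (a[0], inter[0]) if inter[0] > a[0] else (inter[1], a[1])
```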

Table 3 describes ILP variables. We impose the following types of constraints:

1. Definitional Constraints: Ensure that ILP variables take on valid values by constraining initialization, range, and update.
2. Syntactic Constraints: Assure syntactic validity of generated postfix expressions by limiting operator-operand ordering.
3. Operand Access: Simulate stack-based evaluation correctly by choosing correct operator-operand assignments.
4. Type Consistency: Ensure that all operations are type-compatible.
5. Operator Consistency: Force range operators to have range operands and mathematical operators to have single-valued operands.

Definitional, syntactic, and operand access constraints ensure mathematical validity, while type and operator consistency constraints add linguistic consistency. Constraint formulations are provided in Tables 4 and 5. We limit tree depth to 3 and retrieve a maximum of 50 solutions per hypothesis NUMSET, then solve to determine whether each equation is mathematically correct. We discard equations that use invalid operations (division by 0) or add unnecessary complexity (multiplication/division by 1). The remaining equations are considered plausible justifications.
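As a toy stand-in for the ILP search, one can brute-force single-operator postfix equations over premise NUMSET values and apply the same pruning rules (discard division by zero and multiplication/division by 1, cap the number of solutions). The real framework searches equation trees up to depth 3 under the full constraint set:

```python
from itertools import permutations

def plausible_equations(premises, hypothesis, max_solutions=50):
    """Enumerate postfix equations [Pi, Pj, op, H1, =] whose value equals
    the hypothesis quantity; `premises` is a list of numeric values."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    solutions = []
    for (i, a), (j, b) in permutations(enumerate(premises), 2):
        for sym, fn in ops.items():
            if sym == "/" and b == 0:
                continue                 # invalid operation: division by 0
            if sym in "*/" and 1 in (a, b):
                continue                 # unnecessary complexity: mult/div by 1
            if fn(a, b) == hypothesis:
                solutions.append([f"P{i+1}", f"P{j+1}", sym, "H1", "="])
                if len(solutions) >= max_solutions:
                    return solutions
    return solutions
```

For the AwpNLI lambs example in Table 1, plausible_equations([6048, 193], 5855) finds the single justification ["P1", "P2", "-", "H1", "="].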

4.2.5 Global Reasoner

The global reasoner predicts the final entailment label as shown in Algorithm 1 [23], on the assumption that every NUMSET in the hypothesis has to be justified [24] for entailment.

Algorithm 1 PredictEntailmentLabel(P, H, C, E)
Input: Premise quantities P, hypothesis quantities H, compatible pairs C, equations E
Output: Entailment label l ∈ {e, c, n}
 1: if C = ∅ then return n
 2: J ← ∅
 3: L ← []
 4: for qh ∈ H do
 5:     Jh ← {qp | qp ∈ P, (qp, qh) ∈ C}
 6:     J ← J ∪ {(qh, Jh)}
 7:     L ← L + [false]
 8: for (qh, Jh) ∈ J do
 9:     if Jh = ∅ then return n
10:     for qp ∈ Jh do
11:         s ← MaxSimilarityClass(qp, qh)
12:         if s = e then
13:             if ValueMatch(qp, qh) then
14:                 L[qh] = true
15:             if !ValueMatch(qp, qh) then
16:                 L[qh] = false
17:         if s = c then
18:             if ValueMatch(qp, qh) then
19:                 L[qh] = c
20: for qh ∈ H do
21:     Eq ← {ei ∈ E | hyp(ei) = qh}
22:     if Eq ≠ ∅ then
23:         L[qh] = true
24: if c ∈ L then return c
25: if count(L, true) = len(L) then return e
26: return n

[23] MaxSimilarityClass() takes two quantities and returns a probability distribution over entailment labels based on unit match. Similarly, ValueMatch() detects whether two quantities match in value (this function can also handle ranges).

[24] This is a necessary but not sufficient condition for entailment. Consider the example ⟨'Sam believed Joan had 5 apples', 'Joan had 5 apples'⟩. The hypothesis quantity of 5 apples is justified, but this is not sufficient for entailment.

5 Results and Discussion

Table 6 presents results on EQUATE. All models except Q-REAS are trained on MultiNLI; Q-REAS utilizes WordNet and lists from Wikipedia. We observe that neural models, particularly OpenAI GPT, excel at verbal aspects of quantitative reasoning (RTE-Quant, NewsNLI), whereas Q-REAS excels at numerical aspects (Stress Test, AwpNLI).
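Algorithm 1 can be rendered as a runnable sketch, with its helpers stubbed out: quantities are (value, unit) tuples, MaxSimilarityClass is reduced to unit equality, ValueMatch to exact value equality (no ranges), and E is simplified to the set of hypothesis quantities justified by some composed equation:

```python
def value_match(qp, qh):
    return qp[0] == qh[0]          # stub for ValueMatch (exact values only)

def max_similarity_class(qp, qh):
    return "e" if qp[1] == qh[1] else "c"   # stub for MaxSimilarityClass

def predict_entailment_label(P, H, C, E):
    """P, H: premise/hypothesis quantities; C: compatible (qp, qh) pairs;
    E: hypothesis quantities justified by an equation (Section 4.2.4)."""
    if not C:
        return "n"
    L = {}
    for qh in H:
        support = [qp for qp in P if (qp, qh) in C]
        if not support:            # an unjustifiable hypothesis quantity
            return "n"
        L[qh] = False
        for qp in support:
            s = max_similarity_class(qp, qh)
            if s == "e":
                L[qh] = value_match(qp, qh)
            elif value_match(qp, qh):   # s == "c": contradiction signal
                L[qh] = "c"
    for qh in H:
        if qh in E:                # a composed equation justifies qh
            L[qh] = True
    if "c" in L.values():
        return "c"
    return "e" if all(v is True for v in L.values()) else "n"
```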

5.1 Neural Models on NewsNLI

To tease apart contributory effects of numerical and verbal reasoning in natural data, we experiment with NewsNLI. We extract all entailed pairs where a quantity appears in both premise and hypothesis, and perturb the quantity in the hypothesis, generating contradictory pairs. For example, the pair ⟨'In addition to 79 fatalities, some 170 passengers were injured.', 'The crash took the lives of 79 people and injured some 170', entailment⟩ is changed to ⟨'In addition to 79 fatalities, some 170 passengers were injured.', 'The crash took the lives of 80 people and injured some 170', contradiction⟩, assuming scalar implicature and event coreference. Our perturbed test set contains 218 pairs. On this set, GPT [25] achieves an accuracy of 51.18%, as compared to 72.04% on the unperturbed set, suggesting the model relies on verbal cues rather than numerical reasoning. In comparison, Q-REAS achieves an accuracy of 98.1% on the perturbed set, compared to 75.36% on the unperturbed set, highlighting reliance on quantities rather than verbal information. Closer examination reveals that GPT switches to predicting the 'neutral' category for perturbed samples instead of entailment, accounting for 42.7% of its errors, possibly symptomatic of lexical bias issues (Naik et al., 2018).
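The perturbation procedure can be sketched as below: find a quantity shared by premise and hypothesis and increment it in the hypothesis, relabeling the pair as a contradiction. The +1 increment mirrors the 79 to 80 example; the perturbations actually applied to NewsNLI may differ:

```python
import re

def perturb_pair(premise, hypothesis):
    """Turn an entailed pair into a contradiction by bumping the first
    quantity that appears in both sentences (assumes scalar implicature
    and event coreference, as noted in the text)."""
    premise_nums = set(re.findall(r"\d+", premise))
    shared = [n for n in re.findall(r"\d+", hypothesis) if n in premise_nums]
    if not shared:
        return None
    num = shared[0]
    perturbed = hypothesis.replace(num, str(int(num) + 1), 1)
    return (premise, perturbed, "contradiction")
```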

5.2 What Quantitative Phenomena Are Hard?

We sample 100 errors made by Q-REAS on each test in EQUATE to identify phenomena not addressed by simple quantity comparison. Our analysis of the causes of error suggests avenues for future research:

1. Multi-step numerical-verbal reasoning: Models do not perform well on examples requiring interleaved verbal and quantitative reasoning, especially multi-step deduction. Consider the pair ⟨"Two people were injured in the attack", "Two people perpetrated the attack"⟩. The quantities "two people" and "two people" are unit-compatible, but must not be compared. Another example is the NewsNLI entailment pair in Table 1. This pair requires us to identify that 16 and 17 refer to Emmanuel and Zachary's ages (quantitative), deduce that this implies they are teenagers (verbal), and finally count them (quantitative) to get the hypothesis quantity "two teens". Numbers and language are intricately interleaved, and developing a reasoner capable of handling such complex interplay is challenging.

2. Lexical inference: Lack of real-world knowledge causes errors in identifying quantities and valid comparisons. Errors include mapping abbreviations to correct units ("m" to "meters"), detecting part-whole coreference ("seats" can be used to refer to "buses"), and resolving hypernymy/hyponymy ("young men" to "boys").

[25] The best-performing neural model on EQUATE.

3. Inferring underspecified quantities: Quantity attributes can be implicitly specified, requiring inference to generate a complete representation. Consider "A mortar attack killed four people and injured 80". A system must infer that the quantity "80" refers to people. On RTE-Quant, 20% of such cases stem from zero anaphora, a hard problem in coreference resolution.

4. Arithmetic comparison limitations: These examples require composition between incompatible quantities. For example, consider ⟨"There were 3 birds and 6 nests", "There were 3 more nests than birds"⟩. To correctly label this pair, "3 birds" and "6 nests" must be composed.

6 Conclusion

In this work, we present EQUATE, an evaluation framework to estimate the ability of models to reason quantitatively in textual entailment. We observe that existing neural approaches rely heavily on the lexical matching aspect of the task to succeed, rather than reasoning about quantities. We implement a strong symbolic baseline, Q-REAS, that achieves success at numerical reasoning but lacks sophisticated verbal reasoning capabilities. The EQUATE resource presents an opportunity for the community to develop powerful hybrid neuro-symbolic architectures, combining the strengths of neural models with specialized reasoners such as Q-REAS. We hope our insights lead to the development of models that can more precisely reason about the important, frequent, but understudied phenomenon of quantities in natural language.

Acknowledgments

This research was supported in part by grants from the National Science Foundation Secure and Trustworthy Computing program (CNS-1330596, CNS-15-13957, CNS-1801316, CNS-1914486) and a DARPA Brandeis grant (FA8750-15-2-0277). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF, DARPA, or the US Government. The author Naik was supported by a fellowship

Page 10: arXiv:1901.03735v1 [cs.CL] 11 Jan 20193 Quantitative Reasoning in NLI Our interpretation of “quantitative reasoning” draws from cognitive testing and education (Staf-ford ,1972;Ekstrom

from the Center for Machine Learning and Health at Carnegie Mellon University. The authors would like to thank Graham Neubig, Mohit Bansal, and Dongyeop Kang for helpful discussion regarding this work, and Shruti Rijhwani and Siddharth Dalmia for reviews while drafting this paper. The authors are also grateful to Lisa Carey Lohmueller and Xinru Yan for volunteering their time for pilot studies.

References

Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 534–545.

Jorge Balazs, Edison Marrese-Taylor, Pablo Loyola, and Yutaka Matsuo. 2017. Refining raw sentence representations for textual entailment recognition via attention. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 51–55, Copenhagen, Denmark. Association for Computational Linguistics.

Jon Barwise and Robin Cooper. 1981. Generalized quantifiers and natural language. In Philosophy, Language, and Artificial Intelligence, pages 241–301. Springer.

Luisa Bentivogli, Elena Cabrio, Ido Dagan, Danilo Giampiccolo, Medea Lo Leggio, and Bernardo Magnini. 2010. Building textual entailment specialized data sets: a methodology for isolating linguistic phenomena relevant to inference. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Johan Bos and Katja Markert. 2005. Recognising textual entailment with logical inference. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 628–635. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017a. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Recurrent neural network-based sentence encoder with gated attention for natural language inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 36–40, Copenhagen, Denmark. Association for Computational Linguistics.

Peter Clark. 2018. What knowledge is needed to solve the RTE5 textual entailment challenge? arXiv preprint arXiv:1806.03561.

Cleo Condoravdi, Dick Crouch, Valeria De Paiva, Reinhard Stolle, and Daniel G. Bobrow. 2003. Entailment, intensionality and text understanding. In Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning - Volume 9, pages 38–45. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical report.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2010. The fourth PASCAL recognizing textual entailment challenge. Journal of Natural Language Engineering.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190. Springer.

Marie-Catherine de Marneffe, Anna N. Rafferty, and Christopher D. Manning. 2008. Finding contradictions in text. In Proceedings of ACL-08: HLT, pages 1039–1047, Columbus, Ohio. Association for Computational Linguistics.

Stanislas Dehaene. 2011. The Number Sense: How the Mind Creates Mathematics. OUP USA.

Marie-Catherine de Marneffe, Sebastian Pado, and Christopher D. Manning. 2009. Multi-word expressions in textual inference: Much ado about nothing? In Proceedings of the 2009 Workshop on Applied Textual Inference, pages 1–9. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.


Ruth B. Ekstrom, Diran Dermen, and Harry Horace Harman. 1976. Manual for Kit of Factor-Referenced Cognitive Tests, volume 102. Educational Testing Service, Princeton, NJ.

Michael C. Frank, Daniel L. Everett, Evelina Fedorenko, and Edward Gibson. 2008. Number as a cognitive technology: Evidence from Pirahã language and cognition. Cognition, 108(3):819–824.

Yaroslav Fyodorov. A natural logic inference system. Citeseer.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9. Association for Computational Linguistics.

Oren Glickman, Ido Dagan, and Moshe Koppel. 2005. Web based probabilistic textual entailment.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324.

Aria Haghighi, Andrew Ng, and Christopher Manning. 2005. Robust textual inference via graph matching. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.

Sanda Harabagiu and Andrew Hickl. 2006. Methods for using textual entailment in open-domain question answering. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 905–912. Association for Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533.

Danqing Huang, Shuming Shi, Chin-Yew Lin, and Jian Yin. 2017. Learning fine-grained expressions to solve math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 805–814.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering.

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.

Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014a. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 271–281.

Nate Kushman, Luke S. Zettlemoyer, Regina Barzilay, and Yoav Artzi. 2014b. Learning to automatically solve algebra word problems. In ACL.

Stephen C. Levinson. 2001. Pragmatics. In International Encyclopedia of Social and Behavioral Sciences: Vol. 17, pages 11948–11954. Pergamon.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017a. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017b. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.

Bill MacCartney. 2009. Natural Language Inference. Ph.D. thesis, Stanford University.

Prodromos Malakasiotis and Ion Androutsopoulos. 2007. Learning textual entailment using SVMs and string similarity measures. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 42–47. Association for Computational Linguistics.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A SICK cure for the evaluation of compositional distributional semantic models.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).


Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2144–2153.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 41–45, Copenhagen, Denmark. Association for Computational Linguistics.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction.

Subhro Roy. 2017. Reasoning about Quantities in Natural Language. Ph.D. thesis, University of Illinois at Urbana-Champaign.

Mark Sammons, V.G. Vydiswaran, and Dan Roth. 2010. Ask not what textual entailment can do for you... In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1199–1208. Association for Computational Linguistics.

Richard E. Stafford. 1972. Hereditary and environmental components of quantitative reasoning. Review of Educational Research, 42(2):183–201.

Shyam Upadhyay, Ming-Wei Chang, Kai-Wei Chang, and Wen-tau Yih. 2016. Learning from explicit and implicit supervision jointly for algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 297–306.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

F. Zanzotto, Alessandro Moschitti, Marco Pennacchiotti, and M. Pazienza. 2006. Learning textual entailment from examples. In Second PASCAL Recognizing Textual Entailment Challenge, page 50. PASCAL.

Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to solve algebra word problems using quadratic programming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 817–822.

Appendix

A Baseline performance on MultiNLI-Dev Matched

Model                 MultiNLI Dev
Hyp Only              53.18%
ALIGN                 45.0%
CBOW                  63.5%
BiLSTM                70.2%
Chen                  73.7%
NB                    74.2%
InferSent             70.3%
ESIM                  76.2%
OpenAI Transformer    81.35%
BERT                  83.8%

Table 7: Performance of all baseline models used in the paper on the matched subset of MultiNLI-Dev

Table 7 presents classification accuracies of all baseline models used in this work on the matched subset of MultiNLI-Dev. These scores are very close to the numbers reported by the original publications, affirming the correctness of our baseline setup.

B Examples of quantitative phenomena present in EQUATE

Table 8 presents some examples from EQUATE which demonstrate interesting quantitative phenomena that must be understood to label the pair correctly.


Phenomenon: Arithmetic
P: Sharper faces charges in Arizona and California
H: Sharper has been charged in two states

Phenomenon: Ranges
P: Between 20 and 30 people were trapped in the casino
H: Upto 30 people thought trapped in casino

Phenomenon: Quantifiers
P: Poll: Obama over 50% in Florida
H: New poll shows Obama ahead in Florida

Phenomenon: Ordinals
P: Second-placed Nancy celebrated their 40th anniversary with a win
H: Nancy stay second with a win

Phenomenon: Approximation
P: Rwanda has dispatched 1917 soldiers
H: Rwanda has dispatched some 1900 soldiers

Phenomenon: Ratios
P: Londoners had the highest incidence of E. Coli bacteria (25%)
H: 1 in 4 Londoners have E. Coli bacteria

Phenomenon: Comparison
P: Treacherous currents took four lives on the Alabama Gulf coast
H: Rip currents kill four in Alabama

Phenomenon: Conversion
P: If the abuser has access to a gun, it increases chances of death by 500%
H: Victim five times more likely to die if abuser is armed

Phenomenon: Numeration
P: Eight suspects were arrested
H: 8 suspects have been arrested

Phenomenon: Implicit Quantities
P: The boat capsized two more times
H: His sailboat capsized three times

Table 8: Examples of quantitative phenomena present in EQUATE