
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1–12, August 1–6, 2021. ©2021 Association for Computational Linguistics


Explainable Inference Over Grounding-Abstract Chains for Science Questions

Mokanarangan Thayaparan†, Marco Valentino†, André Freitas†‡
†Department of Computer Science, University of Manchester, United Kingdom
‡Idiap Research Institute, Switzerland
{firstname.lastname}@manchester.ac.uk

Abstract

We propose an explainable inference approach for science questions by reasoning on grounding and abstract inference chains. This paper frames question answering as a natural language abductive reasoning problem, constructing plausible explanations for each candidate answer and then selecting the candidate with the best explanation as the final answer. Our method, ExplanationLP, elicits explanations by constructing a weighted graph of relevant facts for each candidate answer and employs a linear programming formalism designed to select the optimal subgraph of explanatory facts. The graphs' weighting function is composed of a set of parameters targeting relevance, cohesion and diversity, which we fine-tune for answer selection via Bayesian Optimisation. We carry out our experiments on the WorldTree and ARC-Challenge datasets to empirically demonstrate the following contributions: (1) ExplanationLP obtains strong performance when compared to transformer-based and multi-hop approaches despite having a significantly lower number of parameters; (2) We show that our model is able to generate plausible explanations for answer prediction; (3) Our model demonstrates better robustness towards semantic drift when compared to transformer-based and multi-hop approaches.

1 Introduction

Answering science questions remains a fundamental challenge in Natural Language Processing and AI as it requires complex forms of inference, including causal, model-based and example-based reasoning (Jansen, 2018; Clark et al., 2018; Jansen et al., 2016; Clark et al., 2013). Current state-of-the-art (SOTA) approaches for answering questions in the science domain are dominated by transformer-based models (Devlin et al., 2019; Sun et al., 2019). Despite remarkable performance on answer prediction, these approaches are black-box by nature, lacking the capability of providing explanations for their predictions (Thayaparan et al., 2020; Miller, 2019; Biran and Cotton, 2017; Jansen et al., 2016).

Explainable Science Question Answering (XSQA) is often framed as a natural language abductive reasoning problem (Khashabi et al., 2018; Jansen et al., 2017). Abductive reasoning represents a distinct inference process, known as inference to the best explanation (Peirce, 1960; Lipton, 2017), which starts from a set of complete or incomplete observations to find the hypothesis, from a set of plausible alternatives, that best explains the observations. Several approaches (Khashabi et al., 2018; Jansen et al., 2017; Khot et al., 2017a; Khashabi et al., 2016) employ this form of reasoning for multiple-choice science questions to build a set of plausible explanations for each candidate answer and select the one with the best explanation as the final answer.

XSQA solvers typically treat explanation generation as a multi-hop graph traversal problem. Here, the solver attempts to compose multiple facts that connect the question to a candidate answer. These multi-hop approaches have shown diminishing returns with an increasing number of hops (Jansen et al., 2018; Jansen, 2018). Fried et al. (2015) conclude that this phenomenon is due to semantic drift – i.e., as the number of aggregated facts increases, so does the probability of drifting out of context. Khashabi et al. (2019) propose a theoretical framework, empirically supported by Jansen et al. (2018) and Fried et al. (2015), attesting that ongoing efforts with very long multi-hop reasoning chains are unlikely to succeed, emphasising the need for a richer representation with fewer hops and higher importance to abstraction and grounding mechanisms.

Consider the example in Figure 1A, where the central concept the question examines is the understanding of friction.


[Figure 1 (example content): Question (Q): "What is an example of force producing heat?"; Candidate Answer (C1): "Two sticks getting warm when rubbed together"; Hypothesis (H1): "Two sticks getting warm when rubbed together is an example of force producing heat". Grounding Facts: a stick is an object (FG1); friction is a force (FG2); a pull is a force (FG3); to rub together means to move against (FG4); rubbing against something is a kind of movement (FG5). Abstract Facts: friction occurs when two objects' surfaces move against each other (FC1); friction causes the temperature of an object to increase (FC2); magnetic attraction pulls two objects together (FC3).]

Figure 1: Overview of our approach: (A) depicts a question, answer and formulated hypothesis along with the set of facts retrieved from a fact retrieval approach; (B) illustrates the optimisation process behind extracting explanatory facts for the provided hypothesis and facts; (C) details the end-to-end architecture diagram.

Here, an inference solver's challenge is to identify the core scientific facts (Abstract Facts) that best explain the answer. To achieve this goal, a QA solver should be able first to go from force to friction, stick to object and rubbing together to move against. These are the Grounding Facts that link generic or abstract concepts in a core scientific statement to specific terms occurring in the question and candidate answer (Jansen et al., 2018). The grounding process is followed by the identification of the abstract facts about friction. A complete explanation for this question would require the composition of five facts to derive the correct answer successfully. However, it is possible to reduce the global reasoning to two hops, modelling it with grounding and abstract facts.

In line with these observations, this work presents a novel approach that explicitly models abstract and grounding mechanisms. The contributions of the paper are:

1. We present a novel approach that performs natural language abductive reasoning via grounding-abstract chains, combining Linear Programming with Bayesian optimisation for science question answering (Section 2).

2. We obtain comparable performance when compared to transformers, multi-hop approaches and previous Linear Programming models despite having a significantly lower number of parameters (Section 3.1).

3. We demonstrate that our model can generate plausible explanations for answer prediction (Section 3.2) and validate the importance of grounding-abstract chains via ablation analysis (Section 3.3).

2 ExplanationLP: Abductive Reasoning with Linear Programming

ExplanationLP answers and explains multiple-choice science questions via abductive natural language reasoning. Specifically, the task of answering multiple-choice science questions is reformulated as the problem of finding the candidate answer that is supported by the best explanation. For each question $Q$ and candidate answer $c_i \in C$, ExplanationLP converts the pair into a hypothesis $h_i$ and attempts to construct a plausible explanation.

Figure 1C illustrates the end-to-end framework. From an initial set of facts selected using a retrieval model, ExplanationLP constructs a fact graph where each node is a fact, and the nodes and edges are scored according to three properties: relevance, cohesion and diversity. Subsequently, an optimal subgraph is extracted using Linear Programming, whose role is to select the best subset of facts while preserving structural constraints imposed via grounding-abstract chains. The subgraphs' global scores, computed by summing up the node and edge scores, are adopted to select the final answer. Since the subgraph scores depend on the sum of node and edge scores, each property is multiplied by a learnable weight which is optimised via Bayesian Optimisation to obtain the best possible combination with the highest accuracy for answer selection.


To the best of our knowledge, we are the first to combine a parameter optimisation method with Linear Programming for inference. The rest of this section describes the model in detail.

2.1 Relevant facts retrieval

Given a question ($Q$) and candidate answers $C = \{c_1, c_2, c_3, \ldots, c_n\}$, we convert them to hypotheses $\{h_1, h_2, h_3, \ldots, h_n\}$ using the approach proposed by Demszky et al. (2018). For each hypothesis $h_i$ we adopt fact retrieval approaches (e.g. BM25, Unification retrieval (Valentino et al., 2021)) to select the top $m$ relevant abstract facts $F^{h_i}_A = \{f^{h_i}_1, f^{h_i}_2, f^{h_i}_3, \ldots, f^{h_i}_m\}$ from a knowledge base containing abstract facts (Abstract Facts KB), and the top $l$ relevant grounding facts $F^{h_i}_G = \{f^{h_i}_1, f^{h_i}_2, f^{h_i}_3, \ldots, f^{h_i}_l\}$ from a knowledge base containing grounding facts (Grounding Facts KB) that connect at least one abstract fact with the hypothesis, such that $F^{h_i} = F^{h_i}_A \cup F^{h_i}_G$ and $l + m = k$.
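For illustration, the sketch below shows how such a top-k retrieval step could be implemented with an off-the-shelf BM25 library; the rank_bm25 dependency, the toy knowledge bases and the helper function are illustrative assumptions, not the retrieval code actually used in the paper (Appendix A.6). The example facts are taken from Figure 1.

```python
# Minimal sketch (not the paper's implementation): BM25-based retrieval of the
# top-k abstract and grounding facts for one hypothesis.
from rank_bm25 import BM25Okapi

def top_k_facts(hypothesis, kb_sentences, k):
    """Return the k facts with the highest BM25 score for the hypothesis."""
    tokenized_kb = [fact.lower().split() for fact in kb_sentences]
    bm25 = BM25Okapi(tokenized_kb)
    scores = bm25.get_scores(hypothesis.lower().split())
    ranked = sorted(zip(kb_sentences, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]  # list of (fact, lexical relevance score L) pairs

hypothesis = ("Two sticks getting warm when rubbed together "
              "is an example of force producing heat")
abstract_kb = ["friction occurs when two objects' surfaces move against each other",
               "friction causes the temperature of an object to increase",
               "magnetic attraction pulls two objects together"]
grounding_kb = ["a stick is an object", "friction is a force", "a pull is a force",
                "to rub together means to move against"]

abstract_facts = top_k_facts(hypothesis, abstract_kb, k=2)    # F_A with scores
grounding_facts = top_k_facts(hypothesis, grounding_kb, k=3)  # F_G with scores
```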

2.2 Fact graph construction

For each hypothesis $h_i$ we build a weighted undirected graph $G^{h_i} = (V^{h_i}, E^{h_i}, \omega_v, \omega_e)$ with vertices $V^{h_i} = \{h_i\} \cup F^{h_i}$, edges $E^{h_i}$, edge-weight function $\omega_e(e_i; \theta_1)$ and node-weight function $\omega_v(v_i; \theta_2)$, where $e_i \in E^{h_i}$, $v_i \in V^{h_i}$ and $\theta_1, \theta_2 \in [0, 1]$ are learnable parameters optimised via Bayesian optimisation.

The model scores the nodes and edges based on the following three properties (see Figure 1B):

(1) Relevance: We promote the inclusion of highly relevant facts in the explanations by encouraging the selection of sentences with higher lexical relevance and semantic similarity to the hypothesis. We use the following scores to measure the relevance and the semantic similarity of the facts:
Lexical Relevance score (L): obtained from the upstream fact retrieval model (e.g. BM25 score / Unification score (Valentino et al., 2021)).
Semantic Similarity score (S): cosine similarity obtained from neural sentence representation models. For our experiments, we adopt Sentence-BERT (Reimers et al., 2019) since it shows state-of-the-art performance in semantic textual similarity tasks.

(2) Cohesion: Explanations should be cohesive, implying that grounding-abstract chains should remain within the same context. To achieve cohesion, we encourage a high degree of overlap between different hops (e.g. hypothesis-grounding, grounding-abstract, hypothesis-abstract) to prevent the inference chains from drifting away from the original context. The overlap across two hops is quantified using the following scoring function:
Cohesion score (C): We denote the set of unique terms of a given fact $f^{h_i}_j$ as $t(f^{h_i}_j)$, after lemmatisation and stopword removal. The overlap score of two facts $f^{h_i}_j$ and $f^{h_i}_k$ is given by:

$$C(f^{h_i}_j, f^{h_i}_k) = \frac{|t(f^{h_i}_j) \cap t(f^{h_i}_k)|}{\max(|t(f^{h_i}_j)|, |t(f^{h_i}_k)|)}$$

Therefore, the higher the number of term overlaps, the higher the cohesion score.

(3) Diversity: While maximizing relevance and cohesion between different hops, we encourage diversity between facts of the same type (e.g. abstract-abstract, grounding-grounding) to address different parts of the hypothesis and promote completeness in the explanations. We measure diversity via the following function:
Diversity score (D): We denote the overlap between hypothesis $h_i$ and fact $f^{h_i}_j$ as $t_{h_i}(f^{h_i}_j) = t(f^{h_i}_j) \cap t(h_i)$. The diversity score of two facts $f^{h_i}_j$ and $f^{h_i}_k$ is given by:

$$D(f^{h_i}_j, f^{h_i}_k) = -1 \cdot \frac{|t_{h_i}(f^{h_i}_j) \cap t_{h_i}(f^{h_i}_k)|}{\max(|t_{h_i}(f^{h_i}_j)|, |t_{h_i}(f^{h_i}_k)|)}$$

The goal is to maximise diversity and avoid redundant facts in the explanations. Therefore, if two facts overlap with different parts of the hypothesis, they will have a higher diversity score compared to two facts that overlap with the same part.
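A minimal sketch of the cohesion and diversity scores defined above follows; it assumes the facts have already been lemmatised and stripped of stopwords, which is approximated here with plain whitespace tokenisation.

```python
# Illustrative sketch of the cohesion (C) and diversity (D) scores.
# t(f): set of unique (lemmatised, stopword-free) terms of a fact;
# approximated here with simple lower-cased whitespace tokenisation.
def terms(fact: str) -> set:
    return set(fact.lower().split())

def cohesion(fact_j: str, fact_k: str) -> float:
    """C(f_j, f_k): term overlap between two facts from different hops."""
    tj, tk = terms(fact_j), terms(fact_k)
    return len(tj & tk) / max(len(tj), len(tk))

def diversity(fact_j: str, fact_k: str, hypothesis: str) -> float:
    """D(f_j, f_k): negative overlap restricted to hypothesis terms,
    penalising same-type facts that cover the same part of the hypothesis."""
    th = terms(hypothesis)
    tj, tk = terms(fact_j) & th, terms(fact_k) & th
    if max(len(tj), len(tk)) == 0:
        return 0.0
    return -1.0 * len(tj & tk) / max(len(tj), len(tk))
```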

Given these premises, the weight functions of the graph are designed as follows:

$$\omega_e(v_j, v_k; \theta_1) = \begin{cases} \theta_{gg} D(v_j, v_k) & v_j, v_k \in F^{h_i}_G \\ \theta_{aa} D(v_j, v_k) & v_j, v_k \in F^{h_i}_A \\ \theta_{ga} C(v_j, v_k) & v_j \in F^{h_i}_G,\; v_k \in F^{h_i}_A \\ \theta_{qg} C(v_j, v_k) & v_j \in F^{h_i}_G,\; v_k = h_i \\ \theta_{qa} C(v_j, v_k) & v_j \in F^{h_i}_A,\; v_k = h_i \end{cases}$$

$$\omega_v(v_j; \theta_2) = \begin{cases} \theta_{lr} L(v_j, h_i) + \theta_{ss} S(v_j, h_i) & v_j \in F^{h_i}_A \\ 0 & v_j \in F^{h_i}_G \\ 0 & v_j = h_i \end{cases}$$

where $\theta_{gg}, \theta_{aa}, \theta_{ga}, \theta_{qg}, \theta_{qa} \in \theta_1$ and $\theta_{lr}, \theta_{ss} \in \theta_2$.
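Read as code, the piecewise definition above is a case analysis over node types. The sketch below (building on the scoring sketch earlier) is illustrative only; the θ values are arbitrary placeholders rather than the optimised parameters.

```python
# Illustrative piecewise weighting; theta values are placeholders in [0, 1].
theta_edge = {"gg": 0.1, "aa": 0.1, "ga": 0.5, "qg": 0.6, "qa": 0.8}  # theta_1
theta_node = {"lr": 0.7, "ss": 0.3}                                    # theta_2

def edge_weight(kind_j, kind_k, cohesion_score, diversity_score):
    """omega_e: diversity within a fact type, cohesion across different hops."""
    if kind_j == kind_k == "grounding":
        return theta_edge["gg"] * diversity_score
    if kind_j == kind_k == "abstract":
        return theta_edge["aa"] * diversity_score
    if {kind_j, kind_k} == {"grounding", "abstract"}:
        return theta_edge["ga"] * cohesion_score
    if {kind_j, kind_k} == {"grounding", "hypothesis"}:
        return theta_edge["qg"] * cohesion_score
    return theta_edge["qa"] * cohesion_score      # abstract-hypothesis pair

def node_weight(kind, lexical_score, semantic_score):
    """omega_v: only abstract facts receive a relevance weight."""
    if kind == "abstract":
        return theta_node["lr"] * lexical_score + theta_node["ss"] * semantic_score
    return 0.0
```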

2.3 Subgraph extraction with Linear Programming (LP) optimisation

The construction of the explanation graph has to be optimised for the downstream answer selection task. Specifically, from the whole set of facts retrieved by the upstream retrieval models, we need to select the optimal subgraph that maximises the performance of answer prediction. To achieve this goal, we adopt a Linear Programming approach.

The selection of the explanation graph is framed as a rooted maximum-weight connected subgraph problem with a maximum number of $K$ vertices (R-MWCSK). This formalism is derived from the generalized maximum-weight connected subgraph problem (Loboda et al., 2016). R-MWCSK has two parts: an objective function to be maximised and constraints to build a connected subgraph of explanatory facts. The formal definition of the objective function is as follows:

Definition 1. Given a connected undirected graph $G = (V, E)$ with edge-weight function $\omega_e : E \rightarrow \mathbb{R}$, node-weight function $\omega_v : V \rightarrow \mathbb{R}$, root vertex $r \in V$ and expected number of vertices $K$, the rooted maximum-weight connected subgraph problem with $K$ vertices (R-MWCSK) is the problem of finding the connected subgraph $G' = (V', E')$ such that $r \in V'$, $|V'| \leq K$ and

$$\Omega(G'; \theta_3) = \theta_{vw} \sum_{v \in V'} \omega_v(v; \theta_2) + \theta_{ew} \sum_{e \in E'} \omega_e(e; \theta_1) \rightarrow \max$$

where $\theta_{vw}, \theta_{ew} \in \theta_3$, $\theta_3 \in [0, 1]$ and $\theta_3$ is a learnable parameter optimised via Bayesian optimisation. The LP solver will seek to extract the optimal subgraph with the highest possible sum of node and edge weights. Since the solver seeks to obtain the highest possible score, it will avoid negative edges and prioritise high-value positive edges, resulting in higher diversity, cohesion and relevance. We adopt the following binary variables to represent the presence of nodes and edges in the subgraph:

1. Binary variable $y_v$ takes the value of 1 iff $v \in V^{h_i}$ belongs to the subgraph.

2. Binary variable $z_e$ takes the value of 1 iff $e \in E^{h_i}$ belongs to the subgraph.

In order to emulate the grounding-abstract inference chains and obtain a valid subgraph, we impose the set of constraints described in Table 1 on the LP solver.
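To make the formulation concrete, the following sketch encodes the objective, the hypothesis-root constraint, the edge-node consistency constraints and the abstract-fact limit (Equations 1 and 3-6 in Table 1) with the open-source PuLP solver. This is an assumption-laden illustration: the paper's implementation uses CPLEX (Appendix A.1), and the remaining chaining and grounding-neighbour constraints are omitted for brevity.

```python
# Minimal R-MWCSK-style sketch with PuLP (the paper itself uses CPLEX).
import pulp

def extract_subgraph(nodes, edges, node_w, edge_w, hypothesis, abstract_nodes, K=2):
    """nodes: fact identifiers (e.g. 'H1', 'FG1', 'FC1'); edges: (u, v) pairs;
    node_w / edge_w: weight dictionaries with theta_vw / theta_ew already folded in."""
    prob = pulp.LpProblem("explanation_subgraph", pulp.LpMaximize)
    y = {v: pulp.LpVariable(f"y_{v}", cat="Binary") for v in nodes}
    z = {e: pulp.LpVariable(f"z_{e[0]}_{e[1]}", cat="Binary") for e in edges}

    # Objective: sum of selected node and edge weights.
    prob += (pulp.lpSum(node_w[v] * y[v] for v in nodes)
             + pulp.lpSum(edge_w[e] * z[e] for e in edges))

    prob += y[hypothesis] == 1                       # (1) hypothesis always selected
    for (u, v) in edges:                             # (3)-(5) edge/node consistency
        prob += z[(u, v)] <= y[u]
        prob += z[(u, v)] <= y[v]
        prob += z[(u, v)] >= y[u] + y[v] - 1
    prob += pulp.lpSum(y[v] for v in abstract_nodes) <= K   # (6) abstract-fact limit

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    selected = [v for v in nodes if y[v].value() == 1]
    return selected, pulp.value(prob.objective)       # subgraph and score Omega
```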

2.4 Bayesian Optimisation for Answer Selection

Given question $Q$ and choices $C = \{c_1, c_2, c_3, \ldots, c_n\}$, we extract the optimal explanation graphs $G^Q = \{G^{c_1}, G^{c_2}, G^{c_3}, \ldots, G^{c_n}\}$, one for each choice. We consider the hypothesis with the highest relevance, cohesion and diversity to be the correct answer. Based on this premise, we define the correct answer as

$$c_{ans} = \arg\max_{h_i} \; \Omega(G^{h_i})$$

In order to automatically optimise the Linear Programming model (i.e. $\theta_1$, $\theta_2$, $\theta_3$) we use Bayesian optimisation. The algorithm is defined below (here GP is a Gaussian Process and LP is the Linear Programming module).

Algorithm 1: Bayesian Optimisation
  θ1, θ2, θ3 = initRandom(seed)
  G^Q = fact-graph-construction(ωe(θ1), ωv(θ2))
  G^Q = LP(G^Q, Ω(θ3))
  X = evaluate-accuracy(G^Q)
  model = GP(X, θ1, θ2, θ3)
  iteration = 0
  while iteration ≤ N do
      θ'1, θ'2, θ'3 = get-next-exploration-point()
      G^Q' = fact-graph-construction(ωe(θ'1), ωv(θ'2))
      G^Q' = LP(G^Q', Ω(θ'3))
      X' = evaluate-accuracy(G^Q')
      model.update(X', θ'1, θ'2, θ'3)
      iteration = iteration + 1
  end
  Result: Best accuracy for model and respective parameters θ1, θ2, θ3
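A compact sketch of this loop using the Bayesian-Optimization library referenced in Appendix A.2 is given below; the objective function is a placeholder standing in for the full graph-construction, LP-solving and answer-selection pipeline, so the parameter names and the dummy return value are illustrative assumptions.

```python
# Sketch of the parameter search with the bayesian-optimization library
# (https://github.com/fmfn/BayesianOptimization).
from bayes_opt import BayesianOptimization

def dev_accuracy(**theta):
    """Placeholder objective: the real pipeline would build the weighted fact
    graphs with theta, solve the LP per candidate answer, pick the answer with
    the highest objective value and return accuracy on the dev questions."""
    return sum(theta.values()) / len(theta)   # dummy value so the sketch runs

param_bounds = {name: (0.0, 1.0)
                for name in ["gg", "aa", "ga", "qg", "qa", "lr", "ss", "vw", "ew"]}

optimizer = BayesianOptimization(f=dev_accuracy, pbounds=param_bounds, random_state=1)
optimizer.maximize(init_points=5, n_iter=20)   # the paper runs 200 iterations
print(optimizer.max)                           # best score and parameter setting
```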

3 Empirical Evaluation

Background Knowledge: We construct the required knowledge bases using the following sources.
(1) Abstract KB: Our abstract knowledge base is constructed from the WorldTree Tablestore corpus (Xie et al., 2020; Jansen et al., 2018). The Tablestore corpus contains a set of common-sense and scientific facts adopted to create explanations for multiple-choice science questions. The corpus is built for answering elementary science questions, encouraging knowledge reuse to elicit explanatory patterns. We extract the core scientific facts to build the Abstract KB.


$$y_{v_i} = 1 \quad \text{if } v_i = h_i \quad (1)$$
$$y_{v_i} \leq \sum_{v_j \in N_{G^{h_i}}(v_i)} y_{v_j} \quad (2)$$
$$z_{v_i,v_j} \leq y_{v_i} \quad \forall e_{(v_i,v_j)} \in E \quad (3)$$
$$z_{v_i,v_j} \leq y_{v_j} \quad \forall e_{(v_i,v_j)} \in E \quad (4)$$
$$z_{v_i,v_j} \geq y_{v_i} + y_{v_j} - 1 \quad \forall e_{(v_i,v_j)} \in E \quad (5)$$

Chaining constraints: Equation 1 states that the subgraph should always contain the hypothesis node. Inequality 2 states that if a vertex is to be part of the subgraph, then at least one of its neighbours with a lexical overlap should also be part of the subgraph. Equation 1 and Inequality 2 restrict the LP method to construct explanations that originate from the hypothesis and perform multi-hop aggregation based on the existence of lexical overlap. Inequalities 3, 4 and 5 state that if two vertices are in the subgraph then the edges connecting the vertices should also be in the subgraph. These inequality constraints force the LP method to avoid grounding nodes with high overlap regardless of their relevance.

$$\sum_{v_i \in F^{h_i}_A} y_{v_i} \leq K \quad (6)$$

Abstract fact limit constraint: Inequality 6 limits the total number of abstract facts to $K$. Instead of limiting the total number of selected nodes to $K$, by limiting the abstract facts we dictate the need for grounding facts based on the number of terms present in the hypothesis and in the abstract facts.

$$\sum_{v_i} y_{v_i} - 2 \geq -2(1 - y_{v_j}) \quad \forall v_i \in N_{G^{h_i}}(v_j),\; v_i \in F^{h_i}_A \cup \{h_i\},\; v_j \in F^{h_i}_G \quad (7)$$

Grounding neighbour constraint: Inequality 7 states that if a grounding fact is selected, then at least two of its neighbours should be either both abstract facts or a hypothesis and an abstract fact. This constraint ensures that grounding facts play the linking role connecting hypothesis and abstract facts.

Table 1: Linear programming constraints employed by ExplanationLP to emulate grounding-abstract inference chains and extract the optimal subgraph.

Core scientific facts are independent from the specific questions and represent general scientific and commonsense knowledge, such as Actions (friction occurs when two objects' surfaces move against each other) or Affordances (friction causes the temperature of an object to increase).
(2) Grounding KB: The grounding knowledge base consists of definitional knowledge (e.g., synonymy and taxonomy) that can account for the lexical variability of questions and help link them to abstract facts. To achieve this goal, we select the is-a and synonymy facts from ConceptNet (Speer et al., 2017) as our grounding facts. ConceptNet has high coverage and precision, enabling us to answer a wide variety of questions.

Question Sets: We use the following question sets to evaluate ExplanationLP's performance and compare it against other explainable approaches:
(1) WorldTree Corpus: The 2,290 questions in the WorldTree corpus are split into three different subsets: train-set (987), dev-set (226) and test-set (1,077). We use the dev-set to assess the explainability performance and robustness analysis since the explanations for the test-set are not publicly available.

(2) ARC-Challenge Corpus: ARC-Challenge is a multiple-choice question dataset which consists of questions from science exams from grade 3 to grade 9 (Clark et al., 2018). We only consider the Challenge set of questions. These questions have proven to be challenging to answer for other LP-based question answering and neural approaches. ExplanationLP relies only on the train-set (1,119) and is tested on the test-set (1,172). ExplanationLP does not require a dev-set, since the possibility of over-fitting is non-existent with only ten parameters.

Relevant Facts Retrieval (FR): We experiment with two different fact retrieval scores. The first model, BM25 Retrieval, adopts a BM25 vector representation for hypothesis and explanation facts. We apply this retrieval for both grounding and abstract retrieval. We use the IDF score from BM25 as our downstream model's relevance score. The second approach, Unification Retrieval (UR), represents the BM25 implementation of the Unification-based Reconstruction framework described in Valentino et al. (2021). The unification score for a given fact depends on how often the same fact appears in explanations for similar questions.
Baselines: The following baselines are replicated on the WorldTree corpus to compare against ExplanationLP:


(1) BERT-based models: We compare the ExplanationLP model's performance against a set of BERT baselines. The first baseline, BERTBase/BERTLarge, is represented by a standard BERT language model (Devlin et al., 2019) fine-tuned for multiple-choice question answering. Specifically, the model is trained for binary classification on each question-candidate answer pair to maximise the correct choice (i.e., predict 1) and minimise the wrong choices (i.e., predict 0). During inference, we select the choice with the highest prediction score as the correct answer. BERT baselines are further enhanced with explanatory facts retrieved by the retrieval models: the first model, BERT + BM25, is fine-tuned for binary classification by complementing the question-answer pair with grounding and abstract facts selected by BM25 retrieval.

Similarly, the second model, BERT + UR, complements the question-answer pair with grounding and abstract facts selected using Unification retrieval.
(2) PathNet (Kundu et al., 2019): PathNet is a neural approach that constructs a single linear path composed of two facts connected via entity pairs for reasoning. PathNet can also explain its reasoning via explicit reasoning paths. It has exhibited strong performance for multiple-choice science questions by composing two facts. Similar to the BERT-based models, we employ PathNet with the top k facts retrieved using Unification (PathNet + UR) and BM25 (PathNet + BM25) retrieval. We concatenate the facts retrieved for each candidate answer and provide them as supporting facts.

Further details regarding the hyperparameters and code used for each model, along with information concerning the knowledge base construction and dataset information, can be found in the Supplementary Material.

3.1 Answer Selection

WorldTree Corpus: We retrieve the top $l$ relevant grounding facts from the Grounding KB and the top $m$ relevant abstract facts from the Abstract KB such that $l + m = k$ and $l = m$. To ensure fairness across the approaches, the same number of facts is presented to each model. We experimented with $k = \{10, 20, 30, 40, 50\}$ and report in Table 2 the accuracy across the Easy and Challenge splits for the best performing setting of each model.

#    Model                       Easy    Challenge
1    BERTBase                    51.04   28.75
2    BERTLarge                   54.58   29.39
3    BERTBase + BM25 (k=10)      53.92   42.72
4    BERTLarge + BM25 (k=10)     54.05   43.45
5    BERTBase + UR (k=10)        52.87   42.17
6    BERTLarge + UR (k=10)       58.50   43.72
7    PathNet + BM25 (k=20)       43.32   36.42
8    PathNet + UR (k=15)         47.64   33.55
9    Ours + BM25 (k=30)          63.82   48.24
10   Ours + UR (k=30)            66.23   50.15

Table 2: Accuracy on the Easy (764) and Challenge (313) splits of the WorldTree test-set corpus for the best performing k of each model.

#    Model                                    Explainable   Accuracy
1    BERTLarge                                No            35.11
2    IR Solver (Clark et al., 2016)           Yes           20.26
3    TupleInf (Khot et al., 2017b)            Yes           23.83
4    TableILP (Khashabi et al., 2016)         Yes           26.97
5    DGEM (Clark et al., 2016)                Partial       27.11
6    KG^2 (Zhang et al., 2018)                Partial       31.70
7    ET-RR (Ni et al., 2019)                  Partial       36.61
8    Unsupervised AHE (Yadav et al., 2019a)   Partial       33.87
9    Supervised AHE (Yadav et al., 2019a)     Partial       34.47
10   AutoRocc (Yadav et al., 2019b)           Partial       41.24
11   Ours + BM25 (k=40)                       Yes           40.21
12   Ours + UR (k=40)                         Yes           39.84

Table 3: ARC-Challenge scores compared with other fully or partially explainable approaches trained only on the ARC dataset.

We draw the following conclusions:
(1) Despite having a smaller number of parameters to train (BERTBase: 110M parameters, BERTLarge: 340M parameters, ExplanationLP: 9 parameters), the best performing ExplanationLP (#10) overall outperforms all the BERTBase and BERTLarge models on both the Challenge and Easy splits. We outperform the best performing BERT model with facts (BERTLarge (#6)) by 7.74% in Easy and 6.43% in Challenge. We also outperform the best performing BERT without facts (BERTLarge (#2)) by 11.66% in Easy and 20.76% in Challenge.
(2) BERT is inherently a black-box model: it is not entirely possible to explain its predictions. By contrast, ExplanationLP is fully explainable and produces a complete explanatory graph.
(3) Similar to ExplanationLP, PathNet is also explainable and demonstrates robustness to noise.


CASE I: All the selected facts are in the gold explanation (Frequency: 33%)
Question: A company wants to make a game that uses a magnet that sticks to a board. Which material should it use for the board? Answer: steel
Explanations: (1) steel is a metal (Grounding), (2) if a magnet is attracted to a metal then that magnet will stick to that metal (Abstract), (3) a magnet attracts magnetic metals through magnetism (Abstract)

CASE II: At least one selected fact is in the gold explanation (Frequency: 58%)
Question: A large piece of ice is placed on the sidewalk on a warm day. What will happen to the ice? Answer: It will melt to form liquid water.
Explanations: (1) drop is liquid small amount (Grounding), (2) forming something is change (Grounding), (3) ice wedging is mechanical weathering (Grounding), (4) melting means changing from a solid into a liquid by adding heat energy (Abstract), (5) weathering means breaking down surface materials from larger whole into smaller pieces by weather (Abstract)

CASE III: No retrieved fact is in the gold explanation (Frequency: 9%)
Question: Wind is a natural resource that benefits the southeastern shore of the Chesapeake Bay. How could these winds best benefit humans? Answer: The winds could be converted to electrical energy
Explanations: (1) renewable resource is natural resource (Grounding), (2) wind is a renewable resource (Abstract), (3) electrical devices convert electricity into other forms of energy (Abstract)

Table 4: Case study of explanations extracted by ExplanationLP.

ExplanationLP also outperforms PathNet's best performing setting (#8) by 18.59% in Easy and 16.60% in Challenge.
(4) ExplanationLP consistently exhibits better scores with both BM25 and UR than BERT and PathNet, demonstrating independence from the upstream retrieval model for performance.

ARC-Challenge: We also evaluated our model on the ARC-Challenge corpus (Clark et al., 2018) in order to assess ExplanationLP on a more extensive general question set and compare it against contemporary approaches that provide explanations for their inference and have only been trained on the ARC corpus. Table 3 reports the results on the test-set. We compare ExplanationLP against published approaches that are fully or partly explainable. Here, explainability indicates whether the model produces an explanation/evidence for the predicted answer. A subset of the approaches produces evidence for the answer but remains intrinsically black-box; these models have been marked as Partial.

As depicted in Table 3, we outperform the best performing fully explainable model (#4, TableILP) by 13.28%. We also outperform specific neural approaches with larger parameter sets (#5–#9) that provide explanations for their inference, as well as BERT (#1). Despite having a smaller number of training parameters, we also exhibit competitive performance with a state-of-the-art BERT-based approach (#10) that does not use external resources to train the QA system.

3.2 Explainability

Approach               Precision   Recall   F1
PathNet + UR (k=20)    21.56       36.55    29.06
Ours + UR (k=30)       57.96       49.92    48.13

Table 5: Explanation retrieval performance on the WorldTree corpus dev-set.

Table 5 shows the Precision, Recall and F1 (macro) scores for explanation retrieval for PathNet and ExplanationLP. These scores are computed using the gold abstract explanations from the WorldTree corpus. We outperform PathNet across the board by a significant margin.

Table 4 reports three representative cases that show how explanation generation relates to correct answer prediction. The first example (Case I) represents the situation in which all the selected sentences are annotated as gold explanations in the WorldTree corpus (dev-set). The second example (Case II) shows the case in which at least one sentence in the explanation is labelled as gold. Finally, the third example (Case III) represents the case in which the explanation generated by the method does not contain any gold fact. We observe that Case I and Case II occur for over 91% of the questions, demonstrating that the correct answers are mostly derived from plausible explanations.

3.3 Ablation Study


#    Approach                                    WT      ARC
1    ExplanationLP (Best)                        61.37   40.21
     Structure
2    Grounding-Abstract Categories               58.33   35.13
3    Edge weights                                43.78   29.45
4    Node weights                                42.80   27.87
     Cohesion
5    Hypothesis-Abstract cohesion                38.71   30.37
6    Hypothesis-Grounding cohesion               59.33   38.73
7    Grounding-Abstract cohesion                 59.12   38.14
     Diversity
8    Abstract-Abstract diversity                 60.16   37.62
9    Grounding-Grounding diversity               60.44   37.71
     Relevance
10   Hypothesis-Abstract semantic similarity     55.38   35.49
11   Hypothesis-Abstract lexical relevance       54.68   36.01

Table 6: Ablation study, removing different components of ExplanationLP. The scores reported here are accuracy for answer selection on the WorldTree (WT) and ARC-Challenge (ARC) test-sets.

Figure 2: Change in accuracy of answer prediction across different models with increasing explanation length on the WorldTree dev-set. The red dashed line represents ExplanationLP + UR (k=30), the blue line represents BERTLarge + UR (k=10) and the green dotted line represents PathNet + UR (k=20).

In order to understand the contribution lent by different components, we choose the best setting (WorldTree: ExplanationLP + UR (k=30) and ARC: ExplanationLP + BM25 (k=40)) and drop different components to perform an ablation analysis. We retain the ensemble after removing each component. The results are summarised in Table 6.
(1) The grounding-abstract chains (#2) play a significant role, particularly in the reasoning mechanism on a challenging question set like ARC-Challenge.
(2) As observed in #3 and #4, removing node weights and edge weights leads to a dramatic drop in performance. This drop indicates that both are fundamental for the final prediction, highlighting the role of graph structure in explainable inference.
(3) The importance of cohesion varies across different types of facts. We observe that Hypothesis-Abstract cohesion (#5) is significantly more important than the others. We attribute this to the fact that without Hypothesis-Abstract cohesion, multi-hop inference can quickly go out of context.
(4) From the ablation analysis, we can see how lexical relevance and semantic similarity (#10, #11) complement each other towards the final prediction. For the WorldTree corpus, the relevance score has a higher parameter score translating into a higher impact, and vice-versa for ARC.
(5) Diversity plays a smaller role when compared to cohesion and relevance. The impact of diversity in ARC is higher than that in WorldTree.

Semantic Drift: To validate the performance across an increasing number of hops, we plot the accuracy against explanation length, as illustrated in Figure 2. As demonstrated in explanation regeneration (Valentino et al., 2021; Jansen and Ustalov, 2019), the complexity of a science question is directly correlated with the explanation length – i.e. the number of facts required in the gold explanation. Unlike BERT, PathNet and ExplanationLP use external background knowledge, addressing the multi-hop process in two main reasoning steps. However, in contrast to ExplanationLP, PathNet combines only two explanatory facts to answer a given question. This assumption has a negative impact on answering complex questions requiring long explanations. This is evident in the graph, where we observe a sharp decrease in accuracy with increasing explanation length. Comparatively, ExplanationLP achieves more stable performance, showing a lower degradation with an increasing number of explanation sentences. These results crucially demonstrate the positive impact of grounding-abstract mechanisms on semantic drift. We also exhibit consistently better performance when compared with BERT.

4 Related Work

Our approach broadly falls into Linear Programming based approaches for science question answering. LP-based approaches perform inference over either semi-structured tables (Khashabi et al., 2016) or structural representations extracted from text (Khashabi et al., 2018; Khot et al., 2017a). These approaches treat all facts homogeneously and attempt to connect the question with the correct answer through long hops. While they have exhibited good performance with no supervision, the performance tends to be lower when answering complex questions requiring long explanatory chains.


In contrast, our approach performs inference over unstructured text by imposing structural constraints via grounding-abstract chains, lowering the number of hops, and also combines parametric optimisation to extract the best performing model.

The other class of approaches that provide explanations are graph-based approaches. Graph-based approaches have been successfully applied to open-domain question answering (Fang et al., 2020; Qiu et al., 2019; Thayaparan et al., 2019) where the question requires only two hops. PathNet (Kundu et al., 2019) operates within the same design principles and has been applied to the OpenBookQA science dataset. As indicated in the empirical evaluation, it struggles with long-chain explanations since it relies on only two facts. Graph-based approaches have also been employed for mathematical reasoning (Ferreira and Freitas, 2020a,b) and textual entailment (Silva et al., 2019, 2018).

The third category of partially explainable approaches employs black-box neural models in combination with a retrieval approach. The SOTA model for science question answering (Khashabi et al., 2020) is pretrained across multiple datasets and is not explainable. The current partially explainable SOTA approach that does not rely on external resources (Yadav et al., 2019b) employs a large-parameter BERT model for question answering. In contrast, with a low number of parameters, we have introduced a model that demonstrates competitive performance and leaves a smaller carbon footprint in terms of energy consumption (Henderson et al., 2020). Other methods construct explanation chains by leveraging explanatory patterns emerging in a corpus of scientific explanations (Valentino et al., 2020, 2021).

5 Conclusion

This paper presented a robust, explainable and efficient science question answering model that performs abductive natural language inference. We also presented an in-depth systematic evaluation demonstrating the impact of the various design principles via an in-depth ablation analysis. Despite having a significantly lower number of parameters, we demonstrated competitive performance compared with contemporary explainable approaches, while also showcasing the model's robustness, explainability and interpretability.

Acknowledgements

The authors would like to thank the anonymous reviewers for the constructive feedback. A special thanks to Deborah Ferreira for the helpful discussions and comments. Additionally, we would like to thank the Computational Shared Facility of the University of Manchester for providing the infrastructure to run our experiments.

References

Or Biran and Courtenay Cotton. 2017. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), volume 8, pages 8–13.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, pages 2580–2586.

Peter Clark, Philip Harrison, and Niranjan Balasubramanian. 2013. A study of the knowledge base requirements for passing an elementary science test. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 37–42.

Dorottya Demszky, Kelvin Guu, and Percy Liang. 2018. Transforming question answering datasets into natural language inference datasets. arXiv preprint arXiv:1809.02922.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2020. Hierarchical graph network for multi-hop question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8823–8838.

Deborah Ferreira and Andre Freitas. 2020a. Natural language premise selection: Finding supporting statements for mathematical text. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2175–2182.


Deborah Ferreira and Andre Freitas. 2020b. Premise selection in natural language mathematical texts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7365–7374.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mihai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid answer reranking. Transactions of the Association for Computational Linguistics, 3:197–210.

Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. 2020. Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research, 21(248):1–43.

Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2956–2965.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Peter Clark. 2017. Framing QA as building and ranking intersentence answer justifications. Computational Linguistics, 43(2):407–449.

Peter Jansen and Dmitry Ustalov. 2019. TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 63–77.

Peter Jansen, Elizabeth Wainwright, Steven Marmorstein, and Clayton Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Peter A Jansen. 2018. Multi-hop inference for sentence-level TextGraphs: How challenging is meaningfully combining information for science question answering? NAACL HLT 2018, page 12.

Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2019. On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI International Joint Conference on Artificial Intelligence, volume 2016, pages 1145–1152.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. Conference of Association for the Advancement of Artificial Intelligence.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1896–1907.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017a. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017b. Answering complex questions using open information extraction. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2737–2747.

Peter Lipton. 2017. Inference to the best explanation. A Companion to the Philosophy of Science, pages 184–193.

Alexander A Loboda, Maxim N Artyomov, and Alexey A Sergushichev. 2016. Solving generalized maximum-weight connected subgraph problem for network enrichment analysis. In International Workshop on Algorithms in Bioinformatics, pages 210–221. Springer.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.

Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2019. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 335–344.

Charles Sanders Peirce. 1960. Collected Papers of Charles Sanders Peirce, volume 2. Harvard University Press.

Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6140–6150.


Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Vivian Dos Santos Silva, Siegfried Handschuh, and Andre Freitas. 2018. Recognizing and justifying text entailment through distributional navigation on definition graphs. In AAAI, pages 4913–4920.

Vivian S Silva, Andre Freitas, and Siegfried Handschuh. 2019. Exploring knowledge graphs in an interpretable composite approach for text entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7023–7030.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.

Mokanarangan Thayaparan, Marco Valentino, and Andre Freitas. 2020. A survey on explainability in machine reading comprehension. arXiv preprint arXiv:2010.00389.

Mokanarangan Thayaparan, Marco Valentino, Viktor Schlegel, and Andre Freitas. 2019. Identifying supporting facts for multi-hop question answering with document graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 42–51.

Marco Valentino, Mokanarangan Thayaparan, and Andre Freitas. 2020. Explainable natural language reasoning via conceptual unification. arXiv preprint arXiv:2009.14539.

Marco Valentino, Mokanarangan Thayaparan, and Andre Freitas. 2021. Unification-based reconstruction of multi-hop explanations for science questions. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 200–211. Association for Computational Linguistics.

Zhengnan Xie, Sebastian Thiem, Jaycie Martin, Elizabeth Wainwright, Steven Marmorstein, and Peter Jansen. 2020. WorldTree v2: A corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5456–5473, Marseille, France. European Language Resources Association.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019a. Alignment over heterogeneous embeddings for question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2681–2691.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019b. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2578–2589, Hong Kong, China. Association for Computational Linguistics.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. arXiv preprint arXiv:1805.12393.

A Supplementary Material

This section lists all the hyperparameters, code and libraries used in our approach. We present this in the hope that it fosters reproducibility.

A.1 Linear Programming Optimization

The components of the linear programming system are as follows:

• Solver: CPLEX Optimization Studio V12.9.0 (https://www.ibm.com/products/ilog-cplex-optimization-studio)

The hyperparameters used in the LP constraints:

• Maximum number of abstract facts (K): 2

• Average time per epoch: 6 minutes for train-set

• Number of Epochs: 200

Infrastructure used:

• CPU Cores: 32

• CPU Model: Intel(R) Core(TM) i7-6700 CPU@ 3.40GHz

• Memory: 128GB

• OS: Ubuntu 18.04 LTS


A.2 Parameter tuning

Our work employed Bayesian optimisation with a Gaussian Process for hyperparameter tuning. We used the Bayesian-Optimization Python library (https://github.com/fmfn/BayesianOptimization) to implement the code. These parameters are as follows (a kernel configuration sketch follows the list):

• Gaussian Kernels:

– RationalQuadratic Kernel with defaultparameters

– WhiteKernel with noise level of 1e-5,noise level bounds (1e-10, 1e1) and restof the default parameters

• Number of iterations: 200

• alpha (α): 1e-8

• random state: 1
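As an illustration, the kernel settings listed above correspond to the following scikit-learn configuration; whether the optimiser's internal Gaussian Process was wired up exactly this way is an assumption.

```python
# Possible scikit-learn encoding of the kernel settings listed above (assumption,
# not necessarily the exact wiring used in the paper). The resulting kernel and
# alpha would be supplied to the optimiser's internal Gaussian Process.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic, WhiteKernel

kernel = RationalQuadratic() + WhiteKernel(noise_level=1e-5,
                                           noise_level_bounds=(1e-10, 1e1))
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-8, random_state=1)
```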

A.3 Sentence-BERT for Semantic Similarity Scores

We use the roberta-large-nli-stsb-mean-tokens model to calculate the semantic similarity scores.
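For reference, a minimal sketch of how the similarity score can be computed with the sentence-transformers library (exact API details may vary across versions; the example sentences are taken from Figure 1):

```python
# Minimal semantic-similarity sketch with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")
hypothesis = ("Two sticks getting warm when rubbed together "
              "is an example of force producing heat")
fact = "friction causes the temperature of an object to increase"
emb = model.encode([hypothesis, fact], convert_to_tensor=True)
semantic_similarity = util.pytorch_cos_sim(emb[0], emb[1]).item()  # score S
```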

A.4 BERT model

The BERT model was taken from the Huggingface Transformers library (https://github.com/huggingface/transformers) and fine-tuned using 4 Tesla V100 GPUs for 10 epochs in total, with batch size 16 for BERTLarge and 32 for BERTBase. The hyperparameters adopted for BERT are as follows (an illustrative mapping onto Huggingface TrainingArguments follows the list):

• gradient accumulation steps: 1

• learning rate: 1e-5

• weight decay: 0.0

• adam epsilon: 1e-8

• warmup steps: 0

• max grad norm: 1.0

• seed: 42
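As an illustration, the listed values map onto Huggingface TrainingArguments roughly as follows; this is a hedged sketch, not the exact training script used in the paper, and the output directory name is a placeholder.

```python
# Hypothetical mapping of the listed hyperparameters onto Huggingface
# TrainingArguments; the actual fine-tuning script may differ.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert_qa_output",       # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=16,    # 32 for BERT-Base
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=0,
    max_grad_norm=1.0,
    seed=42,
)
```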

A.5 PathNet

We use the code and dependencies provided by the PathNet GitHub repository (https://github.com/allenai/PathNet). We used the training config provided for OpenBookQA as a baseline (file: blob/master/training_configs/config_obqa.json in the same repository).

A.6 Relevant facts retrieval

The code for the BM25 and Unification retrieval approaches was adopted from the Unification Explanation Retrieval GitHub repository (https://github.com/ai-systems/unification_reconstruction_explanations).

A.7 Code

The code for reproducing ExplanationLP and the experiments described in this paper is attached with the code appendix and will be available at the following GitHub repository (with a Dockerized container): https://github.com/ai-systems/explanationlp.

A.8 Data

WorldTree Dataset: The 2,290 questions in the WorldTree corpus are split into three different subsets: train-set (987), dev-set (226), and test-set (1,077). We only considered questions with explanations for our evaluation. The reasoning behind omitting questions without explanations was to ensure fact coverage for all questions. For building the Abstract KB we excluded facts from the 'KINDOF' and 'SYNONYMY' tables, as these are primarily composed of grounding facts.

ARC-Challenge Dataset: We only used the Challenge split: https://allenai.org/data/arc.