
Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering

Yanlin Feng♣* Xinyue Chen♠* Bill Yuchen Lin♥ Peifeng Wang♥ Jun Yan♥ Xiang Ren♥

fengyanlin@pku.edu.cn, xinyuech@andrew.cmu.edu, {yuchen.lin, peifengw, yanjun, xiangren}@usc.edu

♥University of Southern California  ♣Peking University  ♠Carnegie Mellon University

Abstract

Existing work on augmenting question answering (QA) models with external knowledge (e.g., knowledge graphs) either struggles to model multi-hop relations efficiently, or lacks transparency into the model's prediction rationale. In this paper, we propose a novel knowledge-aware approach that equips pre-trained language models (PTLMs) with a multi-hop relational reasoning module, named multi-hop graph relation network (MHGRN). It performs multi-hop, multi-relational reasoning over subgraphs extracted from external knowledge graphs. The proposed reasoning module unifies path-based reasoning methods and graph neural networks to achieve better interpretability and scalability. We also empirically show its effectiveness and scalability on the CommonsenseQA and OpenbookQA datasets, and interpret its behaviors with case studies.¹

1 Introduction

Many recently proposed question answering tasks require not only machine comprehension of the question and context, but also relational reasoning over entities (concepts) and their relationships by referencing external knowledge (Talmor et al., 2019; Sap et al., 2019; Clark et al., 2018; Mihaylov et al., 2018). For example, the question in Fig. 1 requires a model to perform relational reasoning over the mentioned entities, i.e., to infer latent relations among the concepts {CHILD, SIT, DESK, SCHOOLROOM}. Background knowledge such as "a child is likely to appear in a schoolroom" may not be readily contained in the questions themselves, but is commonsensical to humans.

Despite the success of large-scale pre-trained language models (PTLMs) (Devlin et al., 2019; Liu et al., 2019b), these models fall short of providing interpretable predictions, as the knowledge in their pre-training corpus is not explicitly stated, but rather is implicitly learned. It is thus difficult to recover the evidence used in the reasoning process.

* The first two authors contributed equally. The major work was done when both authors interned at USC.
¹ https://github.com/INK-USC/MHGRN

Figure 1: Illustration of knowledge-aware QA. A sample question from CommonsenseQA ("Where does a child likely sit at a desk?" A. Schoolroom ⋆ B. Furniture store C. Patio D. Office building E. Library) can be better answered if a relevant subgraph of ConceptNet is provided as evidence. Blue nodes correspond to entities mentioned in the question; pink nodes correspond to those in the answer. The other nodes are associated entities introduced when extracting the subgraph. ⋆ indicates the correct answer.

This has led many to leverage knowledge graphs (KGs) (Mihaylov and Frank, 2018; Lin et al., 2019; Wang et al., 2019; Yang et al., 2019). KGs represent relational knowledge between entities with multi-relational edges for models to acquire. Incorporating KGs brings the potential of interpretable and trustworthy predictions, as the knowledge is now explicitly stated. For example, in Fig. 1, the relational path (CHILD → AtLocation → CLASSROOM → Synonym → SCHOOLROOM) naturally provides evidence for the answer SCHOOLROOM.


Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on CommonsenseQA. Left: the path count is polynomial w.r.t. the number of nodes. Right: the path count is exponential w.r.t. the number of hops.

A straightforward approach to leveraging a knowledge graph is to directly model these relational paths. KagNet (Lin et al., 2019) and MHPGM (Bauer et al., 2018) model multi-hop relations by extracting relational paths from the KG and encoding them with sequence models. Application of attention mechanisms upon these relational paths can further offer good interpretability. However, these models are hardly scalable, because the number of possible paths in a graph is (1) polynomial w.r.t. the number of nodes and (2) exponential w.r.t. the path length (see Fig. 2). Therefore, some (Weissenborn et al., 2017; Mihaylov and Frank, 2018) resort to only using one-hop paths, namely triples, to balance scalability and reasoning capacities.

Graph neural networks (GNNs), in contrast, enjoy better scalability via their message passing formulation, but usually lack transparency. The most commonly used GNN variant, Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017), performs message passing by aggregating neighborhood information for each node, but ignores the relation types. RGCNs (Schlichtkrull et al., 2018) generalize GCNs by performing relation-specific aggregation, making them applicable to multi-relational graphs. However, these models do not distinguish the importance of different neighbors or relation types, and thus cannot provide explicit relational paths for model behavior interpretation.

In this paper, we propose a novel graph encoding architecture, Multi-hop Graph Relation Network (MHGRN), which combines the strengths of path-based models and GNNs. Our model inherits scalability from GNNs by preserving the message passing formulation. It also enjoys the interpretability of path-based models by incorporating a structured relational attention mechanism. Our key motivation is to perform multi-hop message passing within a single layer to allow each node to directly attend to its multi-hop neighbours, towards multi-hop relational reasoning. We outline the favorable features of knowledge-aware QA models in Table 1 and compare MHGRN with them.

                            GCN   RGCN   KagNet   MHGRN
Multi-Relational Encoding    ✗     ✓       ✓       ✓
Interpretable                ✗     ✗       ✓       ✓
Scalable w.r.t. #node        ✓     ✓       ✗       ✓
Scalable w.r.t. #hop         ✓     ✓       ✗       ✓

Table 1: Properties of our MHGRN and other representative models for graph encoding.

We summarize the main contributions of this work as follows: 1) We propose MHGRN, a novel model architecture tailored to multi-hop relational reasoning, which explicitly models multi-hop relational paths at scale. 2) We propose a structured relational attention mechanism for efficient and interpretable modeling of multi-hop reasoning paths, along with its training and inference algorithms. 3) We conduct extensive experiments on two question answering datasets and show that our models bring significant improvements compared to knowledge-agnostic PTLMs, and outperform other graph encoding methods by a large margin.

2 Problem Formulation and Overview

In this paper, we limit the scope to the task of multiple-choice question answering, although it can be easily generalized to other knowledge-guided tasks (e.g., natural language inference). The overall paradigm of knowledge-aware QA is illustrated in Fig. 3. Formally, given an external knowledge graph (KG) as the knowledge source and a question q, our goal is to identify the correct answer from a set C of given options. We turn this problem into measuring the plausibility score between q and each option a ∈ C, then selecting the one with the highest plausibility score.

Figure 3: Overview of the knowledge-aware QA framework. It integrates the output from the graph encoder (for relational reasoning over contextual subgraphs) and the text encoder (for textual understanding) to generate the plausibility score for an answer option.

Denote the vector representations of question q and option a as q and a. To measure the score for q and a, we first concatenate them to form a statement vector s = [q; a]. Then we extract from the external KG a subgraph G (i.e., schema graph in KagNet (Lin et al., 2019)) with the guidance of s (detailed in §5.1). This contextualized subgraph is defined as a multi-relational graph G = (V, E, φ). Here V is a subset of the entities in the external KG, containing only those relevant to s. E ⊆ V × R × V is the set of edges that connect nodes in V, where R = {1, …, m} are the ids of all pre-defined relation types. The mapping function φ(i): V → T = {E_q, E_a, E_o} takes node i ∈ V as input and outputs E_q if i is an entity mentioned in q, E_a if it is mentioned in a, or E_o otherwise. We finally encode the statement to s, G to g, and concatenate s and g for calculating the plausibility score.
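To make this paradigm concrete, the following is a minimal sketch of the option-scoring loop; text_encoder, graph_encoder, extract_subgraph, and mlp are hypothetical stand-ins for the modules described in this paper, not the authors' actual interfaces.

```python
# A minimal sketch of the knowledge-aware QA paradigm (Fig. 3); all component
# names are illustrative placeholders, not the released implementation.
import torch

def score_option(question, option, text_encoder, graph_encoder, extract_subgraph, mlp):
    s = text_encoder(question + " " + option)     # statement vector s = [q; a]
    G = extract_subgraph(question, option)         # contextualized subgraph (Sec. 5.1)
    g = graph_encoder(G, s)                        # graph representation g
    return mlp(torch.cat([s, g], dim=-1))          # plausibility score rho(q, a)

def predict(question, options, **modules):
    scores = [score_option(question, a, **modules).item() for a in options]
    return max(range(len(options)), key=lambda i: scores[i])   # argmax over a in C
```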

3 Background: Multi-Relational Graph Encoding Methods

We leave the encoding of s to pre-trained language models and focus on the challenge of encoding graph G to capture latent relations between entities. Current methods for encoding multi-relational graphs mainly fall into two categories: GNNs and path-based models. GNNs encode structured information by passing messages between nodes, directly operating on the graph structure, while path-based methods first decompose the graph into paths and then pool features over them.

Graph Encoding with GNNs. For a graph with n nodes, a graph neural network (GNN) takes a set of node features {h_1, h_2, …, h_n} as input and computes their corresponding node embeddings {h′_1, h′_2, …, h′_n} via message passing (Gilmer et al., 2017). A compact graph representation for G can thus be obtained by pooling over the node embeddings {h′_i}:

\mathrm{GNN}(\mathcal{G}) = \mathrm{Pool}(\{h'_1, h'_2, \ldots, h'_n\})    (1)

As a notable variant of GNNs, graph convolutional networks (GCNs) (Kipf and Welling, 2017) additionally update node embeddings by aggregating messages from their direct neighbors. RGCNs (Schlichtkrull et al., 2018) extend GCNs to encode multi-relational graphs by defining a relation-specific weight matrix W_r for each edge type:

h'_i = \sigma\Big( \big( \sum_{r\in\mathcal{R}} |N_i^r| \big)^{-1} \sum_{r\in\mathcal{R}} \sum_{j\in N_i^r} W_r h_j \Big)    (2)

where N_i^r denotes the neighbors of node i under relation r.²

² For simplicity, we assume a single graph convolutional layer. In practice, multiple layers are stacked to enable message passing from multi-hop neighbors.
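To make Eq. 2 concrete, below is a minimal PyTorch sketch of one RGCN-style aggregation step over dense per-relation adjacency matrices; the tensor layout and variable names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the relation-specific aggregation in Eq. 2 (illustrative only).
import torch

def rgcn_layer(H, A, W):
    """H: (n, d) node features.
    A: (m, n, n) dense adjacency tensors, A[r, i, j] = 1 iff (j, r, i) is an edge.
    W: (m, d, d) relation-specific weight matrices W_r."""
    # Normalization (sum_r |N_i^r|)^{-1}, clamped to avoid division by zero
    degree = A.sum(dim=(0, 2)).clamp(min=1).unsqueeze(-1)        # (n, 1)
    transformed = torch.einsum('rde,je->rjd', W, H)              # W_r h_j for all r, j
    messages = torch.einsum('rij,rjd->id', A, transformed)       # sum over r and j in N_i^r
    return torch.relu(messages / degree)                         # sigma as ReLU
```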

While GNNs have proven to have good scalability, their reasoning is done at the node level, making them incompatible with modeling paths. This property also hinders the model's decisions from being interpretable at the path level.

Graph Encoding with Path-Based Models. In addition to directly modeling the graph with GNNs, a graph can also be viewed as a set of relational paths connecting pairs of entities.

Relation Networks (RNs) (Santoro et al., 2017) can be adapted to multi-relational graph encoding under QA settings. RNs use MLPs to encode all triples (one-hop paths) in G whose head entity is in Q = {j | φ(j) = E_q} and tail entity is in A = {i | φ(i) = E_a}. It then pools the triple embeddings to generate a vector for G as follows:

\mathrm{RN}(\mathcal{G}) = \mathrm{Pool}(\{\mathrm{MLP}(h_j \oplus e_r \oplus h_i) \mid j \in \mathcal{Q},\ i \in \mathcal{A},\ (j, r, i) \in \mathcal{E}\})    (3)

Here h_j and h_i are features for nodes j and i; e_r is the embedding of relation r ∈ R; ⊕ denotes vector concatenation.
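A small sketch of Eq. 3 follows, with mean pooling as the Pool(·) operator; the MLP, the embedding tables, and the entity-set arguments are hypothetical placeholders.

```python
# A small sketch of the RN encoder in Eq. 3 (mean pooling as Pool; illustrative only).
import torch

def relation_network(triples, node_feat, rel_emb, mlp, Q, A):
    """triples: iterable of (j, r, i) edges; node_feat: (n, d); rel_emb: (m + 1, d_r);
    Q, A: sets of question-entity and answer-entity indices."""
    encoded = [mlp(torch.cat([node_feat[j], rel_emb[r], node_feat[i]]))
               for (j, r, i) in triples if j in Q and i in A]
    return torch.stack(encoded).mean(dim=0)    # pooled graph vector for G
```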

To further equip RN with the ability to model nondegenerate paths, KagNet (Lin et al., 2019) adopts LSTMs to encode all paths connecting question entities and answer entities with lengths no more than K. It then aggregates all path embeddings via an attention mechanism:

\mathrm{KAGNET}(\mathcal{G}) = \mathrm{Pool}(\{\mathrm{LSTM}(j, r_1, \ldots, r_k, i) \mid (j, r_1, j_1), \ldots, (j_{k-1}, r_k, i) \in \mathcal{E},\ 1 \le k \le K\})    (4)

4 Proposed Method: Multi-Hop Graph Relation Network (MHGRN)

This section presents Multi-hop Graph Relation Network (MHGRN), a novel GNN architecture that unifies both GNNs and path-based models. MHGRN inherits path-level reasoning and interpretability from path-based models, while preserving the good scalability of GNNs.

Figure 4: Our proposed MHGRN architecture for relational reasoning. MHGRN takes a multi-relational graph G and a (question-answer) statement vector s as input, and outputs a scalar that represents the plausibility score of this statement.

4.1 MHGRN Model Architecture

We follow the GNN framework introduced in §3, where node features can be initialized with pre-trained weights (details in Appendix C). Here we focus on the computation of node embeddings.

Type-Specific Transformation. To make our model aware of the node type φ, we first perform a node-type-specific linear transformation on the input node features:

x_i = U_{\phi(i)} h_i + b_{\phi(i)}    (5)

where the learnable parameters U and b are specific to the type of node i.

Multi-Hop Message Passing. As mentioned before, our motivation is to endow GNNs with the capability of directly modeling paths. To this end, we propose to pass messages directly over all the relational paths of lengths up to K. The set of valid k-hop relational paths is defined as

\Phi_k = \{(j, r_1, \ldots, r_k, i) \mid (j, r_1, j_1), \ldots, (j_{k-1}, r_k, i) \in \mathcal{E}\} \quad (1 \le k \le K)    (6)

We perform k-hop (1 ≤ k ≤ K) message passing over these paths, which is a generalization of the single-hop message passing in RGCNs (see Eq. 2):

z_i^k = \sum_{(j, r_1, \ldots, r_k, i) \in \Phi_k} \frac{\alpha(j, r_1, \ldots, r_k, i)}{d_i^k} \cdot W_0^K \cdots W_0^{k+1} W_{r_k}^k \cdots W_{r_1}^1 x_j \quad (1 \le k \le K)    (7)

where the W_r^t (1 ≤ t ≤ K, 0 ≤ r ≤ m) matrices are learnable³, α(j, r_1, …, r_k, i) is an attention score elaborated in §4.2, and d_i^k = Σ_{(j,…,i)∈Φ_k} α(j, …, i) is the normalization factor. The matrices {W_{r_k}^k ⋯ W_{r_1}^1 | 1 ≤ r_1, …, r_k ≤ m} can be interpreted as a low-rank approximation of an {m × ⋯ × m (k times)} × d × d tensor that assigns a separate transformation to each k-hop relation, where d is the dimension of x_i.
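The following deliberately naive sketch enumerates the paths in Φ_k explicitly to make Eq. 7 concrete; it runs in time exponential in k and is only for illustration (the practical linear-time formulation is given in Appendix D). The edge index, the attention callable, and the nested weight table are assumptions of this sketch.

```python
# A naive, path-enumerating sketch of Eq. 7 (exponential in k; illustration only).

def enumerate_paths(i, k, in_edges):
    """Yield (j, [r_1, ..., r_k]) for every k-hop relational path ending at node i.
    in_edges[i] is a list of (source, relation) pairs for edges (source, relation, i)."""
    if k == 0:
        yield (i, [])
        return
    for (j1, r_k) in in_edges.get(i, []):
        for (j, rels) in enumerate_paths(j1, k - 1, in_edges):
            yield (j, rels + [r_k])

def k_hop_message(i, k, K, in_edges, x, W, alpha):
    """x: (n, d) type-transformed features (Eq. 5); W[t][r] is the (d, d) matrix W_r^t;
    alpha(j, rels, i) returns the attention score of the path. Returns z_i^k."""
    z, norm = 0.0, 0.0
    for (j, rels) in enumerate_paths(i, k, in_edges):
        msg = x[j]
        for t, r in enumerate(rels, start=1):       # apply W_{r_1}^1 ... W_{r_k}^k
            msg = W[t][r] @ msg
        for t in range(k + 1, K + 1):                # padding matrices W_0^{k+1} ... W_0^K
            msg = W[t][0] @ msg
        a = alpha(j, rels, i)
        z, norm = z + a * msg, norm + a
    return z / norm                                  # normalized by d_i^k
```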

Incoming messages from paths of different lengths are aggregated via an attention mechanism (Vaswani et al., 2017):

z_i = \sum_{k=1}^{K} \mathrm{softmax}(\mathrm{bilinear}(s, z_i^k)) \cdot z_i^k    (8)

Non-linear Activation. Finally, we apply a shortcut connection and a nonlinear activation to obtain the output node embeddings:

h'_i = \sigma(V h_i + V' z_i)    (9)

where V and V′ are learnable model parameters, and σ is a non-linear activation function.
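Assuming the k-hop messages z_i^k of Eq. 7 have already been computed, the per-node aggregation of Eqs. 8-9 reduces to a few lines; the bilinear parameterization and tensor shapes below are illustrative assumptions.

```python
# A minimal sketch of Eqs. 8-9, given precomputed k-hop messages (illustrative only).
import torch
import torch.nn.functional as F

def mhgrn_node_update(h, z, s, W_att, V, Vp):
    """h: (n, d) input node features; z: (n, K, d) k-hop messages z_i^k;
    s: (ds,) statement vector; W_att: (ds, d) bilinear attention weights."""
    scores = torch.einsum('e,ed,nkd->nk', s, W_att, z)   # bilinear(s, z_i^k)
    attn = F.softmax(scores, dim=-1)                      # attention over path lengths k
    z_agg = (attn.unsqueeze(-1) * z).sum(dim=1)           # Eq. 8
    return torch.relu(h @ V.T + z_agg @ Vp.T)             # Eq. 9 (shortcut + activation)
```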

4.2 Structured Relational Attention

Here we work towards effectively parameterizing the attention score α(j, r_1, …, r_k, i) in Eq. 7 for all k-hop paths without introducing O(m^K) parameters. We first regard it as the probability of a relation sequence (φ(j), r_1, …, r_k, φ(i)) conditioned on s:

\alpha(j, r_1, \ldots, r_k, i) = p(\phi(j), r_1, \ldots, r_k, \phi(i) \mid s)    (10)

which can naturally be modeled by a probabilistic graphical model, such as a conditional random field (Lafferty et al., 2001):

p(\cdots \mid s) \propto \exp\Big( f(\phi(j), s) + \sum_{t=1}^{k} \delta(r_t, s) + \sum_{t=1}^{k-1} \tau(r_t, r_{t+1}) + g(\phi(i), s) \Big) \triangleq \underbrace{\beta(r_1, \ldots, r_k, s)}_{\text{Relation Type Attention}} \cdot \underbrace{\gamma(\phi(j), \phi(i), s)}_{\text{Node Type Attention}}    (11)

³ W_0^t (0 ≤ t ≤ K) are introduced as padding matrices so that K transformations are applied regardless of k, thus ensuring a comparable scale of z_i^k across different k.

Model           Time                Space

G is a dense graph
K-hop KagNet    O(m^K n^{K+1} K)    O(m^K n^{K+1} K)
K-layer RGCN    O(m n^2 K)          O(m n K)
MHGRN           O(m^2 n^2 K)        O(m n K)

G is a sparse graph with maximum node degree Δ ≪ n
K-hop KagNet    O(m^K n Δ^K K)      O(m^K n Δ^K K)
K-layer RGCN    O(m n K Δ)          O(m n K)
MHGRN           O(m^2 n K Δ)        O(m n K)

Table 2: Computation complexity of different K-hop reasoning models on a dense/sparse multi-relational graph with n nodes and m relation types. Despite the quadratic complexity w.r.t. m, MHGRN's time cost is similar to RGCN's on GPUs with parallelizable matrix multiplications (cf. Fig. 7).

In Eq. 11, f(·), δ(·) and g(·) are parameterized by two-layer MLPs, and τ(·) by a transition matrix of shape m × m. Intuitively, β(·) models the importance of a k-hop relation, while γ(·) models the importance of messages from node type φ(j) to φ(i) (e.g., the model can learn to pass messages only from question entities to answer entities).

Our model scores a k-hop relation by decomposing it into both context-aware single-hop relations (modeled by δ) and two-hop relations (modeled by τ). We argue that τ is indispensable; without it, the model may assign high importance to illogical multi-hop relations (e.g., [AtLocation, CapableOf]) or noisy relations (e.g., [RelatedTo, RelatedTo]).
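To make the factorization of Eqs. 10-11 concrete, the sketch below computes the unnormalized potential of one k-hop path from precomputed f, δ, τ, and g values; normalizing over all paths in Φ_k would then yield α. Plain lookup tables stand in here for the MLP-parameterized, statement-conditioned potentials used in the paper.

```python
# A small sketch of the structured relational attention potential in Eq. 11
# (unnormalized; lookup tables stand in for the MLP-parameterized potentials).
import math

def path_potential(type_j, rels, type_i, f, delta, tau, g):
    """rels: [r_1, ..., r_k]; f/g: dict over node types; delta: dict over relations;
    tau: dict of dicts over relation pairs. Returns exp(f + sum delta + sum tau + g)."""
    score = f[type_j] + g[type_i]
    score += sum(delta[r] for r in rels)                          # single-hop relation terms
    score += sum(tau[r1][r2] for r1, r2 in zip(rels, rels[1:]))   # relation-transition terms
    return math.exp(score)
```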

4.3 Computation Complexity Analysis

Although the message passing process in Eq. 7 and the attention module in Eq. 11 handle a potentially exponential number of paths, they can be computed in linear time with dynamic programming (see Appendix D). As summarized in Table 2, the time complexity and space complexity of MHGRN on a sparse graph are both linear w.r.t. either the maximum path length K or the number of nodes n.

4.4 Expressive Power of MHGRN

In addition to efficiency and scalability, we now discuss the modeling capacity of MHGRN. With the message passing formulation and relation-specific transformations, it is by nature a generalization of RGCN. It is also capable of directly modeling paths, making it interpretable as are path-based models like RN and KagNet. To show this, we first generalize RN (Eq. 3) to the multi-hop setting and introduce K-hop RN (formal definition in Appendix E), which models a multi-hop relation as the composition of single-hop relations. We show that MHGRN is capable of representing K-hop RN (proof in Appendix F).

4.5 Learning, Inference and Path Decoding

We now discuss the learning and inference process of MHGRN instantiated for QA tasks. Following the problem formulation in §2, we aim to determine the plausibility of an answer option a ∈ C given the question q, with the information from both text s and graph G. We first obtain the graph representation g by performing attentive pooling over the output node embeddings of answer entities {h′_i | i ∈ A}. Next, we concatenate it with the text representation s and compute the plausibility score by ρ(q, a) = MLP(s ⊕ g).
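A small sketch of this readout step is given below (attentive pooling over answer-entity embeddings, then an MLP over s ⊕ g); the bilinear pooling parameterization is an assumption of the sketch.

```python
# A small sketch of the readout: attentive pooling over answer entities, then MLP(s ⊕ g).
import torch
import torch.nn.functional as F

def plausibility_score(h_out, answer_idx, s, W_pool, mlp):
    """h_out: (n, d) output node embeddings; answer_idx: LongTensor of indices in A;
    s: (ds,) statement vector; W_pool: (ds, d) pooling weights."""
    h_ans = h_out[answer_idx]                                    # {h'_i | i in A}
    attn = F.softmax(torch.einsum('e,ed,ad->a', s, W_pool, h_ans), dim=0)
    g = attn @ h_ans                                             # graph representation g
    return mlp(torch.cat([s, g], dim=-1))                        # rho(q, a) = MLP(s ⊕ g)
```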

During training, we maximize the plausibility score of the correct answer â by minimizing the cross-entropy loss:

\mathcal{L} = \mathbb{E}_{q, \hat{a}, C}\Big[ -\log \frac{\exp(\rho(q, \hat{a}))}{\sum_{a \in C} \exp(\rho(q, a))} \Big]    (12)

The whole model is trained end-to-end, jointly with the text encoder (e.g., RoBERTa).
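Eq. 12 is the standard softmax cross-entropy over the candidate set; treating the per-option plausibility scores as logits, it reduces to a single call in PyTorch (a sketch under that assumption):

```python
# Eq. 12 as softmax cross-entropy over the option scores (batched sketch).
import torch.nn.functional as F

def qa_loss(option_scores, correct_index):
    """option_scores: (batch, |C|) tensor of rho(q, a); correct_index: (batch,) labels."""
    return F.cross_entropy(option_scores, correct_index)
```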

During inference, we predict the most plausible answer by argmax_{a∈C} ρ(q, a). Additionally, we can decode a reasoning path as evidence for model predictions, endowing our model with the interpretability enjoyed by path-based models. Specifically, we first determine the answer entity i* with the highest score in the pooling layer, and the path length k* with the highest score in Eq. 8. Then the reasoning path is decoded by argmax α(j, r_1, …, r_{k*}, i*), which can be computed in linear time using dynamic programming.

5 Experimental Setup

We introduce how we construct G (§5.1), the datasets (§5.2), as well as the baseline methods (§5.3). Appendix C shows more implementation and experimental details for reproducibility.

Methods                            BERT-Base                      BERT-Large                     RoBERTa-Large
                                   IHdev-Acc (%)   IHtest-Acc (%)  IHdev-Acc (%)   IHtest-Acc (%)  IHdev-Acc (%)   IHtest-Acc (%)
w/o KG                             57.31 (±1.07)   53.47 (±0.87)   61.06 (±0.85)   55.39 (±0.40)   73.07 (±0.45)   68.69 (±0.56)
RGCN (Schlichtkrull et al., 2018)  56.94 (±0.38)   54.50 (±0.56)   62.98 (±0.82)   57.13 (±0.36)   72.69 (±0.19)   68.41 (±0.66)
GconAttn (Wang et al., 2019)       57.27 (±0.70)   54.84 (±0.88)   63.17 (±0.18)   57.36 (±0.90)   72.61 (±0.39)   68.59 (±0.96)
KagNet† (Lin et al., 2019)         55.57           56.19           62.35           57.16           -               -
RN (1-hop)                         58.27 (±0.22)   56.20 (±0.45)   63.04 (±0.58)   58.46 (±0.71)   74.57 (±0.91)   69.08 (±0.21)
RN (2-hop)                         59.81 (±0.76)   56.61 (±0.68)   63.36 (±0.26)   58.92 (±0.14)   73.65 (±3.09)   69.59 (±3.80)
MHGRN                              60.36 (±0.23)   57.23 (±0.82)   63.29 (±0.51)   60.59 (±0.58)   74.45 (±0.10)   71.11 (±0.81)

Table 3: Performance comparison on CommonsenseQA in-house split. We report in-house Dev (IHdev) and Test (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al. (2019) on CommonsenseQA. † indicates reported results in its paper.

Methods                                  Single  Ensemble
UnifiedQA† (Khashabi et al., 2020)       79.1    -
RoBERTa†                                 72.1    72.5
RoBERTa + KEDGN†                         72.5    74.4
RoBERTa + KE†                            73.3    -
RoBERTa + HyKAS 2.0† (Ma et al., 2019)   73.2    -
RoBERTa + FreeLB† (Zhu et al., 2020)     72.2    73.1
XLNet + DREAM†                           66.9    73.3
XLNet + GR† (Lv et al., 2019)            75.3    -
ALBERT† (Lan et al., 2019)               -       76.5
RoBERTa + MHGRN (K = 2)                  75.4    76.5

Table 4: Performance comparison on the official test set of CommonsenseQA with leaderboard SoTAs⁴ (accuracy in %). † indicates reported results on the leaderboard. UnifiedQA uses T5-11B as its text encoder, whose number of parameters is about 30 times more than the other models.

⁴ Models based on ConceptNet are no longer shown on the leaderboard, and we got our results from the organizers.

5.1 Extracting G from External KG

We use ConceptNet (Speer et al., 2017), a general-domain knowledge graph, as our external KG to test models' ability to harness a structured knowledge source. Following KagNet (Lin et al., 2019), we merge relation types to increase graph density and add reverse relations to construct a multi-relational graph with 34 relation types (details in Appendix A). To extract an informative contextualized graph G from the KG, we recognize entity mentions in s and link them to entities in ConceptNet, with which we initialize our node set V. We then add to V all the entities that appear in any two-hop path between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning, but instead reserve all the edges between nodes in V, forming our G.
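A hedged sketch of this extraction procedure is shown below, using networkx as a stand-in for a graph view of ConceptNet; link_entities and the graph object are hypothetical placeholders, and no pruning is applied, mirroring the description above.

```python
# A hedged sketch of subgraph extraction (networkx as a stand-in for ConceptNet).
import itertools
import networkx as nx

def extract_subgraph(statement, conceptnet: nx.MultiDiGraph, link_entities):
    """Ground mentioned entities, add bridge entities on any <=2-hop path between
    pairs of mentioned entities, and keep all edges among the resulting node set."""
    nodes = set(link_entities(statement))            # entity mentions linked to ConceptNet
    for u, v in itertools.permutations(nodes, 2):
        if u not in conceptnet or v not in conceptnet:
            continue
        for path in nx.all_simple_paths(conceptnet, u, v, cutoff=2):
            nodes.update(path)                       # intermediate nodes on two-hop paths
    return conceptnet.subgraph(nodes).copy()         # reserve all edges between nodes in V
```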

5.2 Datasets

We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenbookQA. Both require world knowledge beyond textual understanding to perform well.

CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet, and they are designed to probe latent compositional relations between entities in ConceptNet.

OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open book of science facts. This dataset also probes general common sense beyond the provided facts.

5.3 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate a KG as an external source as our baselines. Additionally, we directly compare our model with the results from the corresponding leaderboards. These methods typically leverage textual knowledge or extra training data, as opposed to an external KG. In all our implemented models, we use pre-trained LMs as text encoders for s for fair comparison. We do compare our models with those (Ma et al., 2019; Lv et al., 2019; Khashabi et al., 2020) augmented by other text-form external knowledge (e.g., Wikipedia), although we stick to our focus of encoding structured KG.

Specifically, we fine-tune BERT-BASE, BERT-LARGE (Devlin et al., 2019), and ROBERTA (Liu et al., 2019b) for multiple-choice questions. We take RGCN (Eq. 2 in §3), RN⁵ (Eq. 3 in §3), KagNet (Eq. 4 in §3), and GconAttn (Wang et al., 2019) as baselines. GconAttn generalizes match-LSTM (Wang and Jiang, 2016) and achieves success in language inference tasks.

6 Results and Discussions

In this section, we present the results of our models in comparison with baselines, as well as methods on the leaderboards, for both CommonsenseQA and OpenbookQA. We also provide analysis of the models' components and characteristics.

⁵ We use mean pooling for 1-hop RN and attentive pooling for 2-hop RN (detailed in Appendix E).

Methods                              Dev (%)         Test (%)
T5-3B† (Raffel et al., 2019)         -               83.20
UnifiedQA† (Khashabi et al., 2020)   -               87.20
RoBERTa-Large (w/o KG)               66.76 (±1.14)   64.80 (±2.37)
  + RGCN                             64.65 (±1.96)   62.45 (±1.57)
  + GconAttn                         64.30 (±0.99)   61.90 (±2.44)
  + RN (1-hop)                       64.85 (±1.11)   63.65 (±2.31)
  + RN (2-hop)                       67.00 (±0.71)   65.20 (±1.18)
  + MHGRN (K = 3)                    68.10 (±1.02)   66.85 (±1.19)
AristoRoBERTaV7†                     79.2            77.8
  + MHGRN (K = 3)                    78.6            80.6

Table 5: Performance comparison on OpenbookQA. † indicates reported results on leaderboard. T5-3B is 8 times larger than our models. UnifiedQA is 30x larger.

6.1 Main Results

For CommonsenseQA (Table 3), we first use the in-house data split (IH) of Lin et al. (2019) (cf. Appendix B) to compare our models with implemented baselines. This is different from the official split used in the leaderboard methods. Almost all KG-augmented models achieve performance gains over vanilla pre-trained LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with the text encoder being ROBERTA-LARGE) on the official split (OF, Table 4) for fair comparison with other methods on the leaderboard, in both the single-model setting and the ensemble-model setting. With T5 as its backbone (Raffel et al., 2019), UnifiedQA (Khashabi et al., 2020) tops the leaderboard. Considering its training cost, we do not intend to compare our ROBERTA-based model with it. We achieve the best performance among the other models.

For OpenbookQA (Table 5), we use the official split and build models with ROBERTA-LARGE as the text encoder. MHGRN surpasses all implemented baselines, with an absolute increase of ~2% on Test. Also, our approach is naturally compatible with methods that utilize textual knowledge or extra data, because in our paradigm the encoding of the textual statement and the graph are structurally decoupled (Fig. 3). To empirically show that MHGRN can bring gains over textual-knowledge empowered systems, we replace our text encoder with AristoRoBERTaV7⁶ and fine-tune our MHGRN upon OpenbookQA. Empirically, MHGRN continues to bring benefit to strong-performing textual-knowledge empowered systems. This indicates that textual knowledge and structured knowledge can potentially be complementary.

⁶ https://leaderboard.allenai.org/open_book_qa/submission/blcp1tu91i4gm0vf484g

Methods                                     IHdev-Acc (%)
MHGRN (K = 3)                               74.45 (±0.10)
  - Type-specific transformation (§4.1)     73.16 (±0.28)
  - Structured relational attention (§4.2)  73.26 (±0.31)
  - Relation type attention (§4.2)          73.55 (±0.68)
  - Node type attention (§4.2)              73.92 (±0.65)

Table 6: Ablation study on model components (removing one component each time) using ROBERTA-LARGE as the text encoder. We report the IHdev accuracy on CommonsenseQA.

6.2 Performance Analysis

Ablation Study on Model Components. As shown in Table 6, disabling the type-specific transformation results in a ~1.3% drop in performance, demonstrating the need for distinguishing node types in QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.

Impact of the Amount of Training Data. We use different fractions of the training data of CommonsenseQA and report the results of fine-tuning text encoders alone and of jointly training the text encoder and graph encoder in Fig. 5. Regardless of the training data fraction, our model shows consistently more performance improvement over knowledge-agnostic fine-tuning compared with the other graph encoding methods, indicating MHGRN's complementary strengths to text encoders.

Figure 5: Performance change (accuracy in %) w.r.t. the amounts of training data on the CommonsenseQA IHtest set (same as Lin et al. (2019)).

Impact of Number of Hops (K). We investigate the impact of the hyperparameter K for MHGRN by its performance on CommonsenseQA (Fig. 6). Increasing K continues to bring benefits until K = 3; however, performance begins to drop when K > 3. This might be attributed to exponential noise in longer relational paths in the knowledge graph.

Figure 6: Effect of K in MHGRN. We show the IHdev accuracy of MHGRN on CommonsenseQA w.r.t. the number of reasoning hops (K = 1: 73.24, K = 2: 73.93, K = 3: 74.65, K = 4: 73.65, K = 5: 74.16).

Figure 7: Analysis of model scalability. Comparison of per-batch training efficiency (s/batch) w.r.t. the number of reasoning hops K for RGCN and MHGRN.

6.3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). Both grow linearly w.r.t. K. Although the theoretical complexity of MHGRN is m times that of RGCN, the ratio of their empirical cost only approaches 2, demonstrating that our model can be better parallelized.

6.4 Model Interpretability

We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the questions and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right provides a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and answer entities, in a way that is coherent with the latent relation between CHAPEL and the desired answer in the question.

Figure 8: Case study on model interpretability. We present two sample questions from CommonsenseQA ("Where is known for a multitude of wedding chapels?" and "Why do parents encourage their kids to play baseball?") with the reasoning paths output by MHGRN.

7 Related Work

Knowledge-Aware Methods for NLP. Various work has investigated the potential to empower NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

The recent success of pre-trained LMs motivates many (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence, and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and the maximum input length of pre-trained LMs.

Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, they only perform single-hop message passing and cannot be interpreted at the path level. Other work (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregates for a node its K-hop neighbors based on node-wise distances, but is designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and being interpretable via maintaining paths as reasoning chains.

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 21-29. PMLR.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6120-6129, Florence, Italy. Association for Computational Linguistics.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4220-4230. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1263-1272. PMLR.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2737-2747, Florence, Italy. Association for Computational Linguistics.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 282-289. Morgan Kaufmann.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785-794, Copenhagen, Denmark. Association for Computational Linguistics.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.

Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2007-2012, Lisbon, Portugal. Association for Computational Linguistics.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829-2839, Hong Kong, China. Association for Computational Linguistics.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.

Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22-32, Hong Kong, China. Association for Computational Linguistics.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381-2391, Brussels, Belgium. Association for Computational Linguistics.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821-832, Melbourne, Australia. Association for Computational Linguistics.

Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4967-4976.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463-4473, Hong Kong, China. Association for Computational Linguistics.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, volume 10843 of Lecture Notes in Computer Science, pages 593-607. Springer.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 4444-4451. AAAI Press.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149-4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998-6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1442-1451, San Diego, California. Association for Computational Linguistics.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 7208-7215. AAAI Press.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346-2357, Florence, Italy. Association for Computational Linguistics.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1436-1446, Vancouver, Canada. Association for Computational Linguistics.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.

A Merging Types of Relations in ConceptNet

Relation                                                        Merged Relation
AtLocation, LocatedNear                                         AtLocation
Causes, CausesDesire, *MotivatedByGoal                          Causes
Antonym, DistinctFrom                                           Antonym
HasSubevent, HasFirstSubevent, HasLastSubevent,
HasPrerequisite, Entails, MannerOf                              HasSubevent
IsA, InstanceOf, DefinedAs                                      IsA
PartOf, *HasA                                                   PartOf
RelatedTo, SimilarTo, Synonym                                   RelatedTo

Table 7: Relations in ConceptNet that are merged in pre-processing. *RelationX indicates the reverse relation of RelationX.

We merge relations that are close in semantics as well as in the general usage of triple instances in ConceptNet.

B Dataset Split Specifications

                       Train   Dev     Test
CommonsenseQA (OF)     9,741   1,221   1,140
CommonsenseQA (IH)     8,500   1,221   1,241
OpenbookQA             4,957   500     500

Table 8: Numbers of instances in different dataset splits.

CommonsenseQA⁷ and OpenbookQA⁸ both have their leaderboards, with the training and development sets publicly available. As the ground truth labels for CommonsenseQA are not readily available for model analysis, we take 1,241 examples from the official training examples as our in-house test examples and regard the remaining 8,500 ones as our in-house training examples (CommonsenseQA (IH)).

⁷ https://www.tau-nlp.org/commonsenseqa
⁸ https://leaderboard.allenai.org/open_book_qa/submissions/public

C Implementation Details

                  CommonsenseQA   OpenbookQA
BERT-BASE         3 × 10⁻⁵        -
BERT-LARGE        2 × 10⁻⁵        -
ROBERTA-LARGE     1 × 10⁻⁵        1 × 10⁻⁵

Table 9: Learning rates for text encoders on different datasets.

             CommonsenseQA   OpenbookQA
RN           3 × 10⁻⁴        3 × 10⁻⁴
RGCN         1 × 10⁻³        1 × 10⁻³
GconAttn     3 × 10⁻⁴        1 × 10⁻⁴
MHGRN        1 × 10⁻³        1 × 10⁻³

Table 10: Learning rates for graph encoders on different datasets.

             #Param
RN           399K
RGCN         365K
GconAttn     453K
MHGRN        544K

Table 11: Numbers of parameters of different graph encoders.

Our models are implemented in PyTorch. We use the cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder. We tune learning rates for text encoders and graph encoders on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA, and ROBERTA-LARGE on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best performance on the development set, as listed in Table 9. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rate of their graph encoders is chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on their best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 10. In training, we set the maximum input sequence length to text encoders to 64, the batch size to 32, and perform early stopping. AristoRoBERTaV7 + MHGRN is the only exception: in order to host a fair comparison, we follow AristoRoBERTaV7 and set the batch size to 16, the max input sequence length to 256, and choose a decoder learning rate from {1 × 10⁻³, 2 × 10⁻⁵}.

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in ConceptNet, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-d vector as its corresponding node feature. We use this set of features for all our implemented models.
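A hedged sketch of this feature-initialization step follows, using the HuggingFace transformers API; the triple-to-sentence template and the occurrence-matching logic are simplified stand-ins for the procedure described above.

```python
# A hedged sketch of node-feature initialization (simplified template and matching).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased").eval()

@torch.no_grad()
def entity_features(triples):
    """For each entity, average the last-layer token embeddings of its occurrences
    across all templated sentences to obtain a 1024-d feature vector."""
    sums, counts = {}, {}
    for head, relation, tail in triples:
        sentence = f"{head} {relation} {tail}"               # simplified template
        enc = tokenizer(sentence, return_tensors="pt")
        hidden = bert(**enc).last_hidden_state[0]            # (seq_len, 1024)
        ids = enc["input_ids"][0].tolist()
        for entity in (head, tail):
            ent_ids = tokenizer(entity, add_special_tokens=False)["input_ids"]
            for start in range(len(ids) - len(ent_ids) + 1): # locate the entity's tokens
                if ids[start:start + len(ent_ids)] == ent_ids:
                    vec = hidden[start:start + len(ent_ids)].mean(dim=0)
                    sums[entity] = sums.get(entity, 0) + vec
                    counts[entity] = counts.get(entity, 0) + 1
                    break
    return {e: sums[e] / counts[e] for e in sums}
```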

We use a 2-layer RGCN and a single-layer MHGRN across our experiments.

The numbers of parameters for the different graph encoders are listed in Table 11.

D Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

Z^k = (D^k)^{-1} \sum_{(r_1, \ldots, r_k) \in \mathcal{R}^k} \beta(r_1, \ldots, r_k, s) \cdot G A_{r_k} \cdots A_{r_1} F X W_{r_1}^{1\top} \cdots W_{r_k}^{k\top} W_0^{k+1\top} \cdots W_0^{K\top} \quad (1 \le k \le K)    (13)

where G = diag(exp([g(φ(v_1), s), …, g(φ(v_n), s)])) (F is similarly defined), A_r is the adjacency matrix for relation r, and D^k is defined as follows:

D^k = \mathrm{diag}\Big( \sum_{(r_1, \ldots, r_k) \in \mathcal{R}^k} \beta(r_1, \ldots, r_k, s) \cdot G A_{r_k} \cdots A_{r_1} F X \mathbf{1} \Big) \quad (1 \le k \le K)    (14)

Using this matrix formulation, we can compute Eq. 7 using dynamic programming:

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W_r^t (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

1:  W^K ← I
2:  for k ← K − 1 to 1 do
3:      W^k ← W_0^{k+1} W^{k+1}
4:  end for
5:  for r ∈ R do
6:      M_r ← F X
7:  end for
8:  for k ← 1 to K do
9:      if k > 1 then
10:         for r ∈ R do
11:             M′_r ← e^{δ(r,s)} A_r (Σ_{r′∈R} e^{τ(r′,r)} M_{r′}) W_r^k
12:         end for
13:         for r ∈ R do
14:             M_r ← M′_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r,s)} · A_r M_r W_r^k
19:         end for
20:     end if
21:     Z^k ← G Σ_{r∈R} M_r W^k
22: end for
23: Replace W_r^t (0 ≤ r ≤ m, 1 ≤ t ≤ k) with identity matrices and X with 1, and re-run line 1 - line 19 to compute d^1, …, d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, …, Z^K
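As a companion to Algorithm 1, the NumPy sketch below computes the unnormalized Z^k of Eq. 13 with the same recurrences (the normalization re-run of lines 23-26 is omitted); the dense tensor layout and index conventions are assumptions of this sketch, not the released code.

```python
# An unnormalized NumPy sketch of Algorithm 1 (dense matrices; illustration only).
import numpy as np

def multihop_messages(X, A, W, W0, F, G, delta, tau, K):
    """X: (n, d) node features; A: (m + 1, n, n), A[r, i, j] = 1 iff (j, r, i) in E (index 0 unused);
    W: (K + 1, m + 1, d, d) with W[t, r] = W_r^t; W0: (K + 1, d, d) padding matrices W_0^t;
    F, G: (n, n) diagonal matrices of exp(f(.)) and exp(g(.)); delta: (m + 1,); tau: (m + 1, m + 1)."""
    d = X.shape[1]
    m = A.shape[0] - 1
    pad = [np.eye(d) for _ in range(K + 1)]        # pad[k] = W_0^{k+1} ... W_0^K (lines 1-4)
    for k in range(K - 1, 0, -1):
        pad[k] = W0[k + 1] @ pad[k + 1]
    M = {r: F @ X for r in range(1, m + 1)}        # lines 5-7
    Z = {}
    for k in range(1, K + 1):                      # lines 8-22
        if k == 1:
            M = {r: np.exp(delta[r]) * (A[r] @ M[r] @ W[1, r]) for r in range(1, m + 1)}
        else:
            M = {r: np.exp(delta[r]) *
                    (A[r] @ sum(np.exp(tau[rp, r]) * M[rp] for rp in range(1, m + 1)) @ W[k, r])
                 for r in range(1, m + 1)}
        Z[k] = G @ sum(M.values()) @ pad[k]        # unnormalized Z^k of Eq. 13
    return Z
```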

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

\mathrm{KHopRN}(\mathcal{G}; W, E, H) = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i) \in \Phi_k \\ j \in \mathcal{Q},\ i \in \mathcal{A}}} \beta(j, r_1, \ldots, r_k, i) \cdot W\big(h_j \oplus (e_{r_1} \circ \cdots \circ e_{r_k}) \oplus h_i\big)    (15)

where ∘ denotes element-wise product and β(⋯) = 1 / (K|A| · |{(j′, …, i) ∈ G | j′ ∈ Q}|) defines the pooling weights.

F Expressing K-hop RN with MHGRN

Theorem 1. Given any W, E, H, there exists a parameter setting such that the output of the model becomes KHopRN(G; W, E, H) for arbitrary G.

Proof. Suppose W = [W_1, W_2, W_3], where W_1, W_3 ∈ R^{d_3×d_1} and W_2 ∈ R^{d_3×d_2}. For MHGRN, we set the parameters as follows: H = H, U_* = [I; 0] ∈ R^{(d_1+d_2)×d_1}, b_* = [0; 1]^⊤ ∈ R^{d_1+d_2}, W_r^t = diag(1 ⊕ e_r) ∈ R^{(d_1+d_2)×(d_1+d_2)} (r ∈ R, 1 ≤ t ≤ K), V = W_3 ∈ R^{d_3×d_1}, V′ = [W_1, W_2] ∈ R^{d_3×(d_1+d_2)}. We disable the relation type attention module and enable message passing only from Q to A. By further choosing σ as the identity function and performing pooling over A, we observe that the output of MHGRN becomes

\begin{aligned}
\frac{1}{|\mathcal{A}|}\sum_{i\in\mathcal{A}} h'_i
&= \frac{1}{|\mathcal{A}|}\sum_{i\in\mathcal{A}} (V h_i + V' z_i) \\
&= \frac{1}{K|\mathcal{A}|}\sum_{i\in\mathcal{A}}\sum_{k=1}^{K} (V h_i + V' z_i^k) \\
&= \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i)\in\Phi_k \\ j\in\mathcal{Q},\ i\in\mathcal{A}}} \beta(j, r_1, \ldots, r_k, i)\,(V h_i + V' W_{r_k}^k \cdots W_{r_1}^1 x_j) \\
&= \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i)\in\Phi_k \\ j\in\mathcal{Q},\ i\in\mathcal{A}}} \beta(\cdots)\,(V h_i + V' W_{r_k}^k \cdots W_{r_1}^1 U_{\phi(j)} h_j + V' W_{r_k}^k \cdots W_{r_1}^1 b_{\phi(j)}) \\
&= \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i)\in\Phi_k \\ j\in\mathcal{Q},\ i\in\mathcal{A}}} \beta(\cdots)\,(W_3 h_i + W_1 h_j + W_2 (e_{r_1} \circ \cdots \circ e_{r_k})) \\
&= \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i)\in\Phi_k \\ j\in\mathcal{Q},\ i\in\mathcal{A}}} \beta(\cdots)\,W\big(h_j \oplus (e_{r_1} \circ \cdots \circ e_{r_k}) \oplus h_i\big) \\
&= \mathrm{KHopRN}(\mathcal{G}; W, E, H).
\end{aligned}    (16)

Page 2: arXiv:2005.00646v1 [cs.CL] 1 May 20200 80 #node 3.1e3 #path K=2 0 80 #node 6.3e4 K=2 K=3 Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on Common-senseQA

0 80node

31e3

pat

hK=2

0 80node

63e4K=2K=3

Figure 2 Number of K-hop relational paths wrtthe node count in extracted graphs on Common-senseQA Left The path count is polynomial wrt thenumber of nodes Right The path count is exponentialwrt the number of hops

A straightforward approach to leveraging a knowledge graph is to directly model these relational paths. KagNet (Lin et al., 2019) and MHPGM (Bauer et al., 2018) model multi-hop relations by extracting relational paths from the KG and encoding them with sequence models. Applying attention mechanisms over these relational paths can further offer good interpretability. However, these models are hardly scalable, because the number of possible paths in a graph is (1) polynomial w.r.t. the number of nodes and (2) exponential w.r.t. the path length (see Fig. 2). Therefore, some works (Weissenborn et al., 2017; Mihaylov and Frank, 2018) resort to using only one-hop paths, namely triples, to balance scalability and reasoning capacity.

Graph neural networks (GNNs), in contrast, enjoy better scalability via their message-passing formulation, but usually lack transparency. The most commonly used GNN variant, Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017), performs message passing by aggregating neighborhood information for each node but ignores relation types. RGCNs (Schlichtkrull et al., 2018) generalize GCNs by performing relation-specific aggregation, which makes them applicable to multi-relational graphs. However, these models do not distinguish the importance of different neighbors or relation types, and thus cannot provide explicit relational paths for interpreting model behavior.

In this paper, we propose a novel graph encoding architecture, Multi-hop Graph Relation Network (MHGRN), which combines the strengths of path-based models and GNNs. Our model inherits scalability from GNNs by preserving the message-passing formulation. It also enjoys the interpretability of path-based models by incorporating a structured relational attention mechanism. Our key motivation is to perform multi-hop message passing within a single layer, allowing each node to directly attend to its multi-hop neighbours for multi-hop relational reasoning. We outline the favorable features of knowledge-aware QA models in Table 1 and compare MHGRN with them.

                              GCN   RGCN   KagNet   MHGRN
Multi-Relational Encoding      ✗     ✓       ✓       ✓
Interpretable                  ✗     ✗       ✓       ✓
Scalable w.r.t. #nodes         ✓     ✓       ✗       ✓
Scalable w.r.t. #hops          ✓     ✓       ✗       ✓

Table 1: Properties of our MHGRN and other representative models for graph encoding.

We summarize the main contributions of this work as follows: 1) We propose MHGRN, a novel model architecture tailored to multi-hop relational reasoning, which explicitly models multi-hop relational paths at scale. 2) We propose a structured relational attention mechanism for efficient and interpretable modeling of multi-hop reasoning paths, along with its training and inference algorithms. 3) We conduct extensive experiments on two question answering datasets and show that our models bring significant improvements compared to knowledge-agnostic PTLMs and outperform other graph encoding methods by a large margin.

2 Problem Formulation and Overview

In this paper, we limit the scope to the task of multiple-choice question answering, although it can be easily generalized to other knowledge-guided tasks (e.g., natural language inference). The overall paradigm of knowledge-aware QA is illustrated in Fig. 3. Formally, given an external knowledge graph (KG) as the knowledge source and a question q, our goal is to identify the correct answer from a set C of given options. We turn this problem into measuring the plausibility score between q and each option a ∈ C, then selecting the one with the highest plausibility score.

Denote the vector representations of question q and option a as q and a. To measure the score for q and a, we first concatenate them to form a statement vector s = [q; a]. Then we extract from the external KG a subgraph G (i.e., the schema graph in KagNet (Lin et al., 2019)) with the guidance of s (detailed in §5.1). This contextualized subgraph is defined as a multi-relational graph G = (V, E, φ). Here V is a subset of the entities in the external KG, containing only those relevant to s. E ⊆ V × R × V is the set of edges that connect nodes in V, where R = {1, ⋯, m} are the ids of all pre-defined relation types. The mapping function φ(i): V → T = {E_q, E_a, E_o} takes node i ∈ V as input and outputs E_q if i is an entity mentioned in q, E_a if it is mentioned in a, or E_o otherwise. We finally encode the statement to s, G to g, and concatenate s and g to calculate the plausibility score.

Figure 3: Overview of the knowledge-aware QA framework. It integrates the output from the graph encoder (for relational reasoning over contextual subgraphs) and the text encoder (for textual understanding) to generate the plausibility score for an answer option.
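To make the paradigm concrete, below is a minimal sketch of this scoring pipeline. All names (`text_encoder`, `extract_subgraph`, `graph_encoder`, `mlp`) are placeholders for illustration, not the released API.

```python
import torch

def plausibility_score(question, option, kg, text_encoder, extract_subgraph,
                       graph_encoder, mlp):
    """Sketch of the knowledge-aware QA paradigm in Fig. 3 (names are placeholders)."""
    s = text_encoder(question, option)            # statement vector s = [q; a]
    G = extract_subgraph(question, option, kg)    # contextualized subgraph (Sec. 5.1)
    g = graph_encoder(G, s)                       # pooled graph representation g
    return mlp(torch.cat([s, g], dim=-1))         # plausibility score for option a
```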

3 Background: Multi-Relational Graph Encoding Methods

We leave the encoding of s to pre-trained language models and focus on the challenge of encoding the graph G to capture latent relations between entities. Current methods for encoding multi-relational graphs mainly fall into two categories: GNNs and path-based models. GNNs encode structured information by passing messages between nodes, directly operating on the graph structure, while path-based methods first decompose the graph into paths and then pool features over them.

Graph Encoding with GNNs. For a graph with n nodes, a graph neural network (GNN) takes a set of node features {h_1, h_2, …, h_n} as input and computes their corresponding node embeddings {h′_1, h′_2, …, h′_n} via message passing (Gilmer et al., 2017). A compact graph representation for G can thus be obtained by pooling over the node embeddings {h′_i}:

\text{GNN}(\mathcal{G}) = \text{Pool}(\{h'_1, h'_2, \ldots, h'_n\})  (1)

As a notable variant of GNNs, graph convolutional networks (GCNs) (Kipf and Welling, 2017) additionally update node embeddings by aggregating messages from their direct neighbors. RGCNs (Schlichtkrull et al., 2018) extend GCNs to encode multi-relational graphs by defining a relation-specific weight matrix W_r for each edge type:

h'_i = \sigma\Big( \big(\sum_{r \in \mathcal{R}} |\mathcal{N}_i^r|\big)^{-1} \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} W_r h_j \Big)  (2)

where N_i^r denotes the neighbors of node i under relation r.²

²For simplicity, we assume a single graph convolutional layer; in practice, multiple layers are stacked to enable message passing from multi-hop neighbors.
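As a reading aid, here is a minimal, unoptimized sketch of the aggregation in Eq. 2, written over an explicit edge list; real RGCN implementations batch this with sparse matrix products, and the self-loop and basis-decomposition terms are omitted here.

```python
import torch

def rgcn_layer(h, edges, W, sigma=torch.relu):
    """One RGCN update as in Eq. 2 (illustrative sketch).

    h:     (n, d) node features
    edges: iterable of (j, r, i) triples meaning j --r--> i
    W:     (m, d, d) tensor holding one weight matrix per relation type
    """
    n, d = h.shape
    agg = torch.zeros(n, d)
    cnt = torch.zeros(n, 1)
    for j, r, i in edges:
        agg[i] += W[r] @ h[j]        # relation-specific message
        cnt[i] += 1                  # total neighbor count over all relations
    return sigma(agg / cnt.clamp(min=1))
```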

While GNNs have proven to have good scalability, their reasoning is done at the node level, making them incompatible with modeling paths. This property also hinders the model's decisions from being interpretable at the path level.

Graph Encoding with Path-Based Models. In addition to directly modeling the graph with GNNs, a graph can also be viewed as a set of relational paths connecting pairs of entities.

Relation Networks (RNs) (Santoro et al., 2017) can be adapted to multi-relational graph encoding under QA settings. RNs use MLPs to encode all triples (one-hop paths) in G whose head entity is in Q = {j | φ(j) = E_q} and tail entity is in A = {i | φ(i) = E_a}. They then pool the triple embeddings to generate a vector for G as follows:

\text{RN}(\mathcal{G}) = \text{Pool}\big(\{\text{MLP}(h_j \oplus e_r \oplus h_i) \mid j \in Q,\ i \in A,\ (j, r, i) \in E\}\big)  (3)

Here h_j and h_i are features for nodes j and i, e_r is the embedding of relation r ∈ R, and ⊕ denotes vector concatenation.
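A minimal sketch of Eq. 3 with mean pooling; `mlp` is assumed to be any module mapping a concatenated triple vector to a feature vector.

```python
import torch

def relation_network(h, e, triples, Q, A, mlp):
    """Eq. 3 with mean pooling (sketch).

    h: (n, d) node features   e: (m, d_r) relation embeddings
    triples: iterable of (j, r, i);  Q / A: sets of question / answer node ids
    """
    feats = [mlp(torch.cat([h[j], e[r], h[i]]))
             for j, r, i in triples if j in Q and i in A]
    return torch.stack(feats).mean(dim=0)
```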

To further equip RN with the ability to model nondegenerate paths, KagNet (Lin et al., 2019) adopts LSTMs to encode all paths connecting question entities and answer entities with lengths no more than K. It then aggregates all path embeddings via an attention mechanism:

\text{KagNet}(\mathcal{G}) = \text{Pool}\big(\{\text{LSTM}(j, r_1, \ldots, r_k, i) \mid (j, r_1, j_1), \ldots, (j_{k-1}, r_k, i) \in E,\ 1 \le k \le K\}\big)  (4)

4 Proposed Method: Multi-Hop Graph Relation Network (MHGRN)

This section presents Multi-hop Graph Relation Network (MHGRN), a novel GNN architecture that unifies both GNNs and path-based models. MHGRN inherits path-level reasoning and interpretability from path-based models, while preserving the good scalability of GNNs.

Figure 4: Our proposed MHGRN architecture for relational reasoning. MHGRN takes a multi-relational graph G and a (question-answer) statement vector s as input, and outputs a scalar that represents the plausibility score of this statement. (The figure depicts node features from a text encoder (e.g., BERT), followed by (1) a node-type-specific transformation, (2) multi-hop message passing with structured relational attention α(j, r_1, …, r_k, i) and message aggregation, and (3) a non-linear activation; attentive pooling and an MLP over [g; s] then produce the plausibility score.)

4.1 MHGRN: Model Architecture

We follow the GNN framework introduced in §3, where node features can be initialized with pre-trained weights (details in Appendix C). Here we focus on the computation of node embeddings.

Type-Specific Transformation. To make our model aware of the node type φ, we first perform a node-type-specific linear transformation on the input node features:

x_i = U_{\phi(i)} h_i + b_{\phi(i)}  (5)

where the learnable parameters U and b are specific to the type of node i.

Multi-Hop Message Passing. As mentioned before, our motivation is to endow GNNs with the capability of directly modeling paths. To this end, we propose to pass messages directly over all the relational paths of lengths up to K. The set of valid k-hop relational paths is defined as:

\Phi_k = \{(j, r_1, \ldots, r_k, i) \mid (j, r_1, j_1), \ldots, (j_{k-1}, r_k, i) \in E\} \quad (1 \le k \le K)  (6)

We perform k-hop (1 ≤ k ≤ K) message passing over these paths, which is a generalization of the single-hop message passing in RGCNs (see Eq. 2):

z_i^k = \sum_{(j, r_1, \ldots, r_k, i) \in \Phi_k} \frac{\alpha(j, r_1, \ldots, r_k, i)}{d_i^k} \cdot W_0^K \cdots W_0^{k+1} W_{r_k}^k \cdots W_{r_1}^1 x_j \quad (1 \le k \le K)  (7)

where the W_r^t (1 ≤ t ≤ K, 0 ≤ r ≤ m) matrices are learnable³, α(j, r_1, …, r_k, i) is an attention score elaborated in §4.2, and d_i^k = \sum_{(j, \cdots, i) \in \Phi_k} α(j, ⋯, i) is the normalization factor. The matrices {W_{r_k}^k ⋯ W_{r_1}^1 | 1 ≤ r_1, …, r_k ≤ m} can be interpreted as a low-rank approximation of an m × ⋯ × m (k times) × d × d tensor that assigns a separate transformation to each k-hop relation, where d is the dimension of x_i.
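For intuition only, the sketch below evaluates Eq. 7 by brute-force path enumeration, which spells out the semantics but is exponential in the path length; the actual computation uses the dynamic program of Appendix D. The path lists and the attention callable are hypothetical inputs.

```python
import torch
from collections import defaultdict

def multihop_messages_naive(x, paths_by_hop, W, alpha, K):
    """Brute-force evaluation of Eq. 7 (illustrative only).

    x:            (n, d) type-transformed node features (Eq. 5)
    paths_by_hop: dict k -> list of (j, (r_1, ..., r_k), i) paths in Phi_k
    W:            dict (t, r) -> (d, d) matrix; r = 0 is the padding relation
    alpha:        callable (j, rels, i) -> unnormalized attention score
    """
    n, d = x.shape
    z = {k: torch.zeros(n, d) for k in paths_by_hop}
    norm = {k: defaultdict(float) for k in paths_by_hop}
    for k, paths in paths_by_hop.items():
        for j, rels, i in paths:
            msg = x[j]
            for t, r in enumerate(rels, start=1):        # W^1_{r_1} ... W^k_{r_k}
                msg = W[(t, r)] @ msg
            for t in range(k + 1, K + 1):                # padding W^{k+1}_0 ... W^K_0
                msg = W[(t, 0)] @ msg
            a = alpha(j, rels, i)
            z[k][i] += a * msg
            norm[k][i] += a                              # accumulates d^k_i
    for k in z:
        for i, dki in norm[k].items():
            z[k][i] /= dki
    return z
```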

Incoming messages from paths of different lengths are aggregated via an attention mechanism (Vaswani et al., 2017):

z_i = \sum_{k=1}^{K} \text{softmax}\big(\text{bilinear}(s, z_i^k)\big) \cdot z_i^k  (8)
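A minimal sketch of Eq. 8 for a single node, assuming the bilinear scoring is an `nn.Bilinear` module; the exact parameterization in the released code may differ.

```python
import torch
import torch.nn as nn

def aggregate_over_hops(s, z_hops, bilinear: nn.Bilinear):
    """Eq. 8: attend over hop lengths for one node i (sketch).

    s:      (d_s,) statement vector
    z_hops: (K, d) stacked messages z_i^1 ... z_i^K
    """
    scores = torch.stack([bilinear(s.unsqueeze(0), zk.unsqueeze(0)).squeeze()
                          for zk in z_hops])                 # (K,)
    attn = torch.softmax(scores, dim=0)
    return (attn.unsqueeze(-1) * z_hops).sum(dim=0)          # aggregated z_i
```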

Non-linear Activation. Finally, we apply a shortcut connection and a non-linear activation to obtain the output node embeddings:

h'_i = \sigma(V h_i + V' z_i)  (9)

where V and V′ are learnable model parameters and σ is a non-linear activation function.

4.2 Structured Relational Attention

Here we work towards effectively parameterizing the attention score α(j, r_1, …, r_k, i) in Eq. 7 for all k-hop paths without introducing O(m^k) parameters. We first regard it as the probability of a relation sequence (φ(j), r_1, …, r_k, φ(i)) conditioned on s:

\alpha(j, r_1, \ldots, r_k, i) = p\big(\phi(j), r_1, \ldots, r_k, \phi(i) \mid s\big)  (10)

which can naturally be modeled by a probabilistic graphical model, such as a conditional random field (Lafferty et al., 2001):

p(\cdots \mid s) \propto \exp\Big( f(\phi(j), s) + \sum_{t=1}^{k} \delta(r_t, s) + \sum_{t=1}^{k-1} \tau(r_t, r_{t+1}) + g(\phi(i), s) \Big) \triangleq \underbrace{\beta(r_1, \ldots, r_k, s)}_{\text{Relation Type Attention}} \cdot \underbrace{\gamma(\phi(j), \phi(i), s)}_{\text{Node Type Attention}}  (11)

³W_0^t (0 ≤ t ≤ K) are introduced as padding matrices so that K transformations are applied regardless of k, thus ensuring a comparable scale of z_i^k across different k.

G is a dense graph:
Model           Time               Space
K-hop KagNet    O(m^K n^{K+1} K)   O(m^K n^{K+1} K)
K-layer RGCN    O(m n^2 K)         O(m n K)
MHGRN           O(m^2 n^2 K)       O(m n K)

G is a sparse graph with maximum node degree ∆ ≪ n:
Model           Time               Space
K-hop KagNet    O(m^K n ∆^K K)     O(m^K n ∆^K K)
K-layer RGCN    O(m n K ∆)         O(m n K)
MHGRN           O(m^2 n K ∆)       O(m n K)

Table 2: Computation complexity of different K-hop reasoning models on a dense/sparse multi-relational graph with n nodes and m relation types. Despite the quadratic complexity w.r.t. m, MHGRN's time cost is similar to RGCN on GPUs with parallelizable matrix multiplications (cf. Fig. 7).

where f(·), δ(·) and g(·) are parameterized by two-layer MLPs and τ(·) by a transition matrix of shape m × m. Intuitively, β(·) models the importance of a k-hop relation, while γ(·) models the importance of messages from node type φ(j) to φ(i) (e.g., the model can learn to pass messages only from question entities to answer entities).

Our model scores a k-hop relation by decomposing it into both context-aware single-hop relations (modeled by δ) and two-hop relations (modeled by τ). We argue that τ is indispensable; without it, the model may assign high importance to illogical multi-hop relations (e.g., [AtLocation, CapableOf]) or noisy relations (e.g., [RelatedTo, RelatedTo]).
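To make the factorization concrete, the sketch below scores one relation sequence with the four terms of Eq. 11; `f`, `g` and `delta` are assumed to be callables wrapping the two-layer MLPs and `tau` an (m, m) tensor, which is only one possible realization. Normalizing these scores over all candidate paths yields α in Eq. 10.

```python
import torch

def relation_sequence_score(s, phi_j, phi_i, rels, f, g, delta, tau):
    """Unnormalized score exp(...) of Eq. 11 for one k-hop path (sketch).

    s:            statement vector
    phi_j, phi_i: node types of the path's source / target
    rels:         list [r_1, ..., r_k] of relation ids
    f, g, delta:  callables returning scalar scores (two-layer MLPs in the paper)
    tau:          (m, m) transition score matrix over consecutive relations
    """
    score = f(phi_j, s) + g(phi_i, s)                    # node type attention terms
    score = score + sum(delta(r, s) for r in rels)       # single-hop relation terms
    score = score + sum(tau[rels[t], rels[t + 1]]        # relation transition terms
                        for t in range(len(rels) - 1))
    return torch.exp(torch.as_tensor(score))
```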

4.3 Computation Complexity Analysis

Although the message passing process in Eq. 7 and the attention module in Eq. 11 handle a potentially exponential number of paths, they can be computed in linear time with dynamic programming (see Appendix D). As summarized in Table 2, the time complexity and space complexity of MHGRN on a sparse graph are both linear w.r.t. either the maximum path length K or the number of nodes n.

4.4 Expressive Power of MHGRN

In addition to efficiency and scalability, we now discuss the modeling capacity of MHGRN. With the message passing formulation and relation-specific transformations, it is by nature a generalization of RGCN. It is also capable of directly modeling paths, which makes it interpretable, as are path-based models like RN and KagNet. To show this, we first generalize RN (Eq. 3) to the multi-hop setting and introduce K-hop RN (formal definition in Appendix E), which models a multi-hop relation as the composition of single-hop relations. We show that MHGRN is capable of representing K-hop RN (proof in Appendix F).

4.5 Learning, Inference and Path Decoding

We now discuss the learning and inference process of MHGRN, instantiated for QA tasks. Following the problem formulation in §2, we aim to determine the plausibility of an answer option a ∈ C given the question q, using the information from both the text s and the graph G. We first obtain the graph representation g by performing attentive pooling over the output node embeddings of answer entities {h′_i | i ∈ A}. Next, we concatenate it with the text representation s and compute the plausibility score as ρ(q, a) = MLP(s ⊕ g).

During training, we maximize the plausibility score of the correct answer â by minimizing the cross-entropy loss:

\mathcal{L} = \mathbb{E}_{q, \hat{a}, C} \Big[ -\log \frac{\exp(\rho(q, \hat{a}))}{\sum_{a \in C} \exp(\rho(q, a))} \Big]  (12)

The whole model is trained end-to-end, jointly with the text encoder (e.g., RoBERTa).
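A minimal sketch of the scoring and training objective (Eq. 12), assuming one statement vector and one pooled graph vector per answer option; the MLP and batching details are placeholders.

```python
import torch
import torch.nn.functional as F

def qa_training_loss(s_options, g_options, gold_idx, mlp):
    """Cross-entropy over answer options as in Eq. 12 (sketch).

    s_options: (|C|, d_s) statement vectors, one per option
    g_options: (|C|, d_g) pooled graph vectors, one per option
    gold_idx:  index of the correct answer within C
    mlp:       maps [s; g] to a scalar plausibility score rho(q, a)
    """
    rho = mlp(torch.cat([s_options, g_options], dim=-1)).squeeze(-1)   # (|C|,)
    return F.cross_entropy(rho.unsqueeze(0), torch.tensor([gold_idx]))
```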

During inference, we predict the most plausible answer by argmax_{a∈C} ρ(q, a). Additionally, we can decode a reasoning path as evidence for model predictions, endowing our model with the interpretability enjoyed by path-based models. Specifically, we first determine the answer entity i* with the highest score in the pooling layer and the path length k* with the highest score in Eq. 8. Then the reasoning path is decoded by argmax α(j, r_1, …, r_{k*}, i*), which can be computed in linear time using dynamic programming.

5 Experimental Setup

We introduce how we construct G (§5.1), the datasets (§5.2), as well as the baseline methods (§5.3). Appendix C gives more implementation and experimental details for reproducibility.

5.1 Extracting G from an External KG

We use ConceptNet (Speer et al., 2017), a general-domain knowledge graph, as our external KG to test models' ability to harness a structured knowledge source. Following KagNet (Lin et al., 2019), we merge relation types to increase graph density and add reverse relations to construct a multi-relational

Methods                             BERT-Base                      BERT-Large                     RoBERTa-Large
                                    IHdev-Acc(%)   IHtest-Acc(%)   IHdev-Acc(%)   IHtest-Acc(%)   IHdev-Acc(%)   IHtest-Acc(%)
w/o KG                              57.31 (±1.07)  53.47 (±0.87)   61.06 (±0.85)  55.39 (±0.40)   73.07 (±0.45)  68.69 (±0.56)
RGCN (Schlichtkrull et al., 2018)   56.94 (±0.38)  54.50 (±0.56)   62.98 (±0.82)  57.13 (±0.36)   72.69 (±0.19)  68.41 (±0.66)
GconAttn (Wang et al., 2019)        57.27 (±0.70)  54.84 (±0.88)   63.17 (±0.18)  57.36 (±0.90)   72.61 (±0.39)  68.59 (±0.96)
KagNet† (Lin et al., 2019)          55.57          56.19           62.35          57.16           -              -
RN (1-hop)                          58.27 (±0.22)  56.20 (±0.45)   63.04 (±0.58)  58.46 (±0.71)   74.57 (±0.91)  69.08 (±0.21)
RN (2-hop)                          59.81 (±0.76)  56.61 (±0.68)   63.36 (±0.26)  58.92 (±0.14)   73.65 (±3.09)  69.59 (±3.80)
MHGRN                               60.36 (±0.23)  57.23 (±0.82)   63.29 (±0.51)  60.59 (±0.58)   74.45 (±0.10)  71.11 (±0.81)

Table 3: Performance comparison on the CommonsenseQA in-house split. We report in-house Dev (IHdev) and Test (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al. (2019) on CommonsenseQA. † indicates results reported in the original paper.

Methods                                   Single   Ensemble
UnifiedQA† (Khashabi et al., 2020)        79.1     -
RoBERTa†                                  72.1     72.5
RoBERTa + KEDGN†                          72.5     74.4
RoBERTa + KE†                             73.3     -
RoBERTa + HyKAS 2.0† (Ma et al., 2019)    73.2     -
RoBERTa + FreeLB† (Zhu et al., 2020)      72.2     73.1
XLNet + DREAM†                            66.9     73.3
XLNet + GR† (Lv et al., 2019)             75.3     -
ALBERT† (Lan et al., 2019)                -        76.5
RoBERTa + MHGRN (K = 2)                   75.4     76.5

Table 4: Performance comparison on the official test set of CommonsenseQA with leaderboard SoTAs⁴ (accuracy in %). † indicates results reported on the leaderboard. UnifiedQA uses T5-11B as its text encoder, whose number of parameters is about 30 times that of the other models.

graph with 34 relation types (details in Appendix A). To extract an informative contextualized graph G from the KG, we recognize entity mentions in s and link them to entities in ConceptNet, with which we initialize our node set V. We then add to V all the entities that appear in any two-hop path between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning, but instead reserve all the edges between nodes in V, forming our G.
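The sketch below illustrates this extraction step under the assumption that the merged ConceptNet is held as a `networkx.MultiDiGraph` and entity linking has already produced the mention sets; the released pipeline may organize this differently.

```python
import networkx as nx

def extract_schema_graph(q_concepts, a_concepts, kg: nx.MultiDiGraph):
    """Contextualized subgraph extraction as described in Sec. 5.1 (sketch).

    q_concepts / a_concepts: ConceptNet nodes linked from the question / answer
    """
    nodes = set(q_concepts) | set(a_concepts)
    for u in q_concepts:
        for v in a_concepts:
            if u in kg and v in kg and u != v:
                # collect intermediate entities on paths of length <= 2 hops
                for path in nx.all_simple_paths(kg, u, v, cutoff=2):
                    nodes.update(path)
    # keep every edge between retained nodes (no pruning)
    return kg.subgraph(nodes).copy()
```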

5.2 Datasets

We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenbookQA. Both require world knowledge beyond textual understanding to perform well.

CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet and are designed to probe latent compositional relations between entities in ConceptNet.

OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open book of science facts. This dataset also probes general common sense beyond the provided facts.

⁴Models based on ConceptNet are no longer shown on the leaderboard; we obtained our results from the organizers.

5.3 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate a KG as an external source as our baselines. Additionally, we directly compare our model with the results from the corresponding leaderboards. These methods typically leverage textual knowledge or extra training data, as opposed to an external KG. In all our implemented models, we use pre-trained LMs as text encoders for s for fair comparison. We do compare our models with those (Ma et al., 2019; Lv et al., 2019; Khashabi et al., 2020) augmented by other text-form external knowledge (e.g., Wikipedia), although we stick to our focus of encoding the structured KG.

Specifically, we fine-tune BERT-BASE, BERT-LARGE (Devlin et al., 2019), and ROBERTA (Liu et al., 2019b) for multiple-choice questions. We take RGCN (Eq. 2 in §3), RN⁵ (Eq. 3 in §3), KagNet (Eq. 4 in §3) and GconAttn (Wang et al., 2019) as baselines. GconAttn generalizes match-LSTM (Wang and Jiang, 2016) and achieves success in language inference tasks.

6 Results and Discussions

In this section, we present the results of our models in comparison with baselines as well as methods on the leaderboards for both CommonsenseQA and OpenbookQA. We also provide an analysis of the models' components and characteristics.

⁵We use mean pooling for 1-hop RN and attentive pooling for 2-hop RN (detailed in Appendix E).

Methods                              Dev (%)         Test (%)
T5-3B† (Raffel et al., 2019)         -               83.20
UnifiedQA† (Khashabi et al., 2020)   -               87.20
RoBERTa-Large (w/o KG)               66.76 (±1.14)   64.80 (±2.37)
+ RGCN                               64.65 (±1.96)   62.45 (±1.57)
+ GconAttn                           64.30 (±0.99)   61.90 (±2.44)
+ RN (1-hop)                         64.85 (±1.11)   63.65 (±2.31)
+ RN (2-hop)                         67.00 (±0.71)   65.20 (±1.18)
+ MHGRN (K = 3)                      68.10 (±1.02)   66.85 (±1.19)
AristoRoBERTaV7†                     79.2            77.8
+ MHGRN (K = 3)                      78.6            80.6

Table 5: Performance comparison on OpenbookQA. † indicates results reported on the leaderboard. T5-3B is 8 times larger than our models; UnifiedQA is 30 times larger.

6.1 Main Results

For CommonsenseQA (Table 3), we first use the in-house data split (IH) of Lin et al. (2019) (cf. Appendix B) to compare our models with the implemented baselines. This is different from the official split used by the leaderboard methods. Almost all KG-augmented models achieve performance gains over vanilla pre-trained LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with ROBERTA-LARGE as the text encoder) on the official split (OF, Table 4) for fair comparison with other methods on the leaderboard, in both the single-model and ensemble-model settings. With T5 (Raffel et al., 2019) as its backbone, UnifiedQA (Khashabi et al., 2020) tops the leaderboard; considering its training cost, we do not intend to compare our ROBERTA-based model with it. We achieve the best performance among the other models.

For OpenbookQA (Table 5), we use the official split and build models with ROBERTA-LARGE as the text encoder. MHGRN surpasses all implemented baselines, with an absolute increase of ~2% on Test. Also, our approach is naturally compatible with methods that utilize textual knowledge or extra data, because in our paradigm the encoding of the textual statement and of the graph are structurally decoupled (Fig. 3). To empirically show that MHGRN can bring gains over systems empowered with textual knowledge, we replace our text encoder with AristoRoBERTaV7⁶ and fine-tune MHGRN on OpenbookQA. Empirically, MHGRN continues to bring benefits to this strong-performing textual-knowledge-empowered system. This indicates that textual knowledge and structured knowledge can potentially be complementary.

⁶https://leaderboard.allenai.org/open_book_qa/submission/blcp1tu91i4gm0vf484g

Methods                                      IHdev-Acc (%)
MHGRN (K = 3)                                74.45 (±0.10)
- Type-specific transformation (§4.1)        73.16 (±0.28)
- Structured relational attention (§4.2)     73.26 (±0.31)
- Relation type attention (§4.2)             73.55 (±0.68)
- Node type attention (§4.2)                 73.92 (±0.65)

Table 6: Ablation study on model components (removing one component each time) using ROBERTA-LARGE as the text encoder. We report the IHdev accuracy on CommonsenseQA.

6.2 Performance Analysis

Ablation Study on Model Components. As shown in Table 6, disabling the type-specific transformation results in a ~1.3% drop in performance, demonstrating the need to distinguish node types for QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.

Impact of the Amount of Training Data. We use different fractions of the CommonsenseQA training data and report the results of fine-tuning text encoders alone and of jointly training the text encoder and graph encoder in Fig. 5. Regardless of the training-data fraction, our model shows consistently larger performance improvements over knowledge-agnostic fine-tuning compared with the other graph encoding methods, indicating MHGRN's complementary strength to text encoders.

Figure 5: Performance change (accuracy in %) w.r.t. the amount of training data on the CommonsenseQA IHtest set (same split as Lin et al. (2019)), comparing MHGRN, KVM, GconAttn and the w/o KG baseline.

Impact of the Number of Hops (K). We investigate the impact of the hyperparameter K for MHGRN via its performance on CommonsenseQA (Fig. 6). Increasing K continues to bring benefits until K = 4; however, performance begins to drop when K > 3. This might be attributed to exponential noise in longer relational paths in the knowledge graph.

Figure 6: Effect of K in MHGRN. We show the IHdev accuracy of MHGRN on CommonsenseQA w.r.t. the number of reasoning hops (K = 1: 73.24, K = 2: 73.93, K = 3: 74.65, K = 4: 73.65, K = 5: 74.16).

Figure 7: Analysis of model scalability. Comparison of per-batch training time (s/batch) of RGCN and MHGRN w.r.t. the number of reasoning hops K.

6.3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). Both grow linearly w.r.t. K. Although the theoretical complexity of MHGRN is m times that of RGCN, the ratio of their empirical costs only approaches 2, demonstrating that our model can be better parallelized.

6.4 Model Interpretability

We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the questions and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right provides a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and the answer entity, in a way that is coherent with the latent relation between CHAPEL and the desired answer in the question.

Figure 8: Case study on model interpretability. We present two sample questions from CommonsenseQA with the reasoning paths output by MHGRN: "Why do parents encourage their kids to play baseball?" with the decoded path BASEBALL --UsedFor⁻¹--> PLAY_BASEBALL --HasProperty--> FUN_TO_PLAY, and "Where is known for a multitude of wedding chapels?" with the decoded path CHAPEL --AtLocation--> RENO --PartOf--> NEVADA.

7 Related Work

Knowledge-Aware Methods for NLP. Various works have investigated the potential to empower NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

The recent success of pre-trained LMs motivates many works (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale, general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information-retrieval techniques. However, these models cannot provide explicit reasoning and evidence, and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and the maximum input length of pre-trained LMs.

Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, these models only perform single-hop message passing and cannot be interpreted at the path level. Other works (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate, for each node, its K-hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and by remaining interpretable via maintaining paths as reasoning chains.

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability.

References

Sami Abu-El-Haija Bryan Perozzi Amol Kapoor

Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR

Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics

Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics

Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR

Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system

Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet

Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics

John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942

Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743

Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics

Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics

Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311

Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics

Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics

Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics

Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051

Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former

Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the

9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008

Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics

Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational

Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Merging Types of Relations in ConceptNet

Relation                                                                              Merged Relation
AtLocation, LocatedNear                                                               AtLocation
Causes, CausesDesire, MotivatedByGoal                                                 Causes
Antonym, DistinctFrom                                                                 Antonym
HasSubevent, HasFirstSubevent, HasLastSubevent, HasPrerequisite, Entails, MannerOf    HasSubevent
IsA, InstanceOf, DefinedAs                                                            IsA
PartOf, HasA                                                                          PartOf
RelatedTo, SimilarTo, Synonym                                                         RelatedTo

Table 7: Relations in ConceptNet that are merged in pre-processing. *RelationX indicates the reverse relation of RelationX.

We merge relations that are close in semantics as well as in the general usage of triple instances in ConceptNet.

B Dataset Split Specifications

                      Train   Dev    Test
CommonsenseQA (OF)    9,741   1,221  1,140
CommonsenseQA (IH)    8,500   1,221  1,241
OpenbookQA            4,957     500    500

Table 8: Numbers of instances in different dataset splits.

CommonsenseQA⁷ and OpenbookQA⁸ both have leaderboards, with their training and development sets publicly available. As the ground-truth labels for CommonsenseQA are not readily available for model analysis, we take 1,241 examples from the official training examples as our in-house test examples and regard the remaining 8,500 as our in-house training examples (CommonsenseQA (IH)).

⁷https://www.tau-nlp.org/commonsenseqa
⁸https://leaderboard.allenai.org/open_book_qa/submissions/public

C Implementation Details

                 CommonsenseQA   OpenbookQA
BERT-BASE        3 × 10⁻⁵         -
BERT-LARGE       2 × 10⁻⁵         -
ROBERTA-LARGE    1 × 10⁻⁵         1 × 10⁻⁵

Table 9: Learning rates for text encoders on different datasets.

             CommonsenseQA   OpenbookQA
RN           3 × 10⁻⁴         3 × 10⁻⁴
RGCN         1 × 10⁻³         1 × 10⁻³
GconAttn     3 × 10⁻⁴         1 × 10⁻⁴
MHGRN        1 × 10⁻³         1 × 10⁻³

Table 10: Learning rates for graph encoders on different datasets.

             #Params
RN           399K
RGCN         365K
GconAttn     453K
MHGRN        544K

Table 11: Numbers of parameters of different graph encoders.

Our models are implemented in PyTorch. We use cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE and BERT-BASE on CommonsenseQA, and ROBERTA-LARGE on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder, based on the best performance on the development set, as listed in Table 9. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rate of their graph encoders is chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³}, based on their best development-set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 10. In training, we set the maximum input sequence length of the text encoders to 64, the batch size to 32, and perform early stopping. AristoRoBERTaV7 + MHGRN is the only exception: to host a fair comparison, we follow AristoRoBERTaV7 and set the batch size to 16, the maximum input sequence length to 256, and choose a decoder learning rate from {1 × 10⁻³, 2 × 10⁻⁵}.

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in ConceptNet, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-dimensional vector as its corresponding node feature. We use this set of features for all our implemented models.
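A sketch of this node-feature construction, assuming the `transformers` library for BERT-LARGE and pre-computed token spans for each entity occurrence; the verbalization templates and tokenization details are simplified.

```python
import torch
from transformers import BertModel, BertTokenizerFast  # assumed dependency

@torch.no_grad()
def build_node_features(triple_sentences, entity_occurrences):
    """Mean-pooled 1024-d BERT-LARGE features per ConceptNet entity (App. C, sketch).

    triple_sentences:   list of templated sentences, one per KG triple
    entity_occurrences: dict entity -> list of (sentence_idx, tok_start, tok_end)
    """
    tok = BertTokenizerFast.from_pretrained("bert-large-uncased")
    bert = BertModel.from_pretrained("bert-large-uncased").eval()
    enc = tok(triple_sentences, return_tensors="pt", padding=True, truncation=True)
    hidden = bert(**enc).last_hidden_state            # (num_sentences, seq_len, 1024)
    feats = {}
    for ent, spans in entity_occurrences.items():
        occ = [hidden[s, a:b].mean(dim=0) for s, a, b in spans]   # mean over tokens
        feats[ent] = torch.stack(occ).mean(dim=0)                 # mean over occurrences
    return feats
```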

We use a 2-layer RGCN and a single-layer MHGRN across our experiments.

The number of parameters for each graph encoder is listed in Table 11.

D Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

Z^k = (D^k)^{-1} \sum_{(r_1, \ldots, r_k) \in \mathcal{R}^k} \beta(r_1, \ldots, r_k, s) \cdot G A_{r_k} \cdots A_{r_1} F X W_{r_1}^{1\top} \cdots W_{r_k}^{k\top} \cdot W_0^{(k+1)\top} \cdots W_0^{K\top} \quad (1 \le k \le K)  (13)

where G = diag(exp([g(φ(v_1), s), …, g(φ(v_n), s)])) (F is similarly defined), A_r is the adjacency matrix for relation r, and D^k is defined as follows:

D^k = \text{diag}\Big( \sum_{(r_1, \ldots, r_k) \in \mathcal{R}^k} \beta(r_1, \ldots, r_k, s) \cdot G A_{r_k} \cdots A_{r_1} F X \mathbf{1} \Big) \quad (1 \le k \le K)  (14)

Using this matrix formulation, we can compute Eq. 7 using dynamic programming:

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W_r^t (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z^1, Z^2, …, Z^K

1:  W̃^K ← I
2:  for k ← K − 1 to 1 do
3:      W̃^k ← W_0^{k+1} W̃^{k+1}
4:  end for
5:  for r ∈ R do
6:      M_r ← F X
7:  end for
8:  for k ← 1 to K do
9:      if k > 1 then
10:         for r ∈ R do
11:             M′_r ← e^{δ(r,s)} A_r (Σ_{r′∈R} e^{τ(r′,r)} M_{r′}) W_r^k
12:         end for
13:         for r ∈ R do
14:             M_r ← M′_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r,s)} · A_r M_r W_r^k
19:         end for
20:     end if
21:     Z^k ← G (Σ_{r∈R} M_r) W̃^k
22: end for
23: Replace W_r^t (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run lines 1–19 to compute d^1, …, d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, …, Z^K
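Below is a dense-tensor sketch of the core recurrence of Algorithm 1, mainly to show that each additional hop costs a fixed number of matrix products rather than a sum over paths. The padding matrices W^t_0 and the normalization pass (lines 23–26) are omitted for brevity, and all argument names are illustrative.

```python
import torch

def multihop_dp(x, A, W, delta, tau, F_diag, G_diag, K):
    """Unnormalized Z^1..Z^K via the recurrence of Algorithm 1 (sketch).

    x:       (n, d) node features       A: (m, n, n), A[r][i, j] = 1 iff (j, r, i) in E
    W:       dict (k, r) -> (d, d) relation-specific matrices, k = 1..K, r = 0..m-1
    delta:   (m,) scores delta(r, s)    tau: (m, m) scores tau(r', r)
    F_diag / G_diag: (n,) entries exp(f(phi(v), s)) and exp(g(phi(v), s))
    """
    m = A.shape[0]
    # 1-hop messages grouped by their (last) relation type
    M = {r: torch.exp(delta[r]) * (A[r] @ (F_diag[:, None] * x) @ W[(1, r)].T)
         for r in range(m)}
    Z = {1: G_diag[:, None] * sum(M.values())}
    for k in range(2, K + 1):
        # extend every accumulated message by one more relation-specific hop
        M = {r: torch.exp(delta[r]) * (A[r] @ sum(
                 torch.exp(tau[rp, r]) * M[rp] for rp in range(m)) @ W[(k, r)].T)
             for r in range(m)}
        Z[k] = G_diag[:, None] * sum(M.values())
    return Z
```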

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

\text{KHopRN}(\mathcal{G}; W, E, H) = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(j, r_1, \ldots, r_k, i) \cdot W \big( h_j \oplus (e_{r_1} \circ \cdots \circ e_{r_k}) \oplus h_i \big)  (15)

where ∘ denotes element-wise product and β(⋯) = 1 / (K |A| · |{(j′, …, i) ∈ G | j′ ∈ Q}|) defines the pooling weights.

F Expressing K-hop RN with MHGRN

Theorem 1. Given any W, E, H, there exists a parameter setting such that the output of the model becomes KHopRN(G; W, E, H) for arbitrary G.

Proof. Suppose W = [W_1, W_2, W_3], where W_1, W_3 ∈ R^{d_3×d_1} and W_2 ∈ R^{d_3×d_2}. For MHGRN, we set the parameters as follows: H = H, U_* = [I; 0] ∈ R^{(d_1+d_2)×d_1}, b_* = [0, 1]^⊤ ∈ R^{d_1+d_2}, W_r^t = diag(1 ⊕ e_r) ∈ R^{(d_1+d_2)×(d_1+d_2)} (r ∈ R, 1 ≤ t ≤ K), V = W_3 ∈ R^{d_3×d_1}, V′ = [W_1, W_2] ∈ R^{d_3×(d_1+d_2)}. We disable the relation type attention module and enable message passing only from Q to A. By further choosing σ as the identity function and performing pooling over A, we observe that the output of MHGRN becomes

\begin{aligned}
\frac{1}{|A|} \sum_{i \in A} h'_i
&= \frac{1}{|A|} \sum_{i \in A} \big(V h_i + V' z_i\big) \\
&= \frac{1}{K|A|} \sum_{i \in A} \sum_{k=1}^{K} \big(V h_i + V' z_i^k\big) \\
&= \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(j, r_1, \ldots, r_k, i) \big(V h_i + V' W_{r_k}^k \cdots W_{r_1}^1 x_j\big) \\
&= \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(\cdots) \big(V h_i + V' W_{r_k}^k \cdots W_{r_1}^1 U_{\phi(j)} h_j + V' W_{r_k}^k \cdots W_{r_1}^1 b_{\phi(j)}\big) \\
&= \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(\cdots) \big(W_3 h_i + W_1 h_j + W_2 (e_{r_1} \circ \cdots \circ e_{r_k})\big) \\
&= \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(\cdots)\, W \big(h_j \oplus (e_{r_1} \circ \cdots \circ e_{r_k}) \oplus h_i\big) \\
&= \text{KHopRN}(\mathcal{G}; W, E, H)
\end{aligned}  (16)

Page 3: arXiv:2005.00646v1 [cs.CL] 1 May 20200 80 #node 3.1e3 #path K=2 0 80 #node 6.3e4 K=2 K=3 Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on Common-senseQA

PlausibilityScore

Graph ExtractionFrom KG

Graph Encoder

Text Encoder

120022

119956119954

119938

Figure 3 Overview of the knowledge-aware QAframework It integrates the output from graph en-coder (for relational reasoning over contextual sub-graphs) and text encoder (for textual understanding) togenerate the plausibility score for an answer option

pre-defined relation types The mapping functionφ(i) ∶ V rarr T = EqEaEo takes node i isin V asinput and outputs Eq if i is an entity mentioned inq Ea if it is mentioned in a or Eo otherwise Wefinally encode the statement to s G to g concate-nate s and g for calculating the plausibility score

3 Background Multi-Relational GraphEncoding Methods

We leave encoding of s to pre-trained languagemodels and focus on the challenge of encodinggraph G to capture latent relations between enti-ties Current methods for encoding multi-relationalgraphs mainly fall into two categories GNNs andpath-based models GNNs encode structured in-formation by passing messages between nodes di-rectly operating on the graph structure while path-based methods first decompose the graph into pathsand then pool features over them

Graph Encoding with GNNs For a graph withn nodes a graph neural network (GNN) takes aset of node features h1h2 hn as input andcomputes their corresponding node embeddingshprime1hprime2 hprimen via message passing (Gilmeret al 2017) A compact graph representation forG can thus be obtained by pooling over the nodeembeddings hprimei

GNN(G) = Pool(hprime1hprime2 hprimen) (1)

As a notable variant of GNNs graph convo-lutional networks (GCNs) (Kipf and Welling2017) additionally update node embeddings byaggregating messages from its direct neighborsRGCNs (Schlichtkrull et al 2018) extend GCNs toencode multi-relational graphs by defining relation-

specific weight matrix Wr for each edge type

hprimei = σ

⎛⎜⎝(sumrisinR

∣N ri ∣)

minus1

sumrisinR

sumjisinN r

i

Wrhj⎞⎟⎠ (2)

where N ri denotes neighbors of node i under rela-

tion r2

While GNNs have proved to have good scala-bility their reasoning is done at the node levelmaking them incompatible with modeling pathsThis property also hinders the modelrsquos decisionsfrom being interpretable at the path levelGraph Encoding with Path-Based Models Inaddition to directly modeling the graph with GNNsa graph can also be viewed as a set of relationalpaths connecting pairs of entities

Relation Networks (RNs) (Santoro et al 2017)can be adapted to multi-relational graph encodingunder QA settings RNs use MLPs to encode alltriples (one-hop paths) in G whose head entity isin Q = j ∣ φ(j) = Eq and tail entity is inA = i ∣ φ(i) = Ea It then pools the tripleembeddings to generate a vector for G as follows

RN(G) = Pool(MLP(hj oplus eroplus

hi) ∣ j isin Q i isin A (j r i) isin E) (3)

Here hj and hi are features for nodes j and i er isthe embedding of relation r isin R oplus denotes vectorconcatenation

To further equip RN with the ability to modelnondegenerate paths KagNet (Lin et al 2019)adopts LSTMs to encode all paths connecting ques-tion entities and answer entities with lengths nomore than K It then aggregates all path embed-dings via attention mechanism

KAGNET(G) = Pool(LSTM(j r1 rk i) ∣

(j r1 j1)⋯ (jkminus1 rk i) isin E 1 le k le K)(4)

4 Proposed Method Multi-Hop GraphRelation Network (MHGRN)

This section presents Multi-hop Graph RelationNetwork (MHGRN) a novel GNN architecturethat unifies both GNNs and path-based modelsMHGRN inherits path-level reasoning and inter-pretabilty from path-based models while preserv-ing good scalability of GNNs

2For simplicity we assume a single graph convolutional

AttentivePooling

2 Multi-hop Message Passing

Node Features

Text Encodereg BERT

1 Node Type Specific Transform

3 Nonlinear Activation

timesLayer

PlausibilityScore

MLP

120022 119956

AggregateMessages

120630(119947 119955120783 hellip 119955119948 119946)

Structured Relational Att

Learn

Child

Child CapableOf UsedForminus1 AtLocation

Desires UsedForminus1 Synonym

AtLocation Synonym

Sit AtLocationUsedForminus1

Desk AtLocation

hellipMulti-hop Message Passing

Figure 4 Our proposed MHGRN architecturefor relational reasoning MHGRN takes a multi-relational graph G and a (question-answer) statementvector s as input and outputs a scalar that represent theplausibility score of this statement

41 MHGRN Model ArchitectureWe follow the GNN framework introduced in sect3where node features can be initialized with pre-trained weights (details in Appendix C) Here wefocus on the computation of node embeddings

Type-Specific Transformation To make ourmodel aware of the node type φ we first performnode type specific linear transformation on the in-put node features

xi = Uφ(i)hi + bφ(i) (5)

where the learnable parameters U and b are specificto the type of node i

Multi-Hop Message Passing As mentioned be-fore our motivation is to endow GNNs with thecapability of directly modeling paths To this endwe propose to pass messages directly over all therelational paths of lengths up to K The set of validk-hop relational paths is defined as

Φk = (j r1 rk i) ∣ (j r1 j1)⋯ (jkminus1 rk i) isin E (1 le k le K) (6)

We perform k-hop (1 le k le K) message passingover these paths which is a generalization of the

layer In practice multiple layers are stacked to enable mes-sage passing from multi-hop neighbors

single-hop message passing in RGCNs (see Eq 2)

zki = sum

(jr1rki)isinΦk

α(j r1 rk i)dki sdotWK0

⋯Wk+10 W

krk⋯W

1r1xj (1 le k le K) (7)

where the Wtr (1 le t le K 0 le r le m) matrices

are learnable3 α(j r1 rk i) is an attentionscore elaborated in sect42 and d

ki = sum(j⋯i)isinΦk

α(j⋯i)is the normalization factor The W k

rk⋯W1r1 ∣ 1 le

r1 rk le m matrices can be interpreted as thelow rank approximation of a mtimes⋯timesmktimesdtimesdtensor that assigns a separate transformation foreach k-hop relation where d is the dim of xi

Incoming messages from paths of differentlengths are aggregated via attention mecha-nism (Vaswani et al 2017)

zi =K

sumk=1

softmax(bilinear(s zki )) sdot zki (8)

Non-linear Activation Finally we apply shortcutconnection and nonlinear activation to obtain theoutput node embeddings

hprimei = σ (V hi + V

primezi) (9)

where V and Vprime are learnable model parameters

and σ is a non-linear activation function

42 Structured Relational AttentionHere we work towards effectively parameterizingthe attention score α(j r1 rk i) in Eq 7 for allk-hop paths without introducing O(mk) parametersWe first regard it as the probability of a relationsequence (φ(j) r1 rk φ(i)) conditioned on s

α(j r1 rk i) = p (φ(j) r1 rk φ(i) ∣ s) (10)

which can naturally be modeled by a probabilisticgraphical model such as conditional random field(Lafferty et al 2001)

p (⋯ ∣ s)prop exp(f(φ(j) s) +k

sumt=1

δ(rt s)

+kminus1

sumt=1

τ(rt rt+1) + g(φ(i) s))

∆= β(r1 rk s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIuml

Relation Type Attention

sdot γ(φ(j) φ(i) s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIumlNode Type Attention

(11)

3W

t0(0 le t le K) are introduced as padding matrices

so that K transformations are applied regardless of k thusensuring comparable scale of zki across different k

Model Time Space

G is a dense graph

K-hop KagNet O (mKnK+1

K) O (mKnK+1

K)K-layer RGCN O (mn2

K) O (mnK)MHGRN O (m2

n2K) O (mnK)

G is a sparse graph with maximum node degree ∆ ≪ n

K-hop KagNet O (mKnK∆

K) O (mKnK∆

K)K-layer RGCN O (mnK∆) O (mnK)MHGRN O (m2

nK∆) O (mnK)

Table 2 Computation complexity of different K-hopreasoning models on a densesparse multi-relationalgraph with n nodes and m relation types Despite thequadratic complexity wrt m MHGRNrsquos time cost issimilar to RGCN on GPUs with parallelizable matrixmultiplications (cf Fig 7)

where f(sdot) δ(sdot) and g(sdot) are parameterized bytwo-layer MLPs and τ(sdot) by a transition matrixof shape m timesm Intuitively β(sdot) models the im-portance of a k-hop relation while γ(sdot) models theimportance of messages from node type φ(j) toφ(i) (eg the model can learn to pass messagesonly from question entities to answer entities)

Our model scores a k-hop relation by decom-posing it into both context-aware single-hop re-lations (modeled by δ) and two-hop relations(modeled by τ ) We argue that τ is indis-pensable without which the model may assignhigh importance to illogical multi-hop relations(eg [AtLocation CapableOf]) or noisy re-lations (eg [RelatedTo RelatedTo])

43 Computation Complexity Analysis

Although message passing process in Eq 7 andattention module in Eq11 handles potentially ex-ponential number of paths it can be computed inlinear time with dynamic programming (see Ap-pendix D) As summarized in Table 2 time com-plexity and space complexity of MHGRN on asparse graph are both linear wrt either the maxi-mum path length K or the number of nodes n

44 Expressive Power of MHGRN

In addition to efficiency and scalability we now dis-cuss the modeling capacity of MHGRN With themessage passing formulation and relation-specifictransformations it is by nature the generalizationof RGCN It is also capable of directly modelingpaths making it interpretable as are path-basedmodels like RN and KagNet To show this we

first generalize RN (Eq 3) to the multi-hop set-ting and introduce K-hop RN (formal definition inAppendix E) which models multi-hop relation asthe composition of single-hop relations We showthat MHGRN is capable of representingK-hop RN(proof in Appendix F)

45 Learning Inference and Path DecodingWe now discuss the learning and inference processof MHGRN instantiated for QA tasks Followingthe problem formulation in sect2 we aim to deter-mine the plausibility of an answer option a isin Cgiven the question q with the information fromboth text s and graph G We first obtain the graphrepresentation g by performing attentive poolingover the output node embeddings of answer entitieshprimei ∣ i isin A Next we concatenate it with the textrepresentation s and compute the plausibility scoreby ρ(qa) = MLP(soplus g)

During training we maximize the plausibilityscore of the correct answer a by minimizing thecross-entropy loss

L = EqaC [minus logexp(ρ(q a))

sumaisinC exp(ρ(qa))] (12)

The whole model is trained end-to-end jointly withthe text encoder (eg RoBERTa)

During inference we predict the most plau-sible answer by argmaxaisinC ρ(qa) Addition-ally we can decode a reasoning path as evidencefor model predictions endowing our model withthe interpretability enjoyed by path-based mod-els Specifically we first determine the answerentity ilowast with the highest score in the pooling layerand the path length klowast with the highest score inEq 8 Then the reasoning path is decoded byargmax α(j r1 rklowast i

lowast) which can be com-puted in linear time using dynamic programming

5 Experimental SetupWe introduce how we construct G (sect51) thedatasets (sect52) as well as the baseline methods(sect53) Appendix C shows more implementationand experimental details for reproducibility

51 Extracting G from External KGWe use ConceptNet (Speer et al 2017) a general-domain knowledge graph as our external KG to testmodelsrsquo ability to harness structured knowledgesource Following KagNet (Lin et al 2019) wemerge relation types to increase graph density andadd reverse relations to construct a multi-relational

Methods                           | BERT-Base IHdev / IHtest (%)   | BERT-Large IHdev / IHtest (%)  | RoBERTa-Large IHdev / IHtest (%)
w/o KG                            | 57.31 (±1.07) / 53.47 (±0.87)  | 61.06 (±0.85) / 55.39 (±0.40)  | 73.07 (±0.45) / 68.69 (±0.56)
RGCN (Schlichtkrull et al., 2018) | 56.94 (±0.38) / 54.50 (±0.56)  | 62.98 (±0.82) / 57.13 (±0.36)  | 72.69 (±0.19) / 68.41 (±0.66)
GconAttn (Wang et al., 2019)      | 57.27 (±0.70) / 54.84 (±0.88)  | 63.17 (±0.18) / 57.36 (±0.90)  | 72.61 (±0.39) / 68.59 (±0.96)
KagNet† (Lin et al., 2019)        | 55.57 / 56.19                  | 62.35 / 57.16                  | - / -
RN (1-hop)                        | 58.27 (±0.22) / 56.20 (±0.45)  | 63.04 (±0.58) / 58.46 (±0.71)  | 74.57 (±0.91) / 69.08 (±0.21)
RN (2-hop)                        | 59.81 (±0.76) / 56.61 (±0.68)  | 63.36 (±0.26) / 58.92 (±0.14)  | 73.65 (±3.09) / 69.59 (±3.80)
MHGRN                             | 60.36 (±0.23) / 57.23 (±0.82)  | 63.29 (±0.51) / 60.59 (±0.58)  | 74.45 (±0.10) / 71.11 (±0.81)

Table 3: Performance comparison on the CommonsenseQA in-house split. We report in-house Dev (IHdev) and Test (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al. (2019) on CommonsenseQA. † indicates results reported in its paper.

Methods                                | Single | Ensemble
UnifiedQA† (Khashabi et al., 2020)     | 79.1   | -
RoBERTa†                               | 72.1   | 72.5
RoBERTa + KEDGN†                       | 72.5   | 74.4
RoBERTa + KE†                          | 73.3   | -
RoBERTa + HyKAS 2.0† (Ma et al., 2019) | 73.2   | -
RoBERTa + FreeLB† (Zhu et al., 2020)   | 72.2   | 73.1
XLNet + DREAM†                         | 66.9   | 73.3
XLNet + GR† (Lv et al., 2019)          | 75.3   | -
ALBERT† (Lan et al., 2019)             | -      | 76.5
RoBERTa + MHGRN (K = 2)                | 75.4   | 76.5

Table 4: Performance comparison on the official test set of CommonsenseQA with leaderboard SoTAs⁴ (accuracy in %). † indicates results reported on the leaderboard. UnifiedQA uses T5-11B as its text encoder, whose number of parameters is about 30 times more than the other models.

This yields a multi-relational graph with 34 relation types (details in Appendix A). To extract an informative contextualized graph G from the KG, we recognize entity mentions in s and link them to entities in ConceptNet, with which we initialize our node set V. We then add to V all the entities that appear in any two-hop path between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning, but instead keep all the edges between nodes in V, forming our G.
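A minimal sketch of this extraction step, assuming the merged ConceptNet graph is available as a networkx (Multi)DiGraph and that entity recognition/linking has already produced the set of mentioned concepts (function and variable names are illustrative):

```python
# Collect mentioned entities plus any entity on a path of at most two hops between
# pairs of mentioned entities, then keep all edges among the collected nodes (no pruning).
import itertools
import networkx as nx

def extract_subgraph(kg, mentioned, max_hops=2):
    nodes = set(mentioned)
    und = kg.to_undirected(as_view=True)                    # hop counting ignores edge direction
    for u, v in itertools.combinations(mentioned, 2):
        if u not in und or v not in und:
            continue
        for path in nx.all_simple_paths(und, u, v, cutoff=max_hops):
            nodes.update(path)                              # add intermediate bridging entities
    return kg.subgraph(nodes).copy()                        # retain all edges among these nodes
```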

5.2 Datasets

We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenbookQA. Both require world knowledge beyond textual understanding to perform well.

CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet and are designed to probe latent compositional relations between entities in ConceptNet.

OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open book of science facts. This dataset also probes general common sense beyond the provided facts.

4 Models based on ConceptNet are no longer shown on the leaderboard; we obtained our results from the organizers.

5.3 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate a KG as an external source as our baselines. Additionally, we directly compare our model with the results from the corresponding leaderboards. These leaderboard methods typically leverage textual knowledge or extra training data, as opposed to an external KG. In all our implemented models, we use pre-trained LMs as text encoders for s for fair comparison. We do compare our models with those (Ma et al., 2019; Lv et al., 2019; Khashabi et al., 2020) augmented by other text-form external knowledge (e.g., Wikipedia), although we stick to our focus of encoding structured KGs.

Specifically, we fine-tune BERT-Base, BERT-Large (Devlin et al., 2019), and RoBERTa (Liu et al., 2019b) for multiple-choice questions. We take RGCN (Eq. 2 in §3), RN⁵ (Eq. 3 in §3), KagNet (Eq. 4 in §3), and GconAttn (Wang et al., 2019) as baselines. GconAttn generalizes match-LSTM (Wang and Jiang, 2016) and achieves success in language inference tasks.

6 Results and Discussions

In this section, we present the results of our models in comparison with baselines, as well as with methods on the leaderboards, for both CommonsenseQA and OpenbookQA. We also provide an analysis of the models' components and characteristics.

5 We use mean pooling for 1-hop RN and attentive pooling for 2-hop RN (detailed in Appendix E).

Methods                           | Dev (%)       | Test (%)
T5-3B† (Raffel et al., 2019)      | -             | 83.20
UnifiedQA† (Khashabi et al., 2020)| -             | 87.20
RoBERTa-Large (w/o KG)            | 66.76 (±1.14) | 64.80 (±2.37)
+ RGCN                            | 64.65 (±1.96) | 62.45 (±1.57)
+ GconAttn                        | 64.30 (±0.99) | 61.90 (±2.44)
+ RN (1-hop)                      | 64.85 (±1.11) | 63.65 (±2.31)
+ RN (2-hop)                      | 67.00 (±0.71) | 65.20 (±1.18)
+ MHGRN (K = 3)                   | 68.10 (±1.02) | 66.85 (±1.19)
AristoRoBERTaV7†                  | 79.2          | 77.8
+ MHGRN (K = 3)                   | 78.6          | 80.6

Table 5: Performance comparison on OpenbookQA. † indicates results reported on the leaderboard. T5-3B is 8 times larger than our models; UnifiedQA is 30x larger.

6.1 Main Results

For CommonsenseQA (Table 3), we first use the in-house data split (IH) of Lin et al. (2019) (cf. Appendix B) to compare our models with the implemented baselines. This is different from the official split used by the leaderboard methods. Almost all KG-augmented models achieve a performance gain over vanilla pre-trained LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with RoBERTa-Large as the text encoder) on the official split (OF; Table 4) for a fair comparison with other methods on the leaderboard, in both the single-model and the ensemble-model settings. With T5 (Raffel et al., 2019) as its backbone, UnifiedQA (Khashabi et al., 2020) tops the leaderboard; considering its training cost, we do not intend to compare our RoBERTa-based model with it. We achieve the best performance among the remaining models.

For OpenbookQA (Table 5), we use the official split and build models with RoBERTa-Large as the text encoder. MHGRN surpasses all implemented baselines, with an absolute increase of ~2% on Test. Also, our approach is naturally compatible with methods that utilize textual knowledge or extra data, because in our paradigm the encodings of the textual statement and the graph are structurally decoupled (Fig. 3). To empirically show that MHGRN can bring gains over textual-knowledge-empowered systems, we replace our text encoder with AristoRoBERTaV7⁶ and fine-tune our MHGRN on OpenbookQA. Empirically, MHGRN continues to bring benefits to this strong-performing textual-knowledge-empowered system, indicating that textual knowledge and structured knowledge can potentially be complementary.

6 https://leaderboard.allenai.org/open_book_qa/submission/blcp1tu91i4gm0vf484g

Methods                                    | IHdev-Acc (%)
MHGRN (K = 3)                              | 74.45 (±0.10)
- Type-specific transformation (§4.1)      | 73.16 (±0.28)
- Structured relational attention (§4.2)   | 73.26 (±0.31)
- Relation type attention (§4.2)           | 73.55 (±0.68)
- Node type attention (§4.2)               | 73.92 (±0.65)

Table 6: Ablation study on model components (removing one component each time) using RoBERTa-Large as the text encoder. We report the IHdev accuracy on CommonsenseQA.

6.2 Performance Analysis

Ablation Study on Model Components. As shown in Table 6, disabling the type-specific transformation results in a ~1.3% drop in performance, demonstrating the need for distinguishing node types in QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.

Impact of the Amount of Training Data. We use different fractions of the CommonsenseQA training data and report results of fine-tuning the text encoders alone and of jointly training the text encoder and graph encoder in Fig. 5. Regardless of the training data fraction, our model shows a consistently larger performance improvement over knowledge-agnostic fine-tuning than the other graph encoding methods, indicating MHGRN's complementary strength to text encoders.

[Figure 5: Performance change (accuracy in %) w.r.t. the amount of training data (20-100%) on the CommonsenseQA IHtest set (same split as Lin et al. (2019)), comparing MultiGRN (MHGRN), KVM, GconAttn, and fine-tuning without KG.]

Impact of Number of Hops (K). We investigate the impact of the hyperparameter K on MHGRN via its performance on CommonsenseQA (Fig. 6). Increasing K continues to bring benefits until K = 3; however, performance begins to drop when K > 3. This might be attributed to exponential noise in longer relational paths in the knowledge graph.

[Figure 6: Effect of K in MHGRN. IHdev accuracy of MHGRN on CommonsenseQA w.r.t. the number of reasoning hops: K=1: 73.24, K=2: 73.93, K=3: 74.65, K=4: 73.65, K=5: 74.16.]

[Figure 7: Analysis of model scalability. Comparison of per-batch training time (s/batch) of RGCN and MHGRN w.r.t. the number of hops K.]

6.3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). Both grow linearly w.r.t. K. Although the theoretical complexity of MHGRN is m times that of RGCN, the ratio of their empirical costs only approaches 2, demonstrating that our model can be better parallelized.

6.4 Model Interpretability

We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the questions and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right shows a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and the answer entity, in a way that is coherent with the latent relation between CHAPEL and the desired answer in the question.

7 Related Work

Knowledge-Aware Methods for NLP. Various works have investigated the potential of empowering NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

[Figure 8: Case study on model interpretability. We present two sample questions from CommonsenseQA with the reasoning paths output by MHGRN. Example 1: "Where is known for a multitude of wedding chapels?" (A. town, B. texas, C. city, D. church building, E. Nevada), with decoded path CHAPEL -AtLocation-> RENO -PartOf-> NEVADA. Example 2: "Why do parents encourage their kids to play baseball?" (A. round, B. cheap, C. break window, D. hard, E. fun to play), with decoded path BASEBALL -UsedFor⁻¹-> PLAY_BASEBALL -HasProperty-> FUN_TO_PLAY.]

The recent success of pre-trained LMs motivates many works (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence, and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and the maximum input length of pre-trained LMs.

Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, these models only perform single-hop message passing and cannot be interpreted at the path level. Other works (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate, for each node, its K-hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and by being interpretable via maintaining paths as reasoning chains.

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pages 21–29. PMLR.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6120–6129, Florence, Italy. Association for Computational Linguistics.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4220–4230, Brussels, Belgium. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pages 1263–1272. PMLR.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR 2017), Conference Track Proceedings. OpenReview.net.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2737–2747, Florence, Italy. Association for Computational Linguistics.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 282–289. Morgan Kaufmann.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.

Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2007–2012, Lisbon, Portugal. Association for Computational Linguistics.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829–2839, Hong Kong, China. Association for Computational Linguistics.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.

Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22–32, Hong Kong, China. Association for Computational Linguistics.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821–832, Melbourne, Australia. Association for Computational Linguistics.

Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30, pages 4967–4976.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China. Association for Computational Linguistics.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference (ESWC 2018), volume 10843 of Lecture Notes in Computer Science, pages 593–607. Springer.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4444–4451. AAAI Press.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations (ICLR 2018), Conference Track Proceedings. OpenReview.net.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1442–1451, San Diego, California. Association for Computational Linguistics.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pages 7208–7215. AAAI Press.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346–2357, Florence, Italy. Association for Computational Linguistics.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1436–1446, Vancouver, Canada. Association for Computational Linguistics.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.

A Merging Types of Relations in ConceptNet

Original Relations                                                                 | Merged Relation
AtLocation, LocatedNear                                                            | AtLocation
Causes, CausesDesire, MotivatedByGoal                                              | Causes
Antonym, DistinctFrom                                                              | Antonym
HasSubevent, HasFirstSubevent, HasLastSubevent, HasPrerequisite, Entails, MannerOf | HasSubevent
IsA, InstanceOf, DefinedAs                                                         | IsA
PartOf, HasA*                                                                      | PartOf
RelatedTo, SimilarTo, Synonym                                                      | RelatedTo

Table 7: Relations in ConceptNet that are merged in pre-processing. RelationX* indicates the reverse relation of RelationX.

We merge relations that are close in semantics as well as in the general usage of triple instances in ConceptNet.

B Dataset Split Specifications

Dataset            | Train | Dev   | Test
CommonsenseQA (OF) | 9,741 | 1,221 | 1,140
CommonsenseQA (IH) | 8,500 | 1,221 | 1,241
OpenbookQA         | 4,957 | 500   | 500

Table 8: Numbers of instances in different dataset splits.

CommonsenseQA⁷ and OpenbookQA⁸ both have leaderboards, with their training and development sets publicly available. As the ground-truth labels of the CommonsenseQA test set are not readily available for model analysis, we take 1,241 examples from the official training examples as our in-house test examples and regard the remaining 8,500 as our in-house training examples (CommonsenseQA (IH)).

7 https://www.tau-nlp.org/commonsenseqa
8 https://leaderboard.allenai.org/open_book_qa/submissions/public
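A sketch of this split, under the assumption that the 1,241 held-out examples are simply taken from the end of the official training list (the released split fixes a specific set of question ids, which may differ):

```python
# Build the in-house (IH) split: hold out n_test official training examples as IH-test,
# keep the rest as IH-train; the official dev set is used unchanged.
def make_inhouse_split(official_train, n_test=1241):
    return official_train[:-n_test], official_train[-n_test:]   # (IH-train, IH-test)
```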

C Implementation Details

Text encoder  | CommonsenseQA | OpenbookQA
BERT-Base     | 3 × 10⁻⁵      | -
BERT-Large    | 2 × 10⁻⁵      | -
RoBERTa-Large | 1 × 10⁻⁵      | 1 × 10⁻⁵

Table 9: Learning rate for text encoders on different datasets.

Graph encoder | CommonsenseQA | OpenbookQA
RN            | 3 × 10⁻⁴      | 3 × 10⁻⁴
RGCN          | 1 × 10⁻³      | 1 × 10⁻³
GconAttn      | 3 × 10⁻⁴      | 1 × 10⁻⁴
MHGRN         | 1 × 10⁻³      | 1 × 10⁻³

Table 10: Learning rate for graph encoders on different datasets.

Graph encoder | #Params
RN            | 399K
RGCN          | 365K
GconAttn      | 453K
MHGRN         | 544K

Table 11: Numbers of parameters of different graph encoders.

Our models are implemented in PyTorch. We use the cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune the learning rates for both on the two datasets. We first fine-tune RoBERTa-Large, BERT-Large, and BERT-Base on CommonsenseQA and RoBERTa-Large on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best performance on the development set, as listed in Table 9. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rate of the graph encoder is chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on the best development set performance with RoBERTa-Large as the text encoder. We report the optimal learning rates for graph encoders in Table 10. In training, we set the maximum input sequence length of the text encoders to 64, the batch size to 32, and perform early stopping. AristoRoBERTaV7 + MHGRN is the only exception: to host a fair comparison, we follow AristoRoBERTaV7 and set the batch size to 16, the maximum input sequence length to 256, and choose a decoder learning rate from {1 × 10⁻³, 2 × 10⁻⁵}.

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-Large, obtaining a sequence of token embeddings from the last layer of BERT-Large for each triple. For each entity in ConceptNet, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024d vector as its corresponding node feature. We use this set of features for all our implemented models.
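The separate learning rates can be realized with optimizer parameter groups. A minimal sketch follows; the attribute names are illustrative, and torch.optim.RAdam is used as a stand-in for the RAdam implementation of Liu et al. (2019a):

```python
# One parameter group for the text encoder and one for the graph encoder, optimized jointly.
import torch

def build_optimizer(model, text_lr=1e-5, graph_lr=1e-3):
    return torch.optim.RAdam([
        {"params": model.text_encoder.parameters(), "lr": text_lr},
        {"params": model.graph_encoder.parameters(), "lr": graph_lr},
    ])
```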

We use a 2-layer RGCN and a single-layer MHGRN across our experiments.
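For the node-feature construction described above, a minimal sketch using the HuggingFace `transformers` BERT-Large interface (the template wording and function names are illustrative; the actual templates may differ):

```python
# Verbalize each ConceptNet triple with a template, encode it with BERT-large, and
# mean-pool the token embeddings of an entity over all sentences in which it occurs.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased").eval()

def entity_feature(entity, triples):
    """triples: list of (head, relation_phrase, tail) tuples containing `entity`."""
    pooled = []
    for h, rel, t in triples:
        sent = f"{h.replace('_', ' ')} {rel} {t.replace('_', ' ')}."   # hypothetical template
        enc = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]                  # (seq_len, 1024)
        # mean over the word-piece positions of the entity's occurrence(s) in this sentence
        ent_ids = tokenizer(entity.replace("_", " "), add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        for start in range(len(ids) - len(ent_ids) + 1):
            if ids[start:start + len(ent_ids)] == ent_ids:
                pooled.append(hidden[start:start + len(ent_ids)].mean(dim=0))
    if not pooled:                                                     # entity not found verbatim
        return torch.zeros(bert.config.hidden_size)
    return torch.stack(pooled).mean(dim=0)                             # 1024-d node feature
```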

The number of parameters for each graph encoder is listed in Table 11.

D Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

$$Z^k = (D^k)^{-1}\sum_{(r_1,\dots,r_k)\in R^k}\beta(r_1,\dots,r_k,s)\cdot G\,A_{r_k}\cdots A_{r_1}F X\,W_{r_1}^{1\,\top}\cdots W_{r_k}^{k\,\top}\,W_0^{k+1\,\top}\cdots W_0^{K\,\top}\quad(1\le k\le K)\tag{13}$$

where $G = \mathrm{diag}\big(\exp([g(\phi(v_1),s),\dots,g(\phi(v_n),s)])\big)$ ($F$ is similarly defined), $A_r$ is the adjacency matrix for relation $r$, and $D^k$ is defined as follows:

$$D^k = \mathrm{diag}\Big(\sum_{(r_1,\dots,r_k)\in R^k}\beta(r_1,\dots,r_k,s)\cdot G\,A_{r_k}\cdots A_{r_1}F\,\mathbf{1}\Big)\quad(1\le k\le K)\tag{14}$$

Using this matrix formulation, we can compute Eq. 7 with dynamic programming (Algorithm 1).

Algorithm 1: Dynamic programming algorithm for multi-hop message passing

Input: s, X, A_r (1 ≤ r ≤ m), W_r^t (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

1:  W_K ← I
2:  for k ← K − 1 to 1 do
3:      W_k ← W_0^{k+1} W_{k+1}
4:  end for
5:  for r ∈ R do
6:      M_r ← F X
7:  end for
8:  for k ← 1 to K do
9:      if k > 1 then
10:         for r ∈ R do
11:             M′_r ← e^{δ(r,s)} A_r ( Σ_{r′∈R} e^{τ(r′,r)} M_{r′} ) W_r^k
12:         end for
13:         for r ∈ R do
14:             M_r ← M′_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r,s)} · A_r M_r W_r^k
19:         end for
20:     end if
21:     Z^k ← G Σ_{r∈R} M_r W_k
22: end for
23: Replace W_r^t (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run lines 1-22 to compute the normalizers d^1, …, d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, …, Z^K
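A simplified PyTorch sketch of this dynamic program follows. It drops the node-type factors F and G, the statement-conditioning of δ and τ, and the output padding matrices W_0, but keeps the core recurrence that aggregates the exponentially many relation sequences with O(K·m²) matrix products (all names are illustrative):

```python
import torch

def multihop_mp(X, A, W, delta, tau, K):
    """X: (n, d) node features; A: (m, n, n) adjacency per relation;
    W: list of K tensors of shape (m, d, d); delta: (m,) relation scores;
    tau: (m, m) relation-transition scores. Returns [Z^1, ..., Z^K], each (n, d)."""
    m, n, _ = A.shape
    results = []
    # run the recurrence twice: once on X (messages) and once on ones (attention normalizers)
    for inp, use_W in ((X, True), (torch.ones(n, 1), False)):
        M = [inp.clone() for _ in range(m)]
        out = []
        for k in range(K):
            new_M = []
            for r in range(m):
                if k == 0:
                    msg = M[r]
                else:
                    # sum over previous relations, weighted by the transition scores
                    msg = sum(torch.exp(tau[rp, r]) * M[rp] for rp in range(m))
                msg = torch.exp(delta[r]) * (A[r] @ msg)
                if use_W:
                    msg = msg @ W[k][r]
                new_M.append(msg)
            M = new_M
            out.append(sum(M))
        results.append(out)
    msgs, norms = results
    return [msgs[k] / norms[k].clamp(min=1e-8) for k in range(K)]
```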

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

$$\mathrm{KHopRN}(\mathcal{G}; W, E, H) = \sum_{k=1}^{K}\;\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\, i\in A}} \beta(j,r_1,\dots,r_k,i)\cdot W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big)\tag{15}$$

where $\circ$ denotes the element-wise product and $\beta(\cdots) = 1/\big(K|A|\cdot|\{(j',\dots,i)\in\mathcal{G}\mid j'\in Q\}|\big)$ defines the pooling weights.

F Expressing K-hop RN with MHGRN

Theorem 1. Given any W, E, H, there exists a parameter setting of MHGRN such that its output becomes KHopRN(G; W, E, H) for arbitrary G.

Proof. Suppose $W = [W_1, W_2, W_3]$, where $W_1, W_3 \in \mathbb{R}^{d_3\times d_1}$ and $W_2 \in \mathbb{R}^{d_3\times d_2}$. For MHGRN, we set the parameters as follows: $H = H$, $U_* = [I; 0] \in \mathbb{R}^{(d_1+d_2)\times d_1}$, $b_* = [0; 1]^\top \in \mathbb{R}^{d_1+d_2}$, $W_r^t = \mathrm{diag}(1 \oplus e_r) \in \mathbb{R}^{(d_1+d_2)\times(d_1+d_2)}$ $(r \in R,\ 1 \le t \le K)$, $V = W_3 \in \mathbb{R}^{d_3\times d_1}$, $V' = [W_1, W_2] \in \mathbb{R}^{d_3\times(d_1+d_2)}$. We disable the relation type attention module and enable message passing only from Q to A. By further choosing σ as the identity function and performing pooling over A, we observe that the output of MHGRN becomes

$$
\begin{aligned}
\frac{1}{|A|}\sum_{i\in A} h'_i
&= \frac{1}{|A|}\sum_{i\in A}\big(V h_i + V' z_i\big)\\
&= \frac{1}{K|A|}\sum_{i\in A}\sum_{k=1}^{K}\big(V h_i + V' z_i^k\big)\\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\,i\in A}}\beta(j,r_1,\dots,r_k,i)\big(V h_i + V' W_{r_k}^{k}\cdots W_{r_1}^{1} x_j\big)\\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\,i\in A}}\beta(\cdots)\big(V h_i + V' W_{r_k}^{k}\cdots W_{r_1}^{1} U_{\phi(j)} h_j + V' W_{r_k}^{k}\cdots W_{r_1}^{1} b_{\phi(j)}\big)\\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\,i\in A}}\beta(\cdots)\big(W_3 h_i + W_1 h_j + W_2 (e_{r_1}\circ\cdots\circ e_{r_k})\big)\\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\,i\in A}}\beta(\cdots)\, W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big)\\
&= \mathrm{KHopRN}(\mathcal{G}; W, E, H).
\end{aligned}\tag{16}
$$



Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Merging Types of Relations in ConceptNet

Relation                                                                             Merged Relation
AtLocation, LocatedNear                                                              AtLocation
Causes, CausesDesire, MotivatedByGoal*                                               Causes
Antonym, DistinctFrom                                                                Antonym
HasSubevent, HasFirstSubevent, HasLastSubevent, HasPrerequisite, Entails, MannerOf   HasSubevent
IsA, InstanceOf, DefinedAs                                                           IsA
PartOf, HasA*                                                                        PartOf
RelatedTo, SimilarTo, Synonym                                                        RelatedTo

Table 7: Relations in ConceptNet that are being merged in pre-processing. RelationX* indicates the reverse relation of RelationX.

We merge relations that are close in semantics and that are used similarly across triple instances in ConceptNet.
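For concreteness, the merge in Table 7 can be implemented as a lookup table. The sketch below is illustrative only: the dictionary contents follow Table 7, and we assume, based on the relation semantics, that MotivatedByGoal and HasA are the starred entries whose triples are flipped before merging.

```python
# Merged relation type for each ConceptNet relation listed in Table 7.
RELATION_MERGE = {
    "AtLocation": "AtLocation", "LocatedNear": "AtLocation",
    "Causes": "Causes", "CausesDesire": "Causes", "MotivatedByGoal": "Causes",
    "Antonym": "Antonym", "DistinctFrom": "Antonym",
    "HasSubevent": "HasSubevent", "HasFirstSubevent": "HasSubevent",
    "HasLastSubevent": "HasSubevent", "HasPrerequisite": "HasSubevent",
    "Entails": "HasSubevent", "MannerOf": "HasSubevent",
    "IsA": "IsA", "InstanceOf": "IsA", "DefinedAs": "IsA",
    "PartOf": "PartOf", "HasA": "PartOf",
    "RelatedTo": "RelatedTo", "SimilarTo": "RelatedTo", "Synonym": "RelatedTo",
}
# Assumed starred entries of Table 7: these triples are reversed before merging.
REVERSED_BEFORE_MERGE = {"MotivatedByGoal", "HasA"}

def merge_triple(head, rel, tail):
    """Map a raw ConceptNet triple to its merged relation type."""
    if rel in REVERSED_BEFORE_MERGE:
        head, tail = tail, head          # e.g. (room, HasA, desk) -> (desk, PartOf, room)
    return head, RELATION_MERGE.get(rel, rel), tail  # unlisted relations kept unchanged
```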

B Dataset Split Specifications

                      Train    Dev    Test
CommonsenseQA (OF)    9,741    1,221  1,140
CommonsenseQA (IH)    8,500    1,221  1,241
OpenbookQA            4,957    500    500

Table 8: Numbers of instances in different dataset splits.

CommonsenseQA^7 and OpenbookQA^8 both host leaderboards, with their training and development sets publicly available. As the ground-truth test labels for CommonsenseQA are not readily available for model analysis, we take 1,241 examples from the official training set as our in-house test examples and regard the remaining 8,500 as our in-house training examples (CommonsenseQA (IH)).

7 https://www.tau-nlp.org/commonsenseqa
8 https://leaderboard.allenai.org/open_book_qa/submissions/public
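For illustration, the mechanics of carving an in-house test set out of the official training set look like the sketch below. It is a generic random split and will not reproduce the exact example IDs of the in-house split of Lin et al. (2019); the file path and JSON-lines format are assumptions.

```python
import json
import random

def make_inhouse_split(official_train_path, seed=0):
    """Illustrative only: split the 9,741 official training examples into 8,500/1,241."""
    with open(official_train_path) as f:
        examples = [json.loads(line) for line in f]   # assumes one JSON object per line
    random.Random(seed).shuffle(examples)
    ih_test, ih_train = examples[:1241], examples[1241:]
    assert len(ih_train) == 8500, "expects the 9,741-example official training set"
    return ih_train, ih_test
```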

C Implementation Details

                   CommonsenseQA    OpenbookQA
BERT-BASE          3 × 10^-5        -
BERT-LARGE         2 × 10^-5        -
ROBERTA-LARGE      1 × 10^-5        1 × 10^-5

Table 9: Learning rates for text encoders on different datasets.

             CommonsenseQA    OpenbookQA
RN           3 × 10^-4        3 × 10^-4
RGCN         1 × 10^-3        1 × 10^-3
GconAttn     3 × 10^-4        1 × 10^-4
MHGRN        1 × 10^-3        1 × 10^-3

Table 10: Learning rates for graph encoders on different datasets.

             Param
RN           399K
RGCN         365K
GconAttn     453K
MHGRN        544K

Table 11: Numbers of parameters of different graph encoders.

Our models are implemented in PyTorch. We use the cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA, and ROBERTA-LARGE on OpenbookQA, choosing a dataset-specific learning rate from {1 × 10^-5, 2 × 10^-5, 3 × 10^-5, 6 × 10^-5, 1 × 10^-4} for each text encoder based on the best development set performance, as listed in Table 9. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rate of the graph encoder is chosen from {1 × 10^-4, 3 × 10^-4, 1 × 10^-3, 3 × 10^-3} based on the best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 10. In training, we set the maximum input sequence length of text encoders to 64 and the batch size to 32, and we perform early stopping. AristoRoBERTaV7+MHGRN is the only exception: to ensure a fair comparison, we follow AristoRoBERTaV7 and set the batch size to 16 and the maximum input sequence length to 256, and choose a decoder learning rate from {1 × 10^-3, 2 × 10^-5}.

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in ConceptNet, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024d vector as its corresponding node feature. We use this set of features for all our implemented models.
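A minimal sketch of this feature-extraction step is given below, using the HuggingFace transformers API. The template format, the uncased BERT-large checkpoint, and the per-occurrence pooling are simplifying assumptions, not the exact released pipeline.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")   # checkpoint choice is an assumption
bert = BertModel.from_pretrained("bert-large-uncased").eval()

def triple_to_sentence(head, relation, tail):
    # Hypothetical template; the paper's exact templates are not specified here.
    return f"{head.replace('_', ' ')} {relation} {tail.replace('_', ' ')}."

@torch.no_grad()
def entity_node_features(triples, entities):
    """Mean-pool last-layer BERT token embeddings over each entity's occurrences."""
    sums = {e: torch.zeros(1024) for e in entities}   # 1024 = BERT-large hidden size
    counts = {e: 0 for e in entities}
    for head, rel, tail in triples:
        enc = tokenizer(triple_to_sentence(head, rel, tail), return_tensors="pt")
        hidden = bert(**enc).last_hidden_state[0]                       # (seq_len, 1024)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        for entity in (head, tail):
            if entity not in sums:
                continue
            pieces = tokenizer.tokenize(entity.replace("_", " "))
            # naive word-piece matching to locate the entity span in the sentence
            for start in range(len(tokens) - len(pieces) + 1):
                if tokens[start:start + len(pieces)] == pieces:
                    sums[entity] += hidden[start:start + len(pieces)].mean(dim=0)
                    counts[entity] += 1
    return {e: sums[e] / max(counts[e], 1) for e in entities}
```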

We use a two-layer RGCN and a single-layer MHGRN across our experiments.

The number of parameters for each graph encoder is listed in Table 11.
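The separate learning rates described above map naturally onto optimizer parameter groups. The sketch below is hedged: the module names are placeholders, and torch.optim.RAdam (available in recent PyTorch releases) stands in for the RAdam implementation of Liu et al. (2019a).

```python
import torch

def build_optimizer(text_encoder: torch.nn.Module, graph_encoder: torch.nn.Module):
    """One optimizer, two parameter groups with their own learning rates (cf. Tables 9-10)."""
    param_groups = [
        {"params": text_encoder.parameters(), "lr": 1e-5},   # e.g. ROBERTA-LARGE on CommonsenseQA
        {"params": graph_encoder.parameters(), "lr": 1e-3},  # e.g. MHGRN on CommonsenseQA
    ]
    return torch.optim.RAdam(param_groups)

# Stand-in modules just to show usage; real code would pass the actual encoders.
optimizer = build_optimizer(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
loss_fn = torch.nn.CrossEntropyLoss()
```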

D Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be rewritten in matrix form:

$$
Z_k = (D_k)^{-1} \sum_{(r_1,\dots,r_k)\in R^k} \beta(r_1,\dots,r_k,s)\cdot G A_{r_k}\cdots A_{r_1} F X \, {W^1_{r_1}}^{\top}\cdots {W^k_{r_k}}^{\top}\cdot {W^{k+1}_0}^{\top}\cdots {W^K_0}^{\top} \quad (1 \le k \le K) \tag{13}
$$

where $G = \mathrm{diag}\big(\exp\big([g(\phi(v_1), s), \dots, g(\phi(v_n), s)]\big)\big)$ ($F$ is similarly defined), $A_r$ is the adjacency matrix for relation $r$, and $D_k$ is defined as follows:

$$
D_k = \mathrm{diag}\Big(\sum_{(r_1,\dots,r_k)\in R^k} \beta(r_1,\dots,r_k,s)\cdot G A_{r_k}\cdots A_{r_1} F X \mathbf{1}\Big) \quad (1 \le k \le K) \tag{14}
$$

Using this matrix formulation, we can compute Eq. 7 with dynamic programming.

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: $s$, $X$, $A_r$ $(1 \le r \le m)$, $W^t_r$ $(r \in R,\ 1 \le t \le K)$, $F$, $G$, $\delta$, $\tau$
Output: $Z$

1:  $W_K \leftarrow I$
2:  for $k \leftarrow K-1$ to $1$ do
3:      $W_k \leftarrow W^{k+1}_0 W_{k+1}$
4:  end for
5:  for $r \in R$ do
6:      $M_r \leftarrow FX$
7:  end for
8:  for $k \leftarrow 1$ to $K$ do
9:      if $k > 1$ then
10:         for $r \in R$ do
11:             $M'_r \leftarrow e^{\delta(r,s)} A_r \big(\sum_{r'\in R} e^{\tau(r',r,s)} M_{r'}\big) W^k_r$
12:         end for
13:         for $r \in R$ do
14:             $M_r \leftarrow M'_r$
15:         end for
16:     else
17:         for $r \in R$ do
18:             $M_r \leftarrow e^{\delta(r,s)} \cdot A_r M_r W^k_r$
19:         end for
20:     end if
21:     $Z_k \leftarrow G \big(\sum_{r\in R} M_r\big) W_k$
22: end for
23: Replace $W^t_r$ $(0 \le r \le m,\ 1 \le t \le K)$ with identity matrices and $X$ with $\mathbf{1}$, and re-run lines 1–19 to compute $d_1, \dots, d_K$
24: for $k \leftarrow 1$ to $K$ do
25:     $Z_k \leftarrow (\mathrm{diag}(d_k))^{-1} Z_k$
26: end for
27: return $Z_1, Z_2, \dots, Z_K$
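The sketch below is a minimal PyTorch rendering of this recursion, assuming dense adjacency tensors and illustrative argument names. It folds the β weights of Eqs. 13–14 into per-hop scaling factors and, for brevity, omits the trailing shared transforms $W^{k+1}_0\cdots W^K_0$ accumulated in lines 1–4; it is not the released implementation.

```python
import torch

def multihop_message_passing(X, A, W, F_diag, G_diag, delta, tau, K):
    """Minimal sketch of Algorithm 1 with dense tensors (illustrative, simplified).

    X:      (n, d)       initial node features
    A:      (m, n, n)    A[r, i, j] = 1 if there is an edge j -> i with relation r
    W:      (K, m, d, d) hop- and relation-specific transformations W^k_r
    F_diag: (n,)         exp(f(phi(v_j), s)) source-node scores (diagonal of F)
    G_diag: (n,)         exp(g(phi(v_i), s)) target-node scores (diagonal of G)
    delta:  (m,)         context-aware single-hop relation scores delta(r, s)
    tau:    (m, m)       transition scores tau(r', r, s), indexed [r', r]
    Returns [Z_1, ..., Z_K], each of shape (n, d).
    """
    n, d = X.shape
    m = A.shape[0]
    # M[r]: messages whose latest hop used relation r. `ones` runs the same
    # recursion on all-ones inputs with identity transforms, yielding the
    # per-node normalizers D_k of Eq. 14 (lines 23-26 of Algorithm 1).
    M = (F_diag[:, None] * X).expand(m, n, d).clone()
    ones = (F_diag[:, None] * torch.ones(n, 1, dtype=X.dtype)).expand(m, n, 1).clone()
    scale = torch.exp(delta).view(m, 1, 1)
    outputs = []
    for k in range(K):
        if k == 0:
            M = scale * torch.bmm(A, M) @ W[k]
            ones = scale * torch.bmm(A, ones)
        else:
            M = scale * torch.bmm(A, torch.einsum("qr,qnd->rnd", torch.exp(tau), M)) @ W[k]
            ones = scale * torch.bmm(A, torch.einsum("qr,qnd->rnd", torch.exp(tau), ones))
        Z_k = G_diag[:, None] * M.sum(dim=0)                          # numerator of Eq. 13
        d_k = (G_diag[:, None] * ones.sum(dim=0)).clamp_min(1e-8)     # diagonal of D_k
        outputs.append(Z_k / d_k)
    return outputs
```

Because each hop reuses the running messages $M_r$ instead of enumerating relational paths, the total cost grows linearly in $K$.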

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

$$
\mathrm{KHopRN}(\mathcal{G}; W, E, H) = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i)\in \Phi_k \\ j\in Q,\ i\in A}} \beta(j, r_1, \dots, r_k, i)\cdot W\big(h_j \oplus (e_{r_1}\circ \cdots \circ e_{r_k}) \oplus h_i\big) \tag{15}
$$

where $\circ$ denotes the element-wise product and $\beta(\cdots) = 1 / \big(K|A| \cdot |\{(j', i) \in \mathcal{G} \mid j' \in Q\}|\big)$ defines the pooling weights.
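For intuition, Eq. 15 can be evaluated directly by enumerating relational paths from question to answer entities. The sketch below does exactly that with illustrative names; it uses uniform pooling over the enumerated paths instead of the exact β of Eq. 15, and its cost grows with the number of paths, which is what the dynamic program of Appendix D avoids.

```python
import torch

def k_hop_rn(h, edges, rel_emb, W, Q, A, K):
    """Naive K-hop Relation Network: enumerate paths of length 1..K from Q to A.

    h:       (n, d1) node embeddings          rel_emb: (m, d2) relation embeddings e_r
    edges:   list of (j, r, i) triples        W:       (d_out, 2*d1 + d2) linear map
    Q, A:    sets of question / answer node indices
    """
    out = {}                                   # out[j] = list of (r, i) outgoing edges
    for j, r, i in edges:
        out.setdefault(j, []).append((r, i))

    paths = []                                 # (j, [r_1..r_k], i) with j in Q, i in A
    def dfs(start, node, rels):
        if rels and node in A:
            paths.append((start, list(rels), node))
        if len(rels) == K:                     # depth bounded by K hops
            return
        for r, nxt in out.get(node, []):
            rels.append(r)
            dfs(start, nxt, rels)
            rels.pop()
    for q in Q:
        dfs(q, q, [])

    if not paths:
        return torch.zeros(W.shape[0])
    pooled = torch.zeros(W.shape[0])
    for j, rels, i in paths:
        e = rel_emb[rels[0]].clone()
        for r in rels[1:]:
            e = e * rel_emb[r]                 # element-wise product of relation embeddings
        pooled += W @ torch.cat([h[j], e, h[i]])
    return pooled / len(paths)                 # uniform pooling (simplification of beta)
```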

F Expressing K-hop RN with MHGRN

Theorem 1. Given any $W$, $E$, $H$, there exists a parameter setting such that the output of MHGRN becomes $\mathrm{KHopRN}(\mathcal{G}; W, E, H)$ for arbitrary $\mathcal{G}$.

Proof. Suppose $W = [W_1, W_2, W_3]$, where $W_1, W_3 \in \mathbb{R}^{d_3\times d_1}$ and $W_2 \in \mathbb{R}^{d_3\times d_2}$. For MHGRN, we set the parameters as follows: $H = H$, $U_* = [I; 0] \in \mathbb{R}^{(d_1+d_2)\times d_1}$, $b_* = [0; 1]^{\top} \in \mathbb{R}^{d_1+d_2}$, $W^t_r = \mathrm{diag}(1 \oplus e_r) \in \mathbb{R}^{(d_1+d_2)\times (d_1+d_2)}$ $(r \in R,\ 1 \le t \le K)$, $V = W_3 \in \mathbb{R}^{d_3\times d_1}$, and $V' = [W_1, W_2] \in \mathbb{R}^{d_3\times (d_1+d_2)}$. We disable the relation type attention module and enable message passing only from $Q$ to $A$. By further choosing $\sigma$ as the identity function and performing pooling over $A$, the output of MHGRN becomes

$$
\begin{aligned}
\frac{1}{|A|}\sum_{i\in A} h'_i
&= \frac{1}{|A|}\sum_{i\in A}\big(V h_i + V' z_i\big)\\
&= \frac{1}{K|A|}\sum_{i\in A}\sum_{k=1}^{K}\big(V h_i + V' z^k_i\big)\\
&= \sum_{k=1}^{K}\ \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}} \beta(j, r_1,\dots,r_k, i)\,\big(V h_i + V' W^k_{r_k}\cdots W^1_{r_1} x_j\big)\\
&= \sum_{k=1}^{K}\ \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}} \beta(\cdots)\,\big(V h_i + V' W^k_{r_k}\cdots W^1_{r_1} U_{\phi(j)} h_j + V' W^k_{r_k}\cdots W^1_{r_1} b_{\phi(j)}\big)\\
&= \sum_{k=1}^{K}\ \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}} \beta(\cdots)\,\big(W_3 h_i + W_1 h_j + W_2 (e_{r_1}\circ\cdots\circ e_{r_k})\big)\\
&= \sum_{k=1}^{K}\ \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}} \beta(\cdots)\, W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big)\\
&= \mathrm{KHopRN}(\mathcal{G}; W, E, H).
\end{aligned} \tag{16}
$$

ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor

Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR

Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics

Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics

Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR

Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system

Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet

Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics

John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942

Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743

Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics

Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics

Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311

Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics

Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics

Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics

Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051

Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former

Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the

9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008

Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics

Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational

Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Merging Types of Relations inConceptNet

Relation Merged Relation

AtLocationAtLocation

LocatedNear

CausesCausesCausesDesire

MotivatedByGoal

AntonymAntonym

DistinctFrom

HasSubevent

HasSubevent

HasFirstSubeventHasLastSubeventHasPrerequisite

EntailsMannerOf

IsAIsAInstanceOf

DefinedAs

PartOfPartOf

HasA

RelatedToRelatedToSimilarTo

Synonym

Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX

We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet

B Dataset Split Specifications

Train Dev Test

CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500

Table 8 Numbers of instances in different datasetsplits

CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development

7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_

book_qasubmissionspublic

set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))

C Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 9 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

MHGRN 1 times 10minus3

1 times 10minus3

Table 10 Learning rate for graph encoders on differentdatasets

Param

RN 399KRGCN 365KGconAttn 453KMHGRN 544K

Table 11 Numbers of parameters of different graph en-coders

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve

KG the learning rate of their graph encoders arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10

minus3 2 times 10

minus5For the input node features we first use tem-

plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models

We use 2-layer RGCN and single-layer MHGRNacross our experiments

The numbers of parameter for each graph en-coder are listed in Table 11

D Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can compute Eq7 using dynamic programming

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

F Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of

MultiGRN becomes

1

∣A∣ sumiisinA

hprimei

=1

∣A∣ (V hi + Vprimezi)

=1

K∣A∣K

sumk=1

(V hi + Vprimezki )

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)(V hi+

VprimeW

krk⋯W

1r1xj)

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ

β(⋯)(V hi + VprimeW

krk

⋯W1r1Uφ(j)hj + V

primeW

krk⋯W

1r1bφ(j))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)(W3hi + W1hj

+ W2(er1 ⋯ erk))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)W (hjoplus

(er1 ⋯ erk)oplus hi)= RN(G W E H)

(16)

Page 6: arXiv:2005.00646v1 [cs.CL] 1 May 20200 80 #node 3.1e3 #path K=2 0 80 #node 6.3e4 K=2 K=3 Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on Common-senseQA

Methods BERT-Base BERT-Large RoBERTa-Large

IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()

wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)

RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5727 (plusmn070) 5484 (plusmn088) 6317 (plusmn018) 5736 (plusmn090) 7261( plusmn039) 6859 (plusmn096)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 6908 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)

MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)

Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper

Methods Single Ensemble

UnifiedQAdagger (Khashabi et al 2020) 791 -

RoBERTadagger 721 725RoBERTa + KEDGNdagger 725 744RoBERTa + KEdagger 733 -RoBERTa + HyKAS 20dagger (Ma et al 2019) 732 -RoBERTa + FreeLBdagger(Zhu et al 2020) 722 731XLNet + DREAMdagger 669 733XLNet + GRdagger (Lv et al 2019) 753 -ALBERTdagger (Lan et al 2019) - 765

RoBERTa + MHGRN (K = 2) 754 765

Table 4 Performance comparison on official test ofCommonsenseQA with leaderboard SoTAs4 (accuracyin ) dagger indicates reported results on leaderboard Uni-fiedQA uses T5-11B as text encoder whose number ofparameters is about 30 times more than other models

graph with 34 relation types (details in AppendixA) To extract an informative contextualized graphG from KG we recognize entity mentions in s andlink them to entities in ConceptNet with which weinitialize our node set V We then add to V all theentities that appear in any two-hop paths betweenpairs of mentioned entities Unlike KagNet we donot perform any pruning but instead reserve all theedges between nodes in V forming our G

52 Datasets

We evaluate models on two multiple-choice ques-tion answering datasets CommonsenseQA andOpenbookQA Both require world knowledge be-yond textual understanding to perform well

CommonsenseQA (Talmor et al 2019) neces-sitates various commonsense reasoning skills Thequestions are created with entities from ConceptNetand they are designed to probe latent compositionalrelations between entities in ConceptNet

OpenBookQA (Mihaylov et al 2018) provides

4Models based on ConceptNet are no longer shown on theleaderboard and we got our results from the organizers

elementary science questions together with an openbook of science facts This dataset also probesgeneral common sense beyond the provided facts

53 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge orextra training data as opposed to external KG Inall our implemented models we use pre-trainedLMs as text encoders for s for fair comparisonWe do compare our models with those (Ma et al2019 Lv et al 2019 Khashabi et al 2020) aug-mented by other text-form external knowledge (egWikipedia) although we stick to our focus of en-coding structured KG

Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in sect3) RN5 (Eq 3 in sect3)KagNet (Eq 4 in sect3) and GconAttn (Wanget al 2019) as baselines GconAttn generalizesmatch-LSTM (Wang and Jiang 2016) and achievessuccess in language inference tasks

6 Results and Discussions

In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics

5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix E)

Methods Dev () Test ()

T5-3Bdagger (Raffel et al 2019) - 8320UnifiedQAdagger (Khashabi et al 2020) - 8720

RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6430 (plusmn099) 6190 (plusmn244)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)

AristoRoBERTaV7dagger 792 778+ MHGRN (K = 3) 786 806

Table 5 Performance comparison on OpenbookQAdagger indicates reported results on leaderboard T5-3B is 8times larger than our models UnifiedQA is 30x larger

61 Main ResultsFor CommonsenseQA (Table 3) we first use thein-house data split (IH) of Lin et al (2019) (cfAppendix B) to compare our models with imple-mented baselines This is different from the offi-cial split used in the leaderboard methods Almostall KG-augmented models achieve performancegain over vanilla pre-trained LMs demonstratingthe value of external knowledge on this dataset Ad-ditionally we evaluate our MHGRN (with text en-coder being ROBERTA-LARGE) on official splitOF (Table 4) for fair comparison with other meth-ods on leaderboard in both single-model settingand ensemble-model setting With backbone be-ing T5 (Raffel et al 2019) UnifiedQA (Khashabiet al 2020) tops the leaderboard Consideringits training cost we do not intend to compare ourROBERTA-based model with it We achieve thebest performances among other models

For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder MHGRN surpasses all implemented base-lines with an absolute increase of sim2 on TestAlso as our approach is naturally compatible withthe methods that utilize textual knowledge or ex-tra data because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) To empirically show MHGRNcan bring gain over textual-knowledge empow-ered systems we replace our text encoder withAristoRoBERTaV76 and fine-tune our MHGRNupon OpenbookQA Empirically MHGRN contin-ues to bring benefit to strong-performing textual-knowledge empowered systems This indicatestextual knowledge and structured knowledge canpotentially be complementary

6httpsleaderboardallenaiorgopen_book_qasubmissionblcp1tu91i4gm0vf484g

Methods IHdev-Acc ()

MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)

Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA

62 Performance Analysis

Ablation Study on Model Components Asshown in Table 6 disabling type-specific trans-formation results in sim 13 drop in performancedemonstrating the need for distinguishing nodetypes for QA tasks Our structured relational at-tention mechanism is also critical with its twosub-components contributing almost equally

Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders

20 40 60 80 100Percentage of Training Data

45

50

55

60

65

Accu

racy

()

MultiGRNKVMGconAttnwo KG

Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))

Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (Fig 6) Theincrease ofK continues to bring benefits untilK =

4 However performance begins to drop whenK gt 3 This might be attributed to exponentialnoise in longer relational paths in knowledge graph

65 70 75Accuracy ()

K=1

K=2

K=3

K=4

K=5

Reasoning Hops

7324

7393

7465

7365

7416

Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops

2 4 6 800

05

10

15

20

sba

tch

K Reasoning Hops

RGCNMHGRN

Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K

63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized

64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL

and the desired answer in the question

Figure 8: Case study on model interpretability. Two sample questions from CommonsenseQA with the reasoning paths output by MHGRN. For "Where is known for a multitude of wedding chapels?" (A. town, B. texas, C. city, D. church building, E. Nevada), the decoded path includes CHAPEL --AtLocation--> RENO --PartOf--> NEVADA. For "Why do parents encourage their kids to play baseball?" (A. round, B. cheap, C. break window, D. hard, E. fun to play), the decoded path includes BASEBALL --UsedFor⁻¹--> PLAY_BASEBALL --HasProperty--> FUN_TO_PLAY.

7 Related Work

Knowledge-Aware Methods for NLP. Various works have investigated the potential of empowering NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

The recent success of pre-trained LMs motivates many works (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and to the maximum input length of pre-trained LMs.

Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism into feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, these models only perform single-hop message passing and cannot be interpreted at the path level. Other works (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate, for each node, its K-hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and being interpretable via maintaining paths as reasoning chains.

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of ICML 2019, volume 97 of Proceedings of Machine Learning Research, pages 21–29. PMLR.
Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In Proceedings of ACL 2019, pages 6120–6129, Florence, Italy. Association for Computational Linguistics.
Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of EMNLP 2018, pages 4220–4230, Brussels, Belgium. Association for Computational Linguistics.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272. PMLR.
Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system.
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR 2017. OpenReview.net.
Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of ACL 2019, pages 2737–2747, Florence, Italy. Association for Computational Linguistics.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML 2001, pages 282–289. Morgan Kaufmann.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of EMNLP 2017, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.
Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.
Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.
Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In Proceedings of EMNLP 2015, pages 2007–2012, Lisbon, Portugal. Association for Computational Linguistics.
Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of EMNLP-IJCNLP 2019, pages 2829–2839, Hong Kong, China. Association for Computational Linguistics.
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.
Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22–32, Hong Kong, China. Association for Computational Linguistics.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of EMNLP 2018, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.
Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of ACL 2018, pages 821–832, Melbourne, Australia. Association for Computational Linguistics.
Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.
Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30, pages 4967–4976.
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of EMNLP-IJCNLP 2019, pages 4463–4473, Hong Kong, China. Association for Computational Linguistics.
Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web - ESWC 2018, volume 10843 of Lecture Notes in Computer Science, pages 593–607. Springer.
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of AAAI 2017, pages 4444–4451. AAAI Press.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of NAACL-HLT 2019, pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of ICLR 2018. OpenReview.net.
Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of NAACL-HLT 2016, pages 1442–1451, San Diego, California. Association for Computational Linguistics.
Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In Proceedings of AAAI 2019, pages 7208–7215. AAAI Press.
Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.
An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of ACL 2019, pages 2346–2357, Florence, Italy. Association for Computational Linguistics.
Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of ACL 2017, pages 1436–1446, Vancouver, Canada. Association for Computational Linguistics.
Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.
Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.
Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.

A Merging Types of Relations in ConceptNet

Relation                                              Merged Relation
AtLocation, LocatedNear                               AtLocation
Causes, CausesDesire, MotivatedByGoal                 Causes
Antonym, DistinctFrom                                 Antonym
HasSubevent, HasFirstSubevent, HasLastSubevent,
  HasPrerequisite, Entails, MannerOf                  HasSubevent
IsA, InstanceOf, DefinedAs                            IsA
PartOf, HasA*                                         PartOf
RelatedTo, SimilarTo, Synonym                         RelatedTo

Table 7: Relations in ConceptNet that are merged in pre-processing. Relation* indicates the reverse relation of Relation.

We merge relations that are close in semantics as well as in the general usage of triple instances in ConceptNet.
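To make the merge scheme easy to reuse, the mapping in Table 7 can be written down directly as data. The sketch below is a minimal Python rendering of the table; the dictionary and function names are ours, and the head/tail swap for the reversed HasA relation is one illustrative reading of the Relation* notation, not code from the released implementation.

```python
# Minimal sketch of the relation-merging scheme in Table 7.
# Keys are raw ConceptNet relations; values are the merged relation types.

RELATION_MERGE_MAP = {
    "AtLocation": "AtLocation", "LocatedNear": "AtLocation",
    "Causes": "Causes", "CausesDesire": "Causes", "MotivatedByGoal": "Causes",
    "Antonym": "Antonym", "DistinctFrom": "Antonym",
    "HasSubevent": "HasSubevent", "HasFirstSubevent": "HasSubevent",
    "HasLastSubevent": "HasSubevent", "HasPrerequisite": "HasSubevent",
    "Entails": "HasSubevent", "MannerOf": "HasSubevent",
    "IsA": "IsA", "InstanceOf": "IsA", "DefinedAs": "IsA",
    "PartOf": "PartOf", "HasA": "PartOf",   # HasA* in Table 7: reversed direction
    "RelatedTo": "RelatedTo", "SimilarTo": "RelatedTo", "Synonym": "RelatedTo",
}

def merge_triple(head, relation, tail):
    """Map a raw ConceptNet triple onto the merged relation set of Table 7."""
    merged = RELATION_MERGE_MAP.get(relation)
    if merged is None:
        return None  # relations outside Table 7 are handled by other preprocessing
    if relation == "HasA":  # reverse relation: (x, HasA, y) is kept as (y, PartOf, x)
        head, tail = tail, head
    return head, merged, tail
```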

B Dataset Split Specifications

                        Train    Dev      Test
CommonsenseQA (OF)      9,741    1,221    1,140
CommonsenseQA (IH)      8,500    1,221    1,241
OpenbookQA              4,957    500      500

Table 8: Numbers of instances in different dataset splits.

CommonsenseQA⁷ and OpenbookQA⁸ both have their leaderboards, with training and development sets publicly available. As the ground-truth labels for the CommonsenseQA test set are not readily available for model analysis, we take 1,241 examples from the official training examples as our in-house test examples and regard the remaining 8,500 ones as our in-house training examples (CommonsenseQA (IH)).

7 https://www.tau-nlp.org/commonsenseqa
8 https://leaderboard.allenai.org/open_book_qa/submissions/public
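As an illustration of how such an in-house split could be reproduced, a minimal Python sketch is given below; the fixed random seed and the function name are our own choices for the example, not the authors' released preprocessing script.

```python
import random

def make_inhouse_split(official_train_examples, n_test=1241, seed=0):
    """Split the 9,741 official CommonsenseQA training examples into
    8,500 in-house (IH) training examples and 1,241 IH test examples."""
    examples = list(official_train_examples)
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(examples)
    ih_test = examples[:n_test]
    ih_train = examples[n_test:]
    return ih_train, ih_test
```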

C Implementation Details

                     CommonsenseQA    OpenbookQA
BERT-BASE            3 × 10⁻⁵         -
BERT-LARGE           2 × 10⁻⁵         -
ROBERTA-LARGE        1 × 10⁻⁵         1 × 10⁻⁵

Table 9: Learning rate for text encoders on different datasets.

                     CommonsenseQA    OpenbookQA
RN                   3 × 10⁻⁴         3 × 10⁻⁴
RGCN                 1 × 10⁻³         1 × 10⁻³
GconAttn             3 × 10⁻⁴         1 × 10⁻⁴
MHGRN                1 × 10⁻³         1 × 10⁻³

Table 10: Learning rate for graph encoders on different datasets.

             #Param
RN           399K
RGCN         365K
GconAttn     453K
MHGRN        544K

Table 11: Numbers of parameters of different graph encoders.

Our models are implemented in PyTorch. We use cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA and ROBERTA-LARGE on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best performance on the development set, as listed in Table 9. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve the KG, the learning rate of the graph encoder is chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on the best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 10. In training, we set the maximum input sequence length of the text encoders to 64, the batch size to 32, and perform early stopping. AristoRoBERTaV7+MHGRN is the only exception: to ensure a fair comparison, we follow AristoRoBERTaV7 and set the batch size to 16, the max input sequence length to 256, and choose a decoder learning rate from {1 × 10⁻³, 2 × 10⁻⁵}.

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in ConceptNet, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024d vector as its corresponding node feature. We use this set of features for all our implemented models.
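A minimal sketch of this node-feature construction is given below, using the Hugging Face transformers API; the sentence template, the sub-sequence matching used to locate entity occurrences, and the pooling of span means are illustrative simplifications of the description above, not the released code.

```python
import torch
from collections import defaultdict
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased").eval()

def triple_to_sentence(head, relation, tail):
    # Hypothetical template; the actual templates are relation-specific.
    return f"{head.replace('_', ' ')} {relation} {tail.replace('_', ' ')}."

@torch.no_grad()
def build_node_features(triples, dim=1024):
    """Mean-pool last-layer BERT-Large token embeddings over every occurrence
    of each entity across all templated sentences (1024-d node features)."""
    sums = defaultdict(lambda: torch.zeros(dim))
    counts = defaultdict(int)
    for head, rel, tail in triples:
        enc = tokenizer(triple_to_sentence(head, rel, tail), return_tensors="pt")
        hidden = bert(**enc).last_hidden_state[0]          # (seq_len, 1024)
        ids = enc["input_ids"][0].tolist()
        for entity in (head, tail):
            ent_ids = tokenizer(entity.replace("_", " "),
                                add_special_tokens=False)["input_ids"]
            # naive sub-sequence search for the entity's token span
            for i in range(len(ids) - len(ent_ids) + 1):
                if ids[i:i + len(ent_ids)] == ent_ids:
                    sums[entity] += hidden[i:i + len(ent_ids)].mean(dim=0)
                    counts[entity] += 1
    return {e: sums[e] / counts[e] for e in counts}
```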

We use a 2-layer RGCN and a single-layer MHGRN across our experiments.

The numbers of parameters for each graph encoder are listed in Table 11.
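Putting the training setup together, a sketch of the two-learning-rate optimizer might look as follows; it assumes a model object exposing text_encoder and graph_encoder submodules and uses torch.optim.RAdam (available in recent PyTorch releases; the original implementation relied on a standalone RAdam package), with the rates taken from Tables 9 and 10.

```python
import torch

def build_optimizer(model, text_lr=1e-5, graph_lr=1e-3):
    """RAdam with separate learning rates for the text encoder
    (RoBERTa-Large, Table 9) and the graph encoder (MHGRN, Table 10)."""
    param_groups = [
        {"params": model.text_encoder.parameters(), "lr": text_lr},
        {"params": model.graph_encoder.parameters(), "lr": graph_lr},
    ]
    return torch.optim.RAdam(param_groups)

# Training uses cross-entropy loss over answer choices with early stopping:
# loss_fn = torch.nn.CrossEntropyLoss()
```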

D Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

$$
Z^k = (D^k)^{-1} \sum_{(r_1,\ldots,r_k)\in R^k} \beta(r_1,\ldots,r_k,s)\cdot G A_{r_k}\cdots A_{r_1} F X\, {W^{1}_{r_1}}^{\top}\cdots {W^{k}_{r_k}}^{\top} {W^{k+1}_{0}}^{\top}\cdots {W^{K}_{0}}^{\top} \quad (1\le k\le K) \tag{13}
$$

where $G = \mathrm{diag}\big(\exp\big([g(\phi(v_1),s),\ldots,g(\phi(v_n),s)]\big)\big)$ ($F$ is similarly defined), $A_r$ is the adjacency matrix for relation $r$, and $D^k$ is defined as follows:

$$
D^k = \mathrm{diag}\Big(\sum_{(r_1,\ldots,r_k)\in R^k} \beta(r_1,\ldots,r_k,s)\cdot G A_{r_k}\cdots A_{r_1} F X \mathbf{1}\Big) \quad (1\le k\le K) \tag{14}
$$

Using this matrix formulation, we can compute Eq. 7 using dynamic programming.

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W^t_r (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

1:  W^K ← I
2:  for k ← K − 1 to 1 do
3:      W^k ← W^{k+1}_0 W^{k+1}
4:  end for
5:  for r ∈ R do
6:      M_r ← F X
7:  end for
8:  for k ← 1 to K do
9:      if k > 1 then
10:         for r ∈ R do
11:             M'_r ← e^{δ(r,s)} A_r Σ_{r'∈R} e^{τ(r',r,s)} M_{r'} W^k_r
12:         end for
13:         for r ∈ R do
14:             M_r ← M'_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r,s)} · A_r M_r W^k_r
19:         end for
20:     end if
21:     Z^k ← G Σ_{r∈R} M_r W^k
22: end for
23: Replace W^t_r (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run lines 1–19 to compute d^1, ..., d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, ..., Z^K
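To make the recursion concrete, the following NumPy sketch keeps only the dynamic-programming skeleton of Algorithm 1: one running message per relation, so the sum over all m^k relation sequences in Eq. 13 costs O(K·m) matrix products instead of O(m^K) enumeration. The attention terms β, δ, τ are replaced by uniform weights and the per-hop transforms W^t_r by identities for clarity; this is an illustration of the idea, not the released implementation.

```python
import numpy as np

def multihop_message_passing(A, X, K):
    """Simplified dynamic-programming sketch of Algorithm 1.

    A : (m, n, n) stack of adjacency matrices, one per relation type.
    X : (n, d) initial node features.
    K : maximum number of hops.

    Returns [Z^1, ..., Z^K], where Z^k[i] averages the features propagated to
    node i over all k-hop relational paths (uniform attention, identity W^t_r).
    """
    m, n, _ = A.shape
    ones = np.ones((n, 1))
    Z, M, C = [], None, None
    for k in range(1, K + 1):
        prev_M = X if k == 1 else M.sum(axis=0)       # sum over the previous relation r'
        prev_C = ones if k == 1 else C.sum(axis=0)
        M = np.stack([A[r] @ prev_M for r in range(m)])   # messages, grouped by last relation
        C = np.stack([A[r] @ prev_C for r in range(m)])   # k-hop path counts (cf. D^k, Eq. 14)
        d_k = C.sum(axis=0)
        Z.append(M.sum(axis=0) / np.clip(d_k, 1.0, None))  # normalization as in Eq. 13
    return Z

# Example: multihop_message_passing((np.random.rand(3, 6, 6) > 0.7).astype(float),
#                                    np.random.rand(6, 4), K=3)
```

Restoring the attention terms amounts to reweighting prev_M per relation pair before multiplying by A_r, exactly as in lines 10–12 of Algorithm 1.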

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

$$
\mathrm{KHopRN}(\mathcal{G}; W, E, H) = \sum_{k=1}^{K} \sum_{\substack{(j,r_1,\ldots,r_k,i)\in\Phi_k \\ j\in Q,\, i\in A}} \beta(j,r_1,\ldots,r_k,i)\cdot W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big) \tag{15}
$$

where $\circ$ denotes element-wise product and $\beta(\cdots) = 1/\big(K|A|\cdot|\{(j',\ldots,i)\in\mathcal{G} \mid j'\in Q\}|\big)$ defines the pooling weights.
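For contrast with the efficient formulation in Appendix D, a direct (path-enumerating) reading of Eq. 15 can be sketched as follows; the path list, the handling of β, and the argument layout are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def khop_rn(paths, W, E, H, Q, A, K):
    """Brute-force K-hop Relation Network (Eq. 15).

    paths : list of (j, (r_1, ..., r_k), i) relational paths with 1 <= k <= K.
    W     : (d3, 2*d1 + d2) matrix applied to h_j ⊕ (e_r1 ∘ ... ∘ e_rk) ⊕ h_i.
    E     : (num_relations, d2) relation embeddings.
    H     : (n, d1) node features; Q and A are sets of question/answer node ids.
    """
    out = np.zeros(W.shape[0])
    for j, rels, i in paths:
        if j not in Q or i not in A or not (1 <= len(rels) <= K):
            continue
        e = np.ones(E.shape[1])
        for r in rels:                      # element-wise product of relation embeddings
            e = e * E[r]
        # beta: 1 / (K * |A| * number of paths from Q that end at node i)
        n_paths_to_i = sum(1 for jj, _, ii in paths if ii == i and jj in Q)
        beta = 1.0 / (K * len(A) * max(1, n_paths_to_i))
        out += beta * (W @ np.concatenate([H[j], e, H[i]]))
    return out
```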

F Expressing K-hop RN with MHGRN

Theorem 1. Given any $W, E, H$, there exists a parameter setting such that the output of the model becomes $\mathrm{KHopRN}(\mathcal{G}; W, E, H)$ for arbitrary $\mathcal{G}$.

Proof. Suppose $W = [W_1\ W_2\ W_3]$, where $W_1, W_3 \in \mathbb{R}^{d_3\times d_1}$ and $W_2 \in \mathbb{R}^{d_3\times d_2}$. For MHGRN, we set the parameters as follows: $H = H$, $U_* = [I; 0] \in \mathbb{R}^{(d_1+d_2)\times d_1}$, $b_* = [0; 1]^{\top} \in \mathbb{R}^{d_1+d_2}$, $W^t_r = \mathrm{diag}(1 \oplus e_r) \in \mathbb{R}^{(d_1+d_2)\times(d_1+d_2)}$ ($r\in R$, $1\le t\le K$), $V = W_3 \in \mathbb{R}^{d_3\times d_1}$, $V' = [W_1\ W_2] \in \mathbb{R}^{d_3\times(d_1+d_2)}$. We disable the relation type attention module and enable message passing only from $Q$ to $A$. By further choosing $\sigma$ as the identity function and performing pooling over $A$, we observe that the output of MHGRN becomes:

$$
\begin{aligned}
\frac{1}{|A|}\sum_{i\in A} h'_i
&= \frac{1}{|A|}\sum_{i\in A}\big(V h_i + V' z_i\big) \\
&= \frac{1}{K|A|}\sum_{i\in A}\sum_{k=1}^{K}\big(V h_i + V' z^k_i\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\ldots,r_k,i)\in\Phi_k \\ j\in Q,\, i\in A}} \beta(j,r_1,\ldots,r_k,i)\big(V h_i + V' W^k_{r_k}\cdots W^1_{r_1} x_j\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\ldots,r_k,i)\in\Phi_k \\ j\in Q,\, i\in A}} \beta(\cdots)\big(V h_i + V' W^k_{r_k}\cdots W^1_{r_1} U_{\phi(j)} h_j + V' W^k_{r_k}\cdots W^1_{r_1} b_{\phi(j)}\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\ldots,r_k,i)\in\Phi_k \\ j\in Q,\, i\in A}} \beta(\cdots)\big(W_3 h_i + W_1 h_j + W_2 (e_{r_1}\circ\cdots\circ e_{r_k})\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\ldots,r_k,i)\in\Phi_k \\ j\in Q,\, i\in A}} \beta(\cdots)\, W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big) \\
&= \mathrm{KHopRN}(\mathcal{G}; W, E, H)
\end{aligned} \tag{16}
$$




An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Merging Types of Relations in ConceptNet

Relation                                                         Merged Relation
AtLocation, LocatedNear                                          AtLocation
Causes, CausesDesire, *MotivatedByGoal                           Causes
Antonym, DistinctFrom                                            Antonym
HasSubevent, HasFirstSubevent, HasLastSubevent,
  HasPrerequisite, Entails, MannerOf                             HasSubevent
IsA, InstanceOf, DefinedAs                                       IsA
PartOf, *HasA                                                    PartOf
RelatedTo, SimilarTo, Synonym                                    RelatedTo

Table 7: Relations in ConceptNet that are merged in pre-processing. *RelationX indicates the reverse relation of RelationX.

We merge relations that are close in semantics as well as in the general usage of triple instances in ConceptNet.
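For reference, the merging in Table 7 amounts to a simple lookup over relation names. The dictionary below is an illustrative sketch of that mapping, not code from the released implementation; the * prefix marks relations whose triples are reversed before merging, following the table caption.

# Illustrative mapping reproducing Table 7; "*" marks a relation whose
# triples are flipped (head and tail swapped) before the merge is applied.
RELATION_MERGE = {
    "AtLocation": "AtLocation",       "LocatedNear": "AtLocation",
    "Causes": "Causes",               "CausesDesire": "Causes",
    "*MotivatedByGoal": "Causes",
    "Antonym": "Antonym",             "DistinctFrom": "Antonym",
    "HasSubevent": "HasSubevent",     "HasFirstSubevent": "HasSubevent",
    "HasLastSubevent": "HasSubevent", "HasPrerequisite": "HasSubevent",
    "Entails": "HasSubevent",         "MannerOf": "HasSubevent",
    "IsA": "IsA",                     "InstanceOf": "IsA",
    "DefinedAs": "IsA",
    "PartOf": "PartOf",               "*HasA": "PartOf",
    "RelatedTo": "RelatedTo",         "SimilarTo": "RelatedTo",
    "Synonym": "RelatedTo",
}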

B Dataset Split Specifications

                       Train    Dev     Test
CommonsenseQA (OF)     9,741    1,221   1,140
CommonsenseQA (IH)     8,500    1,221   1,241
OpenbookQA             4,957      500     500

Table 8: Numbers of instances in different dataset splits.

CommonsenseQA⁷ and OpenbookQA⁸ both have their own leaderboards, with training and development sets publicly available. As the ground-truth labels of the CommonsenseQA test set are not readily available for model analysis, we take 1,241 examples from the official training examples as our in-house test examples and regard the remaining 8,500 ones as our in-house training examples (CommonsenseQA (IH)).

⁷ https://www.tau-nlp.org/commonsenseqa
⁸ https://leaderboard.allenai.org/open_book_qa/submissions/public
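A minimal sketch of constructing this in-house split is shown below; the file name, JSONL layout, and random shuffling are assumptions made for illustration and are not taken from the released code.

import json
import random

# Hypothetical sketch: read the official CommonsenseQA training file and
# carve out 1,241 in-house test examples, keeping the remaining 8,500 for
# in-house training (matching Table 8).
random.seed(0)
with open("commonsenseqa_train.jsonl") as f:          # assumed file name
    official_train = [json.loads(line) for line in f]

random.shuffle(official_train)
ih_test, ih_train = official_train[:1241], official_train[1241:]
assert len(ih_test) == 1241 and len(ih_train) == 8500  # as in Table 8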

C Implementation Details

                   CommonsenseQA   OpenbookQA
BERT-BASE          3 × 10⁻⁵        -
BERT-LARGE         2 × 10⁻⁵        -
ROBERTA-LARGE      1 × 10⁻⁵        1 × 10⁻⁵

Table 9: Learning rate for text encoders on different datasets.

             CommonsenseQA   OpenbookQA
RN           3 × 10⁻⁴        3 × 10⁻⁴
RGCN         1 × 10⁻³        1 × 10⁻³
GconAttn     3 × 10⁻⁴        1 × 10⁻⁴
MHGRN        1 × 10⁻³        1 × 10⁻³

Table 10: Learning rate for graph encoders on different datasets.

             Param
RN           399K
RGCN         365K
GconAttn     453K
MHGRN        544K

Table 11: Numbers of parameters of different graph encoders.

Our models are implemented in PyTorch. We use the cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA and ROBERTA-LARGE on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best development set performance, as listed in Table 9. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that incorporate a KG, the learning rate of the graph encoder is chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on the best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 10. In training, we set the maximum input sequence length of the text encoder to 64 and the batch size to 32, and perform early stopping. AristoRoBERTaV7+MHGRN is the only exception: to ensure a fair comparison, we follow AristoRoBERTaV7 and set the batch size to 16 and the maximum input sequence length to 256, and choose a decoder learning rate from {1 × 10⁻³, 2 × 10⁻⁵}.
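As a concrete illustration of the separate learning rates, a minimal PyTorch sketch is given below. The encoder modules are placeholders (stand-ins for the actual RoBERTa and MHGRN modules), and it relies on the RAdam optimizer that ships with recent PyTorch releases rather than any particular implementation used for the paper.

import torch
from torch import nn

# Placeholders standing in for the actual text and graph encoders.
text_encoder = nn.Linear(1024, 1024)   # e.g. RoBERTa-large (placeholder)
graph_encoder = nn.Linear(200, 200)    # e.g. MHGRN (placeholder)

# Separate learning rates per parameter group (cf. Tables 9 and 10),
# optimized jointly with RAdam and a cross-entropy loss.
optimizer = torch.optim.RAdam([
    {"params": text_encoder.parameters(),  "lr": 1e-5},
    {"params": graph_encoder.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()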

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in ConceptNet, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024d vector as its corresponding node feature. We use this set of features for all our implemented models.
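The node-feature construction described above can be sketched with the Hugging Face transformers API as follows. The templated sentences, the entity string, and the token-matching logic are illustrative assumptions, not the exact pipeline used in the paper.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased").eval()

# Hypothetical templated sentences for two triples mentioning the entity "desk".
sentences = ["desk is located at schoolroom", "child is capable of sitting at desk"]
entity = "desk"

vectors = []
with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = bert(**enc).last_hidden_state[0]                    # (seq_len, 1024)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        ent_tokens = tokenizer.tokenize(entity)
        # Collect last-layer states of the entity's occurrences in this sentence.
        for i in range(len(tokens) - len(ent_tokens) + 1):
            if tokens[i:i + len(ent_tokens)] == ent_tokens:
                vectors.append(hidden[i:i + len(ent_tokens)].mean(dim=0))

# Mean pooling across all occurrences yields the 1024-dimensional node feature.
node_feature = torch.stack(vectors).mean(dim=0)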

We use a 2-layer RGCN and a single-layer MHGRN across our experiments. The number of parameters of each graph encoder is listed in Table 11.

D Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

Z^k = (D^k)^{-1} \sum_{(r_1,\dots,r_k)\in R^k} \beta(r_1,\dots,r_k,s) \cdot G A_{r_k} \cdots A_{r_1} F X (W^1_{r_1})^\top \cdots (W^k_{r_k})^\top (W^{k+1}_0)^\top \cdots (W^K_0)^\top,  (1 \le k \le K)   (13)

where G = \mathrm{diag}(\exp([g(\phi(v_1), s), \dots, g(\phi(v_n), s)])) (F is similarly defined), A_r is the adjacency matrix for relation r, and D^k is defined as follows:

D^k = \mathrm{diag}\Big( \sum_{(r_1,\dots,r_k)\in R^k} \beta(r_1,\dots,r_k,s) \cdot G A_{r_k} \cdots A_{r_1} F X \mathbf{1} \Big),  (1 \le k \le K)   (14)

Using this matrix formulation, we can compute Eq. 7 using dynamic programming:

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W^t_r (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

 1: W^K ← I
 2: for k ← K − 1 to 1 do
 3:     W^k ← W^{k+1}_0 W^{k+1}
 4: end for
 5: for r ∈ R do
 6:     M_r ← F X
 7: end for
 8: for k ← 1 to K do
 9:     if k > 1 then
10:         for r ∈ R do
11:             M'_r ← e^{δ(r,s)} A_r Σ_{r'∈R} e^{τ(r',r,s)} M_{r'} W^k_r
12:         end for
13:         for r ∈ R do
14:             M_r ← M'_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r,s)} · A_r M_r W^k_r
19:         end for
20:     end if
21:     Z^k ← G Σ_{r∈R} M_r W^k
22: end for
23: Replace W^t_r (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run line 1 – line 19 to compute d^1, …, d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, …, Z^K
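For concreteness, a minimal NumPy sketch of the recurrence in Algorithm 1 is given below. It is an illustration under simplifying assumptions rather than the released implementation: adjacency matrices are dense, the composed output transformations W^{k+1}_0 … W^K_0 are taken to be identity matrices, and the node and relation scores (F, G, δ, τ) are passed in as precomputed arrays.

import numpy as np

def multihop_message_passing(A, X, W, f, g, delta, tau, K):
    """Sketch of Algorithm 1 with the output transforms set to identity.
    A:     (m, n, n)    adjacency matrix per relation r
    X:     (n, d)       initial node features
    W:     (K, m, d, d) relation-specific transforms W_r^t
    f, g:  (n,)         node scores, i.e. the diagonals of F and G
    delta: (m,)         relation scores delta(r, s)
    tau:   (m, m)       relation-pair scores tau(r', r, s)
    """
    m, n, _ = A.shape

    def run(X_in, W_in):
        # M_r <- F X  (line 6)
        M = np.stack([f[:, None] * X_in for _ in range(m)])            # (m, n, d)
        Z = []
        for k in range(K):
            if k == 0:
                # M_r <- e^{delta(r,s)} A_r M_r W_r^1  (line 18)
                M = np.stack([np.exp(delta[r]) * A[r] @ M[r] @ W_in[k, r]
                              for r in range(m)])
            else:
                # M'_r <- e^{delta(r,s)} A_r sum_{r'} e^{tau(r',r,s)} M_{r'} W_r^k  (line 11)
                M = np.stack([np.exp(delta[r]) * A[r] @
                              sum(np.exp(tau[rp, r]) * M[rp] for rp in range(m)) @ W_in[k, r]
                              for r in range(m)])
            # Z^k <- G sum_r M_r  (line 21, with W^k taken as identity)
            Z.append(g[:, None] * M.sum(axis=0))
        return Z

    Z = run(X, W)
    # Re-run with identity weights and X = 1 to obtain the normalizers d^k (line 23).
    ones = np.ones((n, 1))
    identity_W = np.broadcast_to(np.eye(1), (K, m, 1, 1)).copy()
    d = run(ones, identity_W)
    # Z^k <- diag(d^k)^{-1} Z^k  (line 25)
    return [Zk / (dk + 1e-12) for Zk, dk in zip(Z, d)]

Because each hop only updates the m running messages M_r, the cost grows linearly in K rather than with the number of K-hop paths, which is the point of the matrix formulation above.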

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

\mathrm{KHopRN}(\mathcal{G}; \mathbf{W}, E, H) = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(j, r_1, \dots, r_k, i) \cdot \mathbf{W}\big( h_j \oplus (e_{r_1} \circ \cdots \circ e_{r_k}) \oplus h_i \big),   (15)

where \circ denotes element-wise product and \beta(\cdots) = 1 / \big( K|A| \cdot |\{(j', \dots, i) \in \mathcal{G} \mid j' \in Q\}| \big) defines the pooling weights.
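A brute-force reading of Eq. 15, useful for sanity-checking small graphs against the dynamic-programming formulation of Appendix D, might look like the sketch below. The path enumeration and the simplified pooling weight (an average over all enumerated paths) are illustrative assumptions, not the exact normalizer β defined above.

import numpy as np

def khop_rn(edges, Q, A_nodes, H, E, W, K):
    """Enumerate all 1..K-hop relational paths from question concepts Q to
    answer concepts A_nodes and pool W(h_j ⊕ (e_{r1} ∘ ... ∘ e_{rk}) ⊕ h_i).
    edges: list of (head, relation, tail) index triples
    H: (n, d1) node embeddings, E: (m, d2) relation embeddings
    W: (d3, 2*d1 + d2) output transform
    """
    # adjacency list: head -> [(relation, tail), ...]
    adj = {}
    for h, r, t in edges:
        adj.setdefault(h, []).append((r, t))

    # enumerate paths of length 1..K starting from Q
    paths = []                                   # (start, [relations], end)
    frontier = [(j, [], j) for j in Q]
    for _ in range(K):
        nxt = []
        for start, rels, node in frontier:
            for r, t in adj.get(node, []):
                p = (start, rels + [r], t)
                nxt.append(p)
                if t in A_nodes:
                    paths.append(p)
        frontier = nxt

    out = np.zeros(W.shape[0])
    for j, rels, i in paths:
        rel_feat = np.prod(E[rels], axis=0)      # element-wise product over the path
        out += W @ np.concatenate([H[j], rel_feat, H[i]])
    return out / max(len(paths), 1)              # simplified pooling weight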

F Expressing K-hop RN with MultiGRN

Theorem 1. Given any \mathbf{W}, E, H, there exists a parameter setting such that the output of the model becomes \mathrm{KHopRN}(\mathcal{G}; \mathbf{W}, E, H) for arbitrary \mathcal{G}.

Proof. Suppose \mathbf{W} = [W_1\ W_2\ W_3], where W_1, W_3 \in \mathbb{R}^{d_3 \times d_1} and W_2 \in \mathbb{R}^{d_3 \times d_2}. For MHGRN, we set the parameters as follows: H = H, U_* = [I; 0] \in \mathbb{R}^{(d_1+d_2) \times d_1}, b_* = [0; 1]^\top \in \mathbb{R}^{d_1+d_2}, W^t_r = \mathrm{diag}(1 \oplus e_r) \in \mathbb{R}^{(d_1+d_2) \times (d_1+d_2)} (r \in R, 1 \le t \le K), V = W_3 \in \mathbb{R}^{d_3 \times d_1}, V' = [W_1\ W_2] \in \mathbb{R}^{d_3 \times (d_1+d_2)}. We disable the relation type attention module and enable message passing only from Q to A. By further choosing \sigma as the identity function and performing pooling over A, we observe that the output of MultiGRN becomes

\frac{1}{|A|} \sum_{i \in A} h'_i
  = \frac{1}{|A|} \sum_{i \in A} (V h_i + V' z_i)
  = \frac{1}{K|A|} \sum_{i \in A} \sum_{k=1}^{K} (V h_i + V' z^k_i)
  = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(j, r_1, \dots, r_k, i) \big( V h_i + V' W^k_{r_k} \cdots W^1_{r_1} x_j \big)
  = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(\cdots) \big( V h_i + V' W^k_{r_k} \cdots W^1_{r_1} U_{\phi(j)} h_j + V' W^k_{r_k} \cdots W^1_{r_1} b_{\phi(j)} \big)
  = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(\cdots) \big( W_3 h_i + W_1 h_j + W_2 (e_{r_1} \circ \cdots \circ e_{r_k}) \big)
  = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\ i \in A}} \beta(\cdots) \mathbf{W} \big( h_j \oplus (e_{r_1} \circ \cdots \circ e_{r_k}) \oplus h_i \big)
  = \mathrm{RN}(\mathcal{G}; \mathbf{W}, E, H).   (16)

Page 9: arXiv:2005.00646v1 [cs.CL] 1 May 20200 80 #node 3.1e3 #path K=2 0 80 #node 6.3e4 K=2 K=3 Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on Common-senseQA

relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability

ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor

Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR

Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics

Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics

Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR

Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system

Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet

Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics

John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942

Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743

Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics

Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics

Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311

Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics

Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics

Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics

Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051

Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former

Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the

9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008

Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics

Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational

Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Merging Types of Relations inConceptNet

Relation Merged Relation

AtLocationAtLocation

LocatedNear

CausesCausesCausesDesire

MotivatedByGoal

AntonymAntonym

DistinctFrom

HasSubevent

HasSubevent

HasFirstSubeventHasLastSubeventHasPrerequisite

EntailsMannerOf

IsAIsAInstanceOf

DefinedAs

PartOfPartOf

HasA

RelatedToRelatedToSimilarTo

Synonym

Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX

We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet

B Dataset Split Specifications

Train Dev Test

CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500

Table 8 Numbers of instances in different datasetsplits

CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development

7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_

book_qasubmissionspublic

set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))

C Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 9 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

MHGRN 1 times 10minus3

1 times 10minus3

Table 10 Learning rate for graph encoders on differentdatasets

Param

RN 399KRGCN 365KGconAttn 453KMHGRN 544K

Table 11 Numbers of parameters of different graph en-coders

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve

KG the learning rate of their graph encoders arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10

minus3 2 times 10

minus5For the input node features we first use tem-

plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models

We use 2-layer RGCN and single-layer MHGRNacross our experiments

The numbers of parameter for each graph en-coder are listed in Table 11

D Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can compute Eq7 using dynamic programming

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

F Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of

MultiGRN becomes

1

∣A∣ sumiisinA

hprimei

=1

∣A∣ (V hi + Vprimezi)

=1

K∣A∣K

sumk=1

(V hi + Vprimezki )

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)(V hi+

VprimeW

krk⋯W

1r1xj)

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ

β(⋯)(V hi + VprimeW

krk

⋯W1r1Uφ(j)hj + V

primeW

krk⋯W

1r1bφ(j))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)(W3hi + W1hj

+ W2(er1 ⋯ erk))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)W (hjoplus

(er1 ⋯ erk)oplus hi)= RN(G W E H)

(16)

Page 10: arXiv:2005.00646v1 [cs.CL] 1 May 20200 80 #node 3.1e3 #path K=2 0 80 #node 6.3e4 K=2 K=3 Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on Common-senseQA

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311

Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics

Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics

Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics

Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051

Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former

Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the

9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008

Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics

Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational

Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Merging Types of Relations inConceptNet

Relation Merged Relation

AtLocationAtLocation

LocatedNear

CausesCausesCausesDesire

MotivatedByGoal

AntonymAntonym

DistinctFrom

HasSubevent

HasSubevent

HasFirstSubeventHasLastSubeventHasPrerequisite

EntailsMannerOf

IsAIsAInstanceOf

DefinedAs

PartOfPartOf

HasA

RelatedToRelatedToSimilarTo

Synonym

Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX

We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet

B Dataset Split Specifications

Train Dev Test

CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500

Table 8 Numbers of instances in different datasetsplits

CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development

7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_

book_qasubmissionspublic

set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))

C Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 9 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

MHGRN 1 times 10minus3

1 times 10minus3

Table 10 Learning rate for graph encoders on differentdatasets

Param

RN 399KRGCN 365KGconAttn 453KMHGRN 544K

Table 11 Numbers of parameters of different graph en-coders

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve

KG the learning rate of their graph encoders arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10

minus3 2 times 10

minus5For the input node features we first use tem-

plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models

We use 2-layer RGCN and single-layer MHGRNacross our experiments

The numbers of parameter for each graph en-coder are listed in Table 11

D Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can compute Eq7 using dynamic programming

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

F Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of

MultiGRN becomes

1

∣A∣ sumiisinA

hprimei

=1

∣A∣ (V hi + Vprimezi)

=1

K∣A∣K

sumk=1

(V hi + Vprimezki )

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)(V hi+

VprimeW

krk⋯W

1r1xj)

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ

β(⋯)(V hi + VprimeW

krk

⋯W1r1Uφ(j)hj + V

primeW

krk⋯W

1r1bφ(j))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)(W3hi + W1hj

+ W2(er1 ⋯ erk))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)W (hjoplus

(er1 ⋯ erk)oplus hi)= RN(G W E H)

(16)

Page 11: arXiv:2005.00646v1 [cs.CL] 1 May 20200 80 #node 3.1e3 #path K=2 0 80 #node 6.3e4 K=2 K=3 Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on Common-senseQA

Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Merging Types of Relations inConceptNet

Relation Merged Relation

AtLocationAtLocation

LocatedNear

CausesCausesCausesDesire

MotivatedByGoal

AntonymAntonym

DistinctFrom

HasSubevent

HasSubevent

HasFirstSubeventHasLastSubeventHasPrerequisite

EntailsMannerOf

IsAIsAInstanceOf

DefinedAs

PartOfPartOf

HasA

RelatedToRelatedToSimilarTo

Synonym

Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX

We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet

B Dataset Split Specifications

Train Dev Test

CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500

Table 8 Numbers of instances in different datasetsplits

CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development

7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_

book_qasubmissionspublic

set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))

C Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 9 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

MHGRN 1 times 10minus3

1 times 10minus3

Table 10 Learning rate for graph encoders on differentdatasets

Param

RN 399KRGCN 365KGconAttn 453KMHGRN 544K

Table 11 Numbers of parameters of different graph en-coders

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve

KG the learning rate of their graph encoders arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10

minus3 2 times 10

minus5For the input node features we first use tem-

plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models

We use 2-layer RGCN and single-layer MHGRNacross our experiments

The numbers of parameter for each graph en-coder are listed in Table 11

D Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can compute Eq7 using dynamic programming

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

F Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of

MultiGRN becomes

1

∣A∣ sumiisinA

hprimei

=1

∣A∣ (V hi + Vprimezi)

=1

K∣A∣K

sumk=1

(V hi + Vprimezki )

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)(V hi+

VprimeW

krk⋯W

1r1xj)

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ

β(⋯)(V hi + VprimeW

krk

⋯W1r1Uφ(j)hj + V

primeW

krk⋯W

1r1bφ(j))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)(W3hi + W1hj

+ W2(er1 ⋯ erk))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)W (hjoplus

(er1 ⋯ erk)oplus hi)= RN(G W E H)

(16)

Page 12: arXiv:2005.00646v1 [cs.CL] 1 May 20200 80 #node 3.1e3 #path K=2 0 80 #node 6.3e4 K=2 K=3 Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on Common-senseQA

A Merging Types of Relations inConceptNet

Relation Merged Relation

AtLocationAtLocation

LocatedNear

CausesCausesCausesDesire

MotivatedByGoal

AntonymAntonym

DistinctFrom

HasSubevent

HasSubevent

HasFirstSubeventHasLastSubeventHasPrerequisite

EntailsMannerOf

IsAIsAInstanceOf

DefinedAs

PartOfPartOf

HasA

RelatedToRelatedToSimilarTo

Synonym

Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX

We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet

B Dataset Split Specifications

Train Dev Test

CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500

Table 8 Numbers of instances in different datasetsplits

CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development

7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_

book_qasubmissionspublic

set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))

C Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 9 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

MHGRN 1 times 10minus3

1 times 10minus3

Table 10 Learning rate for graph encoders on differentdatasets

Param

RN 399KRGCN 365KGconAttn 453KMHGRN 544K

Table 11 Numbers of parameters of different graph en-coders

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve

KG the learning rate of their graph encoders arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10

minus3 2 times 10

minus5For the input node features we first use tem-

plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models

We use 2-layer RGCN and single-layer MHGRNacross our experiments

The numbers of parameter for each graph en-coder are listed in Table 11

D Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can compute Eq7 using dynamic programming

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: $s$, $X$, $A_r$ ($1 \le r \le m$), $W^t_r$ ($r \in R$, $1 \le t \le K$), $F$, $G$, $\delta$, $\tau$
Output: $Z^1, \ldots, Z^K$

1:  $W^K \leftarrow I$
2:  for $k \leftarrow K-1$ to $1$ do
3:      $W^k \leftarrow W^{k+1}_0 W^{k+1}$
4:  end for
5:  for $r \in R$ do
6:      $M_r \leftarrow FX$
7:  end for
8:  for $k \leftarrow 1$ to $K$ do
9:      if $k > 1$ then
10:         for $r \in R$ do
11:             $M'_r \leftarrow e^{\delta(r,s)} A_r \sum_{r' \in R} e^{\tau(r',r,s)} M_{r'} W^k_r$
12:         end for
13:         for $r \in R$ do
14:             $M_r \leftarrow M'_r$
15:         end for
16:     else
17:         for $r \in R$ do
18:             $M_r \leftarrow e^{\delta(r,s)} \cdot A_r M_r W^k_r$
19:         end for
20:     end if
21:     $Z^k \leftarrow G \sum_{r \in R} M_r W^k$
22: end for
23: Replace $W^t_r$ ($0 \le r \le m$, $1 \le t \le K$) with identity matrices and $X$ with $\mathbf{1}$, and re-run lines 1-19 to compute $d^1, \ldots, d^K$
24: for $k \leftarrow 1$ to $K$ do
25:     $Z^k \leftarrow (\mathrm{diag}(d^k))^{-1} Z^k$
26: end for
27: return $Z^1, Z^2, \ldots, Z^K$
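The sketch below mirrors the core recursion of Algorithm 1 (lines 5-22) in PyTorch: per-relation messages $M_r$ are updated hop by hop, so the cost grows with the number of hops and relations rather than with the number of relation sequences. The dense adjacency representation, the omission of the trailing $W_0$ products, and the omission of the $(D^k)^{-1}$ normalization are simplifications of this illustration, not the released implementation.

```python
import torch

def multihop_message_passing(X, A, W, delta, tau, F_diag, G_diag, K):
    """Linear-time DP for multi-hop message passing (lines 5-22 of Algorithm 1).

    X:      (n, d)       node features
    A:      (m, n, n)    dense adjacency matrix per relation (sparse in practice)
    W:      (K, m, d, d) per-hop, per-relation transforms; W[k, r] plays the role of W^{k+1}_r
    delta:  (m,)         relation attention scores delta(r, s) for a fixed context s
    tau:    (m, m)       transition scores tau(r', r, s)
    F_diag: (n,)         diagonal of F (attention over path-start nodes)
    G_diag: (n,)         diagonal of G (attention over path-end nodes)
    Returns the unnormalized [Z^1, ..., Z^K].
    """
    m = A.shape[0]
    FX = F_diag.unsqueeze(1) * X                 # F X
    M = FX.unsqueeze(0).repeat(m, 1, 1)          # M_r <- F X for every relation r

    Z = []
    for k in range(K):
        if k == 0:
            # M_r <- e^{delta(r,s)} A_r M_r W^1_r
            M = torch.stack([torch.exp(delta[r]) * A[r] @ M[r] @ W[k, r]
                             for r in range(m)])
        else:
            # M'_r <- e^{delta(r,s)} A_r sum_{r'} e^{tau(r',r,s)} M_{r'} W^{k+1}_r
            M = torch.stack([
                torch.exp(delta[r]) * A[r]
                @ sum(torch.exp(tau[rp, r]) * M[rp] for rp in range(m))
                @ W[k, r]
                for r in range(m)
            ])
        # Z^k <- G sum_r M_r
        Z.append(G_diag.unsqueeze(1) * M.sum(dim=0))
    return Z
```

Because each hop only mixes the $m$ per-relation messages, the per-hop cost is polynomial in $m$, whereas a direct evaluation of Eq. (13) would enumerate all $m^K$ relation sequences.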

E Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

$$\mathrm{KHopRN}(G; W, E, H) = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \ldots, r_k, i)\in \Phi_k \\ j\in Q,\ i\in A}} \beta(j, r_1, \ldots, r_k, i)\cdot W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big) \tag{15}$$

where $\circ$ denotes the element-wise product and $\beta(\cdots) = 1 / \big(K\,|A| \cdot |\{(j', \ldots, i) \in G \mid j' \in Q\}|\big)$ defines the pooling weights.
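For illustration, the naive sketch below evaluates Eq. (15) by explicit path enumeration; `paths` is a hypothetical pre-extracted list of $(j, [r_1, \ldots, r_k], i)$ tuples, and the element-wise product $\circ$ is realized with `*`. This exponential-cost enumeration is exactly what the dynamic program of Appendix D avoids.

```python
import torch

def k_hop_rn(paths, W, E, H, K, num_answer_concepts):
    """Naive evaluation of Eq. (15) by explicit path enumeration.

    paths: list of (j, [r_1, ..., r_k], i) tuples with k <= K, where j indexes a
           question concept and i an answer concept (assumed pre-extracted)
    W:     (d_out, 2*d_node + d_rel) linear map
    E:     (num_relations, d_rel)    relation embeddings
    H:     (n, d_node)               node embeddings
    """
    # Denominator of beta: for each answer concept i, the number of Q->i paths.
    paths_ending_at = {}
    for _, _, i in paths:
        paths_ending_at[i] = paths_ending_at.get(i, 0) + 1

    out = torch.zeros(W.shape[0])
    for j, rels, i in paths:
        rel_vec = E[rels[0]]
        for r in rels[1:]:
            rel_vec = rel_vec * E[r]             # element-wise product of relation embeddings
        feat = torch.cat([H[j], rel_vec, H[i]])  # h_j + (e_{r_1} o ... o e_{r_k}) + h_i, concatenated
        beta = 1.0 / (K * num_answer_concepts * paths_ending_at[i])
        out = out + beta * (W @ feat)
    return out
```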

F Expressing K-hop RN with MHGRN

Theorem 1. Given any $W$, $E$, $H$, there exists a parameter setting such that the output of the model becomes $\mathrm{KHopRN}(G; W, E, H)$ for arbitrary $G$.

Proof. Suppose $W = [W_1, W_2, W_3]$, where $W_1, W_3 \in \mathbb{R}^{d_3\times d_1}$ and $W_2 \in \mathbb{R}^{d_3\times d_2}$. For MHGRN, we set the parameters as follows: $H = H$, $U_* = [I; \mathbf{0}] \in \mathbb{R}^{(d_1+d_2)\times d_1}$, $b_* = [\mathbf{0}; \mathbf{1}]^\top \in \mathbb{R}^{d_1+d_2}$, $W^t_r = \mathrm{diag}(\mathbf{1} \oplus e_r) \in \mathbb{R}^{(d_1+d_2)\times(d_1+d_2)}$ ($r \in R$, $1 \le t \le K$), $V = W_3 \in \mathbb{R}^{d_3\times d_1}$, and $V' = [W_1, W_2] \in \mathbb{R}^{d_3\times(d_1+d_2)}$. We disable the relation-type attention module and enable message passing only from $Q$ to $A$. By further choosing $\sigma$ as the identity function and performing pooling over $A$, we observe that the output of MHGRN becomes

$$
\begin{aligned}
\frac{1}{|A|}\sum_{i\in A} h'_i
&= \frac{1}{|A|}\sum_{i\in A}\big(V h_i + V' z_i\big) \\
&= \frac{1}{K|A|}\sum_{i\in A}\sum_{k=1}^{K}\big(V h_i + V' z^k_i\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j, r_1, \ldots, r_k, i)\in\Phi_k \\ j\in Q,\ i\in A}} \beta(j, r_1, \ldots, r_k, i)\big(V h_i + V' W^k_{r_k}\cdots W^1_{r_1} x_j\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j, r_1, \ldots, r_k, i)\in\Phi_k \\ j\in Q,\ i\in A}} \beta(\cdots)\big(V h_i + V' W^k_{r_k}\cdots W^1_{r_1} U_{\phi(j)} h_j + V' W^k_{r_k}\cdots W^1_{r_1} b_{\phi(j)}\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j, r_1, \ldots, r_k, i)\in\Phi_k \\ j\in Q,\ i\in A}} \beta(\cdots)\big(W_3 h_i + W_1 h_j + W_2(e_{r_1}\circ\cdots\circ e_{r_k})\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j, r_1, \ldots, r_k, i)\in\Phi_k \\ j\in Q,\ i\in A}} \beta(\cdots)\, W\big(h_j\oplus(e_{r_1}\circ\cdots\circ e_{r_k})\oplus h_i\big)
= \mathrm{KHopRN}(G; W, E, H)
\end{aligned} \tag{16}
$$
