Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Scalable Multi-Hop Relational Reasoning forKnowledge-Aware Question Answering
Yanlin Fengclubslowast Xinyue Chenspadeslowast Bill Yuchen Linhearts Peifeng Wanghearts Jun Yanhearts Xiang Renhearts
fengyanlinpkueducn xinyuechandrewcmueduyuchenlin peifengw yanjun xiangrenuscedu
heartsUniversity of Southern CaliforniaclubsPeking University spadesCarnegie Mellon University
Abstract
Existing work on augmenting question an-swering (QA) models with external knowl-edge (eg knowledge graphs) either strug-gle to model multi-hop relations efficientlyor lack transparency into the modelrsquos predic-tion rationale In this paper we propose anovel knowledge-aware approach that equipspre-trained language models (PTLMs) with amulti-hop relational reasoning module namedmulti-hop graph relation network (MHGRN)It performs multi-hop multi-relational reason-ing over subgraphs extracted from externalknowledge graphs The proposed reasoningmodule unifies path-based reasoning methodsand graph neural networks to achieve better in-terpretability and scalability We also empiri-cally show its effectiveness and scalability onCommonsenseQA and OpenbookQA datasetsand interpret its behaviors with case studies1
1 Introduction
Many recently proposed question answering tasksrequire not only machine comprehension of thequestion and context but also relational reason-ing over entities (concepts) and their relationshipsby referencing external knowledge (Talmor et al2019 Sap et al 2019 Clark et al 2018 Mihaylovet al 2018) For example the question in Fig 1requires a model to perform relational reasoningover mentioned entities ie to infer latent rela-tions among the concepts CHILD SIT DESKSCHOOLROOM Background knowledge such as
ldquoa child is likely to appear in a schoolroomrdquo may notbe readily contained in the questions themselvesbut are commonsensical to humans
Despite the success of large-scale pre-trainedlanguage models (PTLMs) (Devlin et al 2019
lowast The first two authors contributed equally The majorwork was done when both authors interned at USC
1httpsgithubcomINK-USCMHGRN
Des
ires
Learn
Child
Schoolroom
Classroom
Sit
Desk
AtLo
cation
Where does a child likely sit at a desk
A Schoolroom B Furniture store C PatioD Office building E Library
Figure 1 Illustration of knowledge-aware QA Asample question from CommonsenseQA can be betteranswered if a relevant subgraph of ConceptNet is pro-vided as evidence Blue nodes correspond to entitiesmentioned in the question pink nodes correspond tothose in the answer The other nodes are some associ-ated entities introduced when extracting the subgraph⋆ indicates the correct answer
Liu et al 2019b) these models fall short of pro-viding interpretable predictions as the knowledgein their pre-training corpus is not explicitly statedbut rather is implicitly learned It is thus difficult torecover the evidence used in the reasoning process
This has led many to leverage knowledgegraphs (KGs) (Mihaylov and Frank 2018 Linet al 2019 Wang et al 2019 Yang et al 2019)KGs represent relational knowledge between en-tities with multi-relational edges for models toacquire Incorporating KGs brings the potentialof interpretable and trustworthy predictions asthe knowledge is now explicitly stated For ex-ample in Fig 1 the relational path (CHILD rarr
AtLocation rarr CLASSROOM rarr Synonym rarrSCHOOLROOM) naturally provides evidence forthe answer SCHOOLROOM
arX
iv2
005
0064
6v2
[cs
CL
] 1
8 Se
p 20
20
0 80node
31e3
pat
hK=2
0 80node
63e4K=2K=3
Figure 2 Number of K-hop relational paths wrtthe node count in extracted graphs on Common-senseQA Left The path count is polynomial wrt thenumber of nodes Right The path count is exponentialwrt the number of hops
A straightforward approach to leveraging aknowledge graph is to directly model these rela-tional paths KagNet (Lin et al 2019) and MH-PGM (Bauer et al 2018) model multi-hop relationsby extracting relational paths from KG and encod-ing them with sequence models Application ofattention mechanisms upon these relational pathscan further offer good interpretability Howeverthese models are hardly scalable because the num-ber of possible paths in a graph is (1) polynomialwrt the number of nodes (2) exponential wrt thepath length (see Fig 2) Therefore some (Weis-senborn et al 2017 Mihaylov and Frank 2018)resort to only using one-hop paths namely triplesto balance scalability and reasoning capacities
Graph neural networks (GNNs) in contrast en-joy better scalability via their message passing for-mulation but usually lack transparency The mostcommonly used GNNsrsquo variant Graph Convolu-tional Networks (GCNs) (Kipf and Welling 2017)perform message passing by aggregating neighbor-hood information for each node but ignore the rela-tion types RGCNs (Schlichtkrull et al 2018) gen-eralize GCNs by performing relation-specific ag-gregation making it applicable to multi-relationalgraphs However these models do not distinguishthe importance of different neighbors or relationtypes and thus cannot provide explicit relationalpaths for model behavior interpretation
In this paper we propose a novel graph encod-ing architecture Multi-hop Graph Relation Net-work (MHGRN) which combines the strengths ofpath-based models and GNNs Our model inheritsscalability from GNNs by preserving the messagepassing formulation It also enjoys interpretabilityof path-based models by incorporating structuredrelational attention mechanism Our key motivationis to perform multi-hop message passing within a
GCN RGCN KagNet MHGRN
Multi-Relational Encoding 7 3 3 3Interpretable 7 7 3 3
Scalable wrt node 3 3 7 3Scalable wrt hop 3 3 7 3
Table 1 Properties of our MHGRN and other repre-sentative models for graph encoding
single layer to allow each node to directly attend toits multi-hop neighbours towards multi-hop rela-tional reasoning We outline the favorable featuresof knowledge-aware QA models in Table 1 andcompare MHGRN with them
We summarize the main contributions of thiswork as follows 1) We propose MHGRN a novelmodel architecture tailored to multi-hop relationalreasoning which explicitly models multi-hop rela-tional paths at scale 2) We propose a structuredrelational attention mechanism for efficient and in-terpretable modeling of multi-hop reasoning pathsalong with its training and inference algorithms 3)We conduct extensive experiments on two questionanswering datasets and show that our models bringsignificant improvements compared to knowledge-agnostic PTLMs and outperform other graph en-coding methods by a large margin
2 Problem Formulation and Overview
In this paper we limit the scope to the task ofmultiple-choice question answering although itcan be easily generalized to other knowledge-guided tasks (eg natural language inference) Theoverall paradigm of knowledge-aware QA is illus-trated in Fig 3 Formally given an external knowl-edge graph (KG) as the knowledge source and aquestion q our goal is to identify the correct answerfrom a set C of given options We turn this probleminto measuring the plausibility score between q andeach option a isin C then selecting the one with thehighest plausibility score
Denote the vector representations of question qand option a as q and a To measure the scorefor q and a we first concatenate them to forma statement vector s = [qa] Then we extractfrom the external KG a subgraph G (ie schemagraph in KagNet (Lin et al 2019)) with the guid-ance of s (detailed in sect51) This contextualizedsubgraph is defined as a multi-relational graphG = (V E φ) Here V is a subset of entity inthe external KG containing only those relevant tos E sube V timesR times V is the set of edges that connectnodes in V where R = 1⋯m are ids of all
PlausibilityScore
Graph ExtractionFrom KG
Graph Encoder
Text Encoder
120022
119956119954
119938
Figure 3 Overview of the knowledge-aware QAframework It integrates the output from graph en-coder (for relational reasoning over contextual sub-graphs) and text encoder (for textual understanding) togenerate the plausibility score for an answer option
pre-defined relation types The mapping functionφ(i) ∶ V rarr T = EqEaEo takes node i isin V asinput and outputs Eq if i is an entity mentioned inq Ea if it is mentioned in a or Eo otherwise Wefinally encode the statement to s G to g concate-nate s and g for calculating the plausibility score
3 Background Multi-Relational GraphEncoding Methods
We leave encoding of s to pre-trained languagemodels and focus on the challenge of encodinggraph G to capture latent relations between enti-ties Current methods for encoding multi-relationalgraphs mainly fall into two categories GNNs andpath-based models GNNs encode structured in-formation by passing messages between nodes di-rectly operating on the graph structure while path-based methods first decompose the graph into pathsand then pool features over them
Graph Encoding with GNNs For a graph withn nodes a graph neural network (GNN) takes aset of node features h1h2 hn as input andcomputes their corresponding node embeddingshprime1hprime2 hprimen via message passing (Gilmeret al 2017) A compact graph representation forG can thus be obtained by pooling over the nodeembeddings hprimei
GNN(G) = Pool(hprime1hprime2 hprimen) (1)
As a notable variant of GNNs graph convo-lutional networks (GCNs) (Kipf and Welling2017) additionally update node embeddings byaggregating messages from its direct neighborsRGCNs (Schlichtkrull et al 2018) extend GCNs toencode multi-relational graphs by defining relation-
specific weight matrix Wr for each edge type
hprimei = σ
⎛⎜⎝(sumrisinR
∣N ri ∣)
minus1
sumrisinR
sumjisinN r
i
Wrhj⎞⎟⎠ (2)
where N ri denotes neighbors of node i under rela-
tion r2
While GNNs have proved to have good scala-bility their reasoning is done at the node levelmaking them incompatible with modeling pathsThis property also hinders the modelrsquos decisionsfrom being interpretable at the path levelGraph Encoding with Path-Based Models Inaddition to directly modeling the graph with GNNsa graph can also be viewed as a set of relationalpaths connecting pairs of entities
Relation Networks (RNs) (Santoro et al 2017)can be adapted to multi-relational graph encodingunder QA settings RNs use MLPs to encode alltriples (one-hop paths) in G whose head entity isin Q = j ∣ φ(j) = Eq and tail entity is inA = i ∣ φ(i) = Ea It then pools the tripleembeddings to generate a vector for G as follows
RN(G) = Pool(MLP(hj oplus eroplus
hi) ∣ j isin Q i isin A (j r i) isin E) (3)
Here hj and hi are features for nodes j and i er isthe embedding of relation r isin R oplus denotes vectorconcatenation
To further equip RN with the ability to modelnondegenerate paths KagNet (Lin et al 2019)adopts LSTMs to encode all paths connecting ques-tion entities and answer entities with lengths nomore than K It then aggregates all path embed-dings via attention mechanism
KAGNET(G) = Pool(LSTM(j r1 rk i) ∣
(j r1 j1)⋯ (jkminus1 rk i) isin E 1 le k le K)(4)
4 Proposed Method Multi-Hop GraphRelation Network (MHGRN)
This section presents Multi-hop Graph RelationNetwork (MHGRN) a novel GNN architecturethat unifies both GNNs and path-based modelsMHGRN inherits path-level reasoning and inter-pretabilty from path-based models while preserv-ing good scalability of GNNs
2For simplicity we assume a single graph convolutional
AttentivePooling
2 Multi-hop Message Passing
Node Features
Text Encodereg BERT
1 Node Type Specific Transform
3 Nonlinear Activation
timesLayer
PlausibilityScore
MLP
120022 119956
AggregateMessages
120630(119947 119955120783 hellip 119955119948 119946)
Structured Relational Att
Learn
Child
Child CapableOf UsedForminus1 AtLocation
Desires UsedForminus1 Synonym
AtLocation Synonym
Sit AtLocationUsedForminus1
Desk AtLocation
hellipMulti-hop Message Passing
Figure 4 Our proposed MHGRN architecturefor relational reasoning MHGRN takes a multi-relational graph G and a (question-answer) statementvector s as input and outputs a scalar that represent theplausibility score of this statement
41 MHGRN Model ArchitectureWe follow the GNN framework introduced in sect3where node features can be initialized with pre-trained weights (details in Appendix C) Here wefocus on the computation of node embeddings
Type-Specific Transformation To make ourmodel aware of the node type φ we first performnode type specific linear transformation on the in-put node features
xi = Uφ(i)hi + bφ(i) (5)
where the learnable parameters U and b are specificto the type of node i
Multi-Hop Message Passing As mentioned be-fore our motivation is to endow GNNs with thecapability of directly modeling paths To this endwe propose to pass messages directly over all therelational paths of lengths up to K The set of validk-hop relational paths is defined as
Φk = (j r1 rk i) ∣ (j r1 j1)⋯ (jkminus1 rk i) isin E (1 le k le K) (6)
We perform k-hop (1 le k le K) message passingover these paths which is a generalization of the
layer In practice multiple layers are stacked to enable mes-sage passing from multi-hop neighbors
single-hop message passing in RGCNs (see Eq 2)
zki = sum
(jr1rki)isinΦk
α(j r1 rk i)dki sdotWK0
⋯Wk+10 W
krk⋯W
1r1xj (1 le k le K) (7)
where the Wtr (1 le t le K 0 le r le m) matrices
are learnable3 α(j r1 rk i) is an attentionscore elaborated in sect42 and d
ki = sum(j⋯i)isinΦk
α(j⋯i)is the normalization factor The W k
rk⋯W1r1 ∣ 1 le
r1 rk le m matrices can be interpreted as thelow rank approximation of a mtimes⋯timesmktimesdtimesdtensor that assigns a separate transformation foreach k-hop relation where d is the dim of xi
Incoming messages from paths of differentlengths are aggregated via attention mecha-nism (Vaswani et al 2017)
zi =K
sumk=1
softmax(bilinear(s zki )) sdot zki (8)
Non-linear Activation Finally we apply shortcutconnection and nonlinear activation to obtain theoutput node embeddings
hprimei = σ (V hi + V
primezi) (9)
where V and Vprime are learnable model parameters
and σ is a non-linear activation function
42 Structured Relational AttentionHere we work towards effectively parameterizingthe attention score α(j r1 rk i) in Eq 7 for allk-hop paths without introducing O(mk) parametersWe first regard it as the probability of a relationsequence (φ(j) r1 rk φ(i)) conditioned on s
α(j r1 rk i) = p (φ(j) r1 rk φ(i) ∣ s) (10)
which can naturally be modeled by a probabilisticgraphical model such as conditional random field(Lafferty et al 2001)
p (⋯ ∣ s)prop exp(f(φ(j) s) +k
sumt=1
δ(rt s)
+kminus1
sumt=1
τ(rt rt+1) + g(φ(i) s))
∆= β(r1 rk s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIuml
Relation Type Attention
sdot γ(φ(j) φ(i) s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIumlNode Type Attention
(11)
3W
t0(0 le t le K) are introduced as padding matrices
so that K transformations are applied regardless of k thusensuring comparable scale of zki across different k
Model Time Space
G is a dense graph
K-hop KagNet O (mKnK+1
K) O (mKnK+1
K)K-layer RGCN O (mn2
K) O (mnK)MHGRN O (m2
n2K) O (mnK)
G is a sparse graph with maximum node degree ∆ ≪ n
K-hop KagNet O (mKnK∆
K) O (mKnK∆
K)K-layer RGCN O (mnK∆) O (mnK)MHGRN O (m2
nK∆) O (mnK)
Table 2 Computation complexity of different K-hopreasoning models on a densesparse multi-relationalgraph with n nodes and m relation types Despite thequadratic complexity wrt m MHGRNrsquos time cost issimilar to RGCN on GPUs with parallelizable matrixmultiplications (cf Fig 7)
where f(sdot) δ(sdot) and g(sdot) are parameterized bytwo-layer MLPs and τ(sdot) by a transition matrixof shape m timesm Intuitively β(sdot) models the im-portance of a k-hop relation while γ(sdot) models theimportance of messages from node type φ(j) toφ(i) (eg the model can learn to pass messagesonly from question entities to answer entities)
Our model scores a k-hop relation by decom-posing it into both context-aware single-hop re-lations (modeled by δ) and two-hop relations(modeled by τ ) We argue that τ is indis-pensable without which the model may assignhigh importance to illogical multi-hop relations(eg [AtLocation CapableOf]) or noisy re-lations (eg [RelatedTo RelatedTo])
43 Computation Complexity Analysis
Although message passing process in Eq 7 andattention module in Eq11 handles potentially ex-ponential number of paths it can be computed inlinear time with dynamic programming (see Ap-pendix D) As summarized in Table 2 time com-plexity and space complexity of MHGRN on asparse graph are both linear wrt either the maxi-mum path length K or the number of nodes n
44 Expressive Power of MHGRN
In addition to efficiency and scalability we now dis-cuss the modeling capacity of MHGRN With themessage passing formulation and relation-specifictransformations it is by nature the generalizationof RGCN It is also capable of directly modelingpaths making it interpretable as are path-basedmodels like RN and KagNet To show this we
first generalize RN (Eq 3) to the multi-hop set-ting and introduce K-hop RN (formal definition inAppendix E) which models multi-hop relation asthe composition of single-hop relations We showthat MHGRN is capable of representingK-hop RN(proof in Appendix F)
45 Learning Inference and Path DecodingWe now discuss the learning and inference processof MHGRN instantiated for QA tasks Followingthe problem formulation in sect2 we aim to deter-mine the plausibility of an answer option a isin Cgiven the question q with the information fromboth text s and graph G We first obtain the graphrepresentation g by performing attentive poolingover the output node embeddings of answer entitieshprimei ∣ i isin A Next we concatenate it with the textrepresentation s and compute the plausibility scoreby ρ(qa) = MLP(soplus g)
During training we maximize the plausibilityscore of the correct answer a by minimizing thecross-entropy loss
L = EqaC [minus logexp(ρ(q a))
sumaisinC exp(ρ(qa))] (12)
The whole model is trained end-to-end jointly withthe text encoder (eg RoBERTa)
During inference we predict the most plau-sible answer by argmaxaisinC ρ(qa) Addition-ally we can decode a reasoning path as evidencefor model predictions endowing our model withthe interpretability enjoyed by path-based mod-els Specifically we first determine the answerentity ilowast with the highest score in the pooling layerand the path length klowast with the highest score inEq 8 Then the reasoning path is decoded byargmax α(j r1 rklowast i
lowast) which can be com-puted in linear time using dynamic programming
5 Experimental SetupWe introduce how we construct G (sect51) thedatasets (sect52) as well as the baseline methods(sect53) Appendix C shows more implementationand experimental details for reproducibility
51 Extracting G from External KGWe use ConceptNet (Speer et al 2017) a general-domain knowledge graph as our external KG to testmodelsrsquo ability to harness structured knowledgesource Following KagNet (Lin et al 2019) wemerge relation types to increase graph density andadd reverse relations to construct a multi-relational
Methods BERT-Base BERT-Large RoBERTa-Large
IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()
wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)
RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5727 (plusmn070) 5484 (plusmn088) 6317 (plusmn018) 5736 (plusmn090) 7261( plusmn039) 6859 (plusmn096)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 6908 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)
MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)
Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper
Methods Single Ensemble
UnifiedQAdagger (Khashabi et al 2020) 791 -
RoBERTadagger 721 725RoBERTa + KEDGNdagger 725 744RoBERTa + KEdagger 733 -RoBERTa + HyKAS 20dagger (Ma et al 2019) 732 -RoBERTa + FreeLBdagger(Zhu et al 2020) 722 731XLNet + DREAMdagger 669 733XLNet + GRdagger (Lv et al 2019) 753 -ALBERTdagger (Lan et al 2019) - 765
RoBERTa + MHGRN (K = 2) 754 765
Table 4 Performance comparison on official test ofCommonsenseQA with leaderboard SoTAs4 (accuracyin ) dagger indicates reported results on leaderboard Uni-fiedQA uses T5-11B as text encoder whose number ofparameters is about 30 times more than other models
graph with 34 relation types (details in AppendixA) To extract an informative contextualized graphG from KG we recognize entity mentions in s andlink them to entities in ConceptNet with which weinitialize our node set V We then add to V all theentities that appear in any two-hop paths betweenpairs of mentioned entities Unlike KagNet we donot perform any pruning but instead reserve all theedges between nodes in V forming our G
52 Datasets
We evaluate models on two multiple-choice ques-tion answering datasets CommonsenseQA andOpenbookQA Both require world knowledge be-yond textual understanding to perform well
CommonsenseQA (Talmor et al 2019) neces-sitates various commonsense reasoning skills Thequestions are created with entities from ConceptNetand they are designed to probe latent compositionalrelations between entities in ConceptNet
OpenBookQA (Mihaylov et al 2018) provides
4Models based on ConceptNet are no longer shown on theleaderboard and we got our results from the organizers
elementary science questions together with an openbook of science facts This dataset also probesgeneral common sense beyond the provided facts
53 Compared Methods
We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge orextra training data as opposed to external KG Inall our implemented models we use pre-trainedLMs as text encoders for s for fair comparisonWe do compare our models with those (Ma et al2019 Lv et al 2019 Khashabi et al 2020) aug-mented by other text-form external knowledge (egWikipedia) although we stick to our focus of en-coding structured KG
Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in sect3) RN5 (Eq 3 in sect3)KagNet (Eq 4 in sect3) and GconAttn (Wanget al 2019) as baselines GconAttn generalizesmatch-LSTM (Wang and Jiang 2016) and achievessuccess in language inference tasks
6 Results and Discussions
In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics
5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix E)
Methods Dev () Test ()
T5-3Bdagger (Raffel et al 2019) - 8320UnifiedQAdagger (Khashabi et al 2020) - 8720
RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6430 (plusmn099) 6190 (plusmn244)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)
AristoRoBERTaV7dagger 792 778+ MHGRN (K = 3) 786 806
Table 5 Performance comparison on OpenbookQAdagger indicates reported results on leaderboard T5-3B is 8times larger than our models UnifiedQA is 30x larger
61 Main ResultsFor CommonsenseQA (Table 3) we first use thein-house data split (IH) of Lin et al (2019) (cfAppendix B) to compare our models with imple-mented baselines This is different from the offi-cial split used in the leaderboard methods Almostall KG-augmented models achieve performancegain over vanilla pre-trained LMs demonstratingthe value of external knowledge on this dataset Ad-ditionally we evaluate our MHGRN (with text en-coder being ROBERTA-LARGE) on official splitOF (Table 4) for fair comparison with other meth-ods on leaderboard in both single-model settingand ensemble-model setting With backbone be-ing T5 (Raffel et al 2019) UnifiedQA (Khashabiet al 2020) tops the leaderboard Consideringits training cost we do not intend to compare ourROBERTA-based model with it We achieve thebest performances among other models
For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder MHGRN surpasses all implemented base-lines with an absolute increase of sim2 on TestAlso as our approach is naturally compatible withthe methods that utilize textual knowledge or ex-tra data because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) To empirically show MHGRNcan bring gain over textual-knowledge empow-ered systems we replace our text encoder withAristoRoBERTaV76 and fine-tune our MHGRNupon OpenbookQA Empirically MHGRN contin-ues to bring benefit to strong-performing textual-knowledge empowered systems This indicatestextual knowledge and structured knowledge canpotentially be complementary
6httpsleaderboardallenaiorgopen_book_qasubmissionblcp1tu91i4gm0vf484g
Methods IHdev-Acc ()
MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)
Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA
62 Performance Analysis
Ablation Study on Model Components Asshown in Table 6 disabling type-specific trans-formation results in sim 13 drop in performancedemonstrating the need for distinguishing nodetypes for QA tasks Our structured relational at-tention mechanism is also critical with its twosub-components contributing almost equally
Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders
20 40 60 80 100Percentage of Training Data
45
50
55
60
65
Accu
racy
()
MultiGRNKVMGconAttnwo KG
Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))
Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (Fig 6) Theincrease ofK continues to bring benefits untilK =
4 However performance begins to drop whenK gt 3 This might be attributed to exponentialnoise in longer relational paths in knowledge graph
65 70 75Accuracy ()
K=1
K=2
K=3
K=4
K=5
Reasoning Hops
7324
7393
7465
7365
7416
Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops
2 4 6 800
05
10
15
20
sba
tch
K Reasoning Hops
RGCNMHGRN
Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K
63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized
64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL
and the desired answer in the question
7 Related WorkKnowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017
Where is known for a multitude of wedding chapels
A town B texasC city D church building E Nevada
RenoChapel
PartOf
AtLocation
UsedForminus1 Synonym
AtLocation Synonym
Neveda
Baseball
Play_Baseball
HasProperty
UsedForminus1
Fun_to_Play
Why do parents encourage their kids to play baseball
A round B cheap C break window D hardE fun to play Fun_to_Play
HasProperty
Reno
Chapel
AtLocation
Neveda
PartOf
MultiGRNFigure 8 Case study on model interpretability Wepresent two sample questions from CommonsenseQAwith the reasoning paths output by MHGRN
Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding
Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs
Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains
8 Conclusion
We present a principled scalable method MHGRNthat can leverage general knowledge via multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hop
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
0 80node
31e3
pat
hK=2
0 80node
63e4K=2K=3
Figure 2 Number of K-hop relational paths wrtthe node count in extracted graphs on Common-senseQA Left The path count is polynomial wrt thenumber of nodes Right The path count is exponentialwrt the number of hops
A straightforward approach to leveraging aknowledge graph is to directly model these rela-tional paths KagNet (Lin et al 2019) and MH-PGM (Bauer et al 2018) model multi-hop relationsby extracting relational paths from KG and encod-ing them with sequence models Application ofattention mechanisms upon these relational pathscan further offer good interpretability Howeverthese models are hardly scalable because the num-ber of possible paths in a graph is (1) polynomialwrt the number of nodes (2) exponential wrt thepath length (see Fig 2) Therefore some (Weis-senborn et al 2017 Mihaylov and Frank 2018)resort to only using one-hop paths namely triplesto balance scalability and reasoning capacities
Graph neural networks (GNNs) in contrast en-joy better scalability via their message passing for-mulation but usually lack transparency The mostcommonly used GNNsrsquo variant Graph Convolu-tional Networks (GCNs) (Kipf and Welling 2017)perform message passing by aggregating neighbor-hood information for each node but ignore the rela-tion types RGCNs (Schlichtkrull et al 2018) gen-eralize GCNs by performing relation-specific ag-gregation making it applicable to multi-relationalgraphs However these models do not distinguishthe importance of different neighbors or relationtypes and thus cannot provide explicit relationalpaths for model behavior interpretation
In this paper we propose a novel graph encod-ing architecture Multi-hop Graph Relation Net-work (MHGRN) which combines the strengths ofpath-based models and GNNs Our model inheritsscalability from GNNs by preserving the messagepassing formulation It also enjoys interpretabilityof path-based models by incorporating structuredrelational attention mechanism Our key motivationis to perform multi-hop message passing within a
GCN RGCN KagNet MHGRN
Multi-Relational Encoding 7 3 3 3Interpretable 7 7 3 3
Scalable wrt node 3 3 7 3Scalable wrt hop 3 3 7 3
Table 1 Properties of our MHGRN and other repre-sentative models for graph encoding
single layer to allow each node to directly attend toits multi-hop neighbours towards multi-hop rela-tional reasoning We outline the favorable featuresof knowledge-aware QA models in Table 1 andcompare MHGRN with them
We summarize the main contributions of thiswork as follows 1) We propose MHGRN a novelmodel architecture tailored to multi-hop relationalreasoning which explicitly models multi-hop rela-tional paths at scale 2) We propose a structuredrelational attention mechanism for efficient and in-terpretable modeling of multi-hop reasoning pathsalong with its training and inference algorithms 3)We conduct extensive experiments on two questionanswering datasets and show that our models bringsignificant improvements compared to knowledge-agnostic PTLMs and outperform other graph en-coding methods by a large margin
2 Problem Formulation and Overview
In this paper we limit the scope to the task ofmultiple-choice question answering although itcan be easily generalized to other knowledge-guided tasks (eg natural language inference) Theoverall paradigm of knowledge-aware QA is illus-trated in Fig 3 Formally given an external knowl-edge graph (KG) as the knowledge source and aquestion q our goal is to identify the correct answerfrom a set C of given options We turn this probleminto measuring the plausibility score between q andeach option a isin C then selecting the one with thehighest plausibility score
Denote the vector representations of question qand option a as q and a To measure the scorefor q and a we first concatenate them to forma statement vector s = [qa] Then we extractfrom the external KG a subgraph G (ie schemagraph in KagNet (Lin et al 2019)) with the guid-ance of s (detailed in sect51) This contextualizedsubgraph is defined as a multi-relational graphG = (V E φ) Here V is a subset of entity inthe external KG containing only those relevant tos E sube V timesR times V is the set of edges that connectnodes in V where R = 1⋯m are ids of all
PlausibilityScore
Graph ExtractionFrom KG
Graph Encoder
Text Encoder
120022
119956119954
119938
Figure 3 Overview of the knowledge-aware QAframework It integrates the output from graph en-coder (for relational reasoning over contextual sub-graphs) and text encoder (for textual understanding) togenerate the plausibility score for an answer option
pre-defined relation types The mapping functionφ(i) ∶ V rarr T = EqEaEo takes node i isin V asinput and outputs Eq if i is an entity mentioned inq Ea if it is mentioned in a or Eo otherwise Wefinally encode the statement to s G to g concate-nate s and g for calculating the plausibility score
3 Background Multi-Relational GraphEncoding Methods
We leave encoding of s to pre-trained languagemodels and focus on the challenge of encodinggraph G to capture latent relations between enti-ties Current methods for encoding multi-relationalgraphs mainly fall into two categories GNNs andpath-based models GNNs encode structured in-formation by passing messages between nodes di-rectly operating on the graph structure while path-based methods first decompose the graph into pathsand then pool features over them
Graph Encoding with GNNs For a graph withn nodes a graph neural network (GNN) takes aset of node features h1h2 hn as input andcomputes their corresponding node embeddingshprime1hprime2 hprimen via message passing (Gilmeret al 2017) A compact graph representation forG can thus be obtained by pooling over the nodeembeddings hprimei
GNN(G) = Pool(hprime1hprime2 hprimen) (1)
As a notable variant of GNNs graph convo-lutional networks (GCNs) (Kipf and Welling2017) additionally update node embeddings byaggregating messages from its direct neighborsRGCNs (Schlichtkrull et al 2018) extend GCNs toencode multi-relational graphs by defining relation-
specific weight matrix Wr for each edge type
hprimei = σ
⎛⎜⎝(sumrisinR
∣N ri ∣)
minus1
sumrisinR
sumjisinN r
i
Wrhj⎞⎟⎠ (2)
where N ri denotes neighbors of node i under rela-
tion r2
While GNNs have proved to have good scala-bility their reasoning is done at the node levelmaking them incompatible with modeling pathsThis property also hinders the modelrsquos decisionsfrom being interpretable at the path levelGraph Encoding with Path-Based Models Inaddition to directly modeling the graph with GNNsa graph can also be viewed as a set of relationalpaths connecting pairs of entities
Relation Networks (RNs) (Santoro et al 2017)can be adapted to multi-relational graph encodingunder QA settings RNs use MLPs to encode alltriples (one-hop paths) in G whose head entity isin Q = j ∣ φ(j) = Eq and tail entity is inA = i ∣ φ(i) = Ea It then pools the tripleembeddings to generate a vector for G as follows
RN(G) = Pool(MLP(hj oplus eroplus
hi) ∣ j isin Q i isin A (j r i) isin E) (3)
Here hj and hi are features for nodes j and i er isthe embedding of relation r isin R oplus denotes vectorconcatenation
To further equip RN with the ability to modelnondegenerate paths KagNet (Lin et al 2019)adopts LSTMs to encode all paths connecting ques-tion entities and answer entities with lengths nomore than K It then aggregates all path embed-dings via attention mechanism
KAGNET(G) = Pool(LSTM(j r1 rk i) ∣
(j r1 j1)⋯ (jkminus1 rk i) isin E 1 le k le K)(4)
4 Proposed Method Multi-Hop GraphRelation Network (MHGRN)
This section presents Multi-hop Graph RelationNetwork (MHGRN) a novel GNN architecturethat unifies both GNNs and path-based modelsMHGRN inherits path-level reasoning and inter-pretabilty from path-based models while preserv-ing good scalability of GNNs
2For simplicity we assume a single graph convolutional
AttentivePooling
2 Multi-hop Message Passing
Node Features
Text Encodereg BERT
1 Node Type Specific Transform
3 Nonlinear Activation
timesLayer
PlausibilityScore
MLP
120022 119956
AggregateMessages
120630(119947 119955120783 hellip 119955119948 119946)
Structured Relational Att
Learn
Child
Child CapableOf UsedForminus1 AtLocation
Desires UsedForminus1 Synonym
AtLocation Synonym
Sit AtLocationUsedForminus1
Desk AtLocation
hellipMulti-hop Message Passing
Figure 4 Our proposed MHGRN architecturefor relational reasoning MHGRN takes a multi-relational graph G and a (question-answer) statementvector s as input and outputs a scalar that represent theplausibility score of this statement
41 MHGRN Model ArchitectureWe follow the GNN framework introduced in sect3where node features can be initialized with pre-trained weights (details in Appendix C) Here wefocus on the computation of node embeddings
Type-Specific Transformation To make ourmodel aware of the node type φ we first performnode type specific linear transformation on the in-put node features
xi = Uφ(i)hi + bφ(i) (5)
where the learnable parameters U and b are specificto the type of node i
Multi-Hop Message Passing As mentioned be-fore our motivation is to endow GNNs with thecapability of directly modeling paths To this endwe propose to pass messages directly over all therelational paths of lengths up to K The set of validk-hop relational paths is defined as
Φk = (j r1 rk i) ∣ (j r1 j1)⋯ (jkminus1 rk i) isin E (1 le k le K) (6)
We perform k-hop (1 le k le K) message passingover these paths which is a generalization of the
layer In practice multiple layers are stacked to enable mes-sage passing from multi-hop neighbors
single-hop message passing in RGCNs (see Eq 2)
zki = sum
(jr1rki)isinΦk
α(j r1 rk i)dki sdotWK0
⋯Wk+10 W
krk⋯W
1r1xj (1 le k le K) (7)
where the Wtr (1 le t le K 0 le r le m) matrices
are learnable3 α(j r1 rk i) is an attentionscore elaborated in sect42 and d
ki = sum(j⋯i)isinΦk
α(j⋯i)is the normalization factor The W k
rk⋯W1r1 ∣ 1 le
r1 rk le m matrices can be interpreted as thelow rank approximation of a mtimes⋯timesmktimesdtimesdtensor that assigns a separate transformation foreach k-hop relation where d is the dim of xi
Incoming messages from paths of differentlengths are aggregated via attention mecha-nism (Vaswani et al 2017)
zi =K
sumk=1
softmax(bilinear(s zki )) sdot zki (8)
Non-linear Activation Finally we apply shortcutconnection and nonlinear activation to obtain theoutput node embeddings
hprimei = σ (V hi + V
primezi) (9)
where V and Vprime are learnable model parameters
and σ is a non-linear activation function
42 Structured Relational AttentionHere we work towards effectively parameterizingthe attention score α(j r1 rk i) in Eq 7 for allk-hop paths without introducing O(mk) parametersWe first regard it as the probability of a relationsequence (φ(j) r1 rk φ(i)) conditioned on s
α(j r1 rk i) = p (φ(j) r1 rk φ(i) ∣ s) (10)
which can naturally be modeled by a probabilisticgraphical model such as conditional random field(Lafferty et al 2001)
p (⋯ ∣ s)prop exp(f(φ(j) s) +k
sumt=1
δ(rt s)
+kminus1
sumt=1
τ(rt rt+1) + g(φ(i) s))
∆= β(r1 rk s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIuml
Relation Type Attention
sdot γ(φ(j) φ(i) s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIumlNode Type Attention
(11)
3W
t0(0 le t le K) are introduced as padding matrices
so that K transformations are applied regardless of k thusensuring comparable scale of zki across different k
Model Time Space
G is a dense graph
K-hop KagNet O (mKnK+1
K) O (mKnK+1
K)K-layer RGCN O (mn2
K) O (mnK)MHGRN O (m2
n2K) O (mnK)
G is a sparse graph with maximum node degree ∆ ≪ n
K-hop KagNet O (mKnK∆
K) O (mKnK∆
K)K-layer RGCN O (mnK∆) O (mnK)MHGRN O (m2
nK∆) O (mnK)
Table 2 Computation complexity of different K-hopreasoning models on a densesparse multi-relationalgraph with n nodes and m relation types Despite thequadratic complexity wrt m MHGRNrsquos time cost issimilar to RGCN on GPUs with parallelizable matrixmultiplications (cf Fig 7)
where f(sdot) δ(sdot) and g(sdot) are parameterized bytwo-layer MLPs and τ(sdot) by a transition matrixof shape m timesm Intuitively β(sdot) models the im-portance of a k-hop relation while γ(sdot) models theimportance of messages from node type φ(j) toφ(i) (eg the model can learn to pass messagesonly from question entities to answer entities)
Our model scores a k-hop relation by decom-posing it into both context-aware single-hop re-lations (modeled by δ) and two-hop relations(modeled by τ ) We argue that τ is indis-pensable without which the model may assignhigh importance to illogical multi-hop relations(eg [AtLocation CapableOf]) or noisy re-lations (eg [RelatedTo RelatedTo])
43 Computation Complexity Analysis
Although message passing process in Eq 7 andattention module in Eq11 handles potentially ex-ponential number of paths it can be computed inlinear time with dynamic programming (see Ap-pendix D) As summarized in Table 2 time com-plexity and space complexity of MHGRN on asparse graph are both linear wrt either the maxi-mum path length K or the number of nodes n
44 Expressive Power of MHGRN
In addition to efficiency and scalability we now dis-cuss the modeling capacity of MHGRN With themessage passing formulation and relation-specifictransformations it is by nature the generalizationof RGCN It is also capable of directly modelingpaths making it interpretable as are path-basedmodels like RN and KagNet To show this we
first generalize RN (Eq 3) to the multi-hop set-ting and introduce K-hop RN (formal definition inAppendix E) which models multi-hop relation asthe composition of single-hop relations We showthat MHGRN is capable of representingK-hop RN(proof in Appendix F)
45 Learning Inference and Path DecodingWe now discuss the learning and inference processof MHGRN instantiated for QA tasks Followingthe problem formulation in sect2 we aim to deter-mine the plausibility of an answer option a isin Cgiven the question q with the information fromboth text s and graph G We first obtain the graphrepresentation g by performing attentive poolingover the output node embeddings of answer entitieshprimei ∣ i isin A Next we concatenate it with the textrepresentation s and compute the plausibility scoreby ρ(qa) = MLP(soplus g)
During training we maximize the plausibilityscore of the correct answer a by minimizing thecross-entropy loss
L = EqaC [minus logexp(ρ(q a))
sumaisinC exp(ρ(qa))] (12)
The whole model is trained end-to-end jointly withthe text encoder (eg RoBERTa)
During inference we predict the most plau-sible answer by argmaxaisinC ρ(qa) Addition-ally we can decode a reasoning path as evidencefor model predictions endowing our model withthe interpretability enjoyed by path-based mod-els Specifically we first determine the answerentity ilowast with the highest score in the pooling layerand the path length klowast with the highest score inEq 8 Then the reasoning path is decoded byargmax α(j r1 rklowast i
lowast) which can be com-puted in linear time using dynamic programming
5 Experimental SetupWe introduce how we construct G (sect51) thedatasets (sect52) as well as the baseline methods(sect53) Appendix C shows more implementationand experimental details for reproducibility
51 Extracting G from External KGWe use ConceptNet (Speer et al 2017) a general-domain knowledge graph as our external KG to testmodelsrsquo ability to harness structured knowledgesource Following KagNet (Lin et al 2019) wemerge relation types to increase graph density andadd reverse relations to construct a multi-relational
Methods BERT-Base BERT-Large RoBERTa-Large
IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()
wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)
RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5727 (plusmn070) 5484 (plusmn088) 6317 (plusmn018) 5736 (plusmn090) 7261( plusmn039) 6859 (plusmn096)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 6908 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)
MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)
Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper
Methods Single Ensemble
UnifiedQAdagger (Khashabi et al 2020) 791 -
RoBERTadagger 721 725RoBERTa + KEDGNdagger 725 744RoBERTa + KEdagger 733 -RoBERTa + HyKAS 20dagger (Ma et al 2019) 732 -RoBERTa + FreeLBdagger(Zhu et al 2020) 722 731XLNet + DREAMdagger 669 733XLNet + GRdagger (Lv et al 2019) 753 -ALBERTdagger (Lan et al 2019) - 765
RoBERTa + MHGRN (K = 2) 754 765
Table 4 Performance comparison on official test ofCommonsenseQA with leaderboard SoTAs4 (accuracyin ) dagger indicates reported results on leaderboard Uni-fiedQA uses T5-11B as text encoder whose number ofparameters is about 30 times more than other models
graph with 34 relation types (details in AppendixA) To extract an informative contextualized graphG from KG we recognize entity mentions in s andlink them to entities in ConceptNet with which weinitialize our node set V We then add to V all theentities that appear in any two-hop paths betweenpairs of mentioned entities Unlike KagNet we donot perform any pruning but instead reserve all theedges between nodes in V forming our G
52 Datasets
We evaluate models on two multiple-choice ques-tion answering datasets CommonsenseQA andOpenbookQA Both require world knowledge be-yond textual understanding to perform well
CommonsenseQA (Talmor et al 2019) neces-sitates various commonsense reasoning skills Thequestions are created with entities from ConceptNetand they are designed to probe latent compositionalrelations between entities in ConceptNet
OpenBookQA (Mihaylov et al 2018) provides
4Models based on ConceptNet are no longer shown on theleaderboard and we got our results from the organizers
elementary science questions together with an openbook of science facts This dataset also probesgeneral common sense beyond the provided facts
53 Compared Methods
We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge orextra training data as opposed to external KG Inall our implemented models we use pre-trainedLMs as text encoders for s for fair comparisonWe do compare our models with those (Ma et al2019 Lv et al 2019 Khashabi et al 2020) aug-mented by other text-form external knowledge (egWikipedia) although we stick to our focus of en-coding structured KG
Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in sect3) RN5 (Eq 3 in sect3)KagNet (Eq 4 in sect3) and GconAttn (Wanget al 2019) as baselines GconAttn generalizesmatch-LSTM (Wang and Jiang 2016) and achievessuccess in language inference tasks
6 Results and Discussions
In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics
5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix E)
Methods Dev () Test ()
T5-3Bdagger (Raffel et al 2019) - 8320UnifiedQAdagger (Khashabi et al 2020) - 8720
RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6430 (plusmn099) 6190 (plusmn244)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)
AristoRoBERTaV7dagger 792 778+ MHGRN (K = 3) 786 806
Table 5 Performance comparison on OpenbookQAdagger indicates reported results on leaderboard T5-3B is 8times larger than our models UnifiedQA is 30x larger
61 Main ResultsFor CommonsenseQA (Table 3) we first use thein-house data split (IH) of Lin et al (2019) (cfAppendix B) to compare our models with imple-mented baselines This is different from the offi-cial split used in the leaderboard methods Almostall KG-augmented models achieve performancegain over vanilla pre-trained LMs demonstratingthe value of external knowledge on this dataset Ad-ditionally we evaluate our MHGRN (with text en-coder being ROBERTA-LARGE) on official splitOF (Table 4) for fair comparison with other meth-ods on leaderboard in both single-model settingand ensemble-model setting With backbone be-ing T5 (Raffel et al 2019) UnifiedQA (Khashabiet al 2020) tops the leaderboard Consideringits training cost we do not intend to compare ourROBERTA-based model with it We achieve thebest performances among other models
For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder MHGRN surpasses all implemented base-lines with an absolute increase of sim2 on TestAlso as our approach is naturally compatible withthe methods that utilize textual knowledge or ex-tra data because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) To empirically show MHGRNcan bring gain over textual-knowledge empow-ered systems we replace our text encoder withAristoRoBERTaV76 and fine-tune our MHGRNupon OpenbookQA Empirically MHGRN contin-ues to bring benefit to strong-performing textual-knowledge empowered systems This indicatestextual knowledge and structured knowledge canpotentially be complementary
6httpsleaderboardallenaiorgopen_book_qasubmissionblcp1tu91i4gm0vf484g
Methods IHdev-Acc ()
MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)
Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA
62 Performance Analysis
Ablation Study on Model Components Asshown in Table 6 disabling type-specific trans-formation results in sim 13 drop in performancedemonstrating the need for distinguishing nodetypes for QA tasks Our structured relational at-tention mechanism is also critical with its twosub-components contributing almost equally
Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders
20 40 60 80 100Percentage of Training Data
45
50
55
60
65
Accu
racy
()
MultiGRNKVMGconAttnwo KG
Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))
Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (Fig 6) Theincrease ofK continues to bring benefits untilK =
4 However performance begins to drop whenK gt 3 This might be attributed to exponentialnoise in longer relational paths in knowledge graph
65 70 75Accuracy ()
K=1
K=2
K=3
K=4
K=5
Reasoning Hops
7324
7393
7465
7365
7416
Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops
2 4 6 800
05
10
15
20
sba
tch
K Reasoning Hops
RGCNMHGRN
Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K
63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized
64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL
and the desired answer in the question
7 Related WorkKnowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017
Where is known for a multitude of wedding chapels
A town B texasC city D church building E Nevada
RenoChapel
PartOf
AtLocation
UsedForminus1 Synonym
AtLocation Synonym
Neveda
Baseball
Play_Baseball
HasProperty
UsedForminus1
Fun_to_Play
Why do parents encourage their kids to play baseball
A round B cheap C break window D hardE fun to play Fun_to_Play
HasProperty
Reno
Chapel
AtLocation
Neveda
PartOf
MultiGRNFigure 8 Case study on model interpretability Wepresent two sample questions from CommonsenseQAwith the reasoning paths output by MHGRN
Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding
Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs
Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains
8 Conclusion
We present a principled scalable method MHGRNthat can leverage general knowledge via multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hop
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
PlausibilityScore
Graph ExtractionFrom KG
Graph Encoder
Text Encoder
120022
119956119954
119938
Figure 3 Overview of the knowledge-aware QAframework It integrates the output from graph en-coder (for relational reasoning over contextual sub-graphs) and text encoder (for textual understanding) togenerate the plausibility score for an answer option
pre-defined relation types The mapping functionφ(i) ∶ V rarr T = EqEaEo takes node i isin V asinput and outputs Eq if i is an entity mentioned inq Ea if it is mentioned in a or Eo otherwise Wefinally encode the statement to s G to g concate-nate s and g for calculating the plausibility score
3 Background Multi-Relational GraphEncoding Methods
We leave encoding of s to pre-trained languagemodels and focus on the challenge of encodinggraph G to capture latent relations between enti-ties Current methods for encoding multi-relationalgraphs mainly fall into two categories GNNs andpath-based models GNNs encode structured in-formation by passing messages between nodes di-rectly operating on the graph structure while path-based methods first decompose the graph into pathsand then pool features over them
Graph Encoding with GNNs For a graph withn nodes a graph neural network (GNN) takes aset of node features h1h2 hn as input andcomputes their corresponding node embeddingshprime1hprime2 hprimen via message passing (Gilmeret al 2017) A compact graph representation forG can thus be obtained by pooling over the nodeembeddings hprimei
GNN(G) = Pool(hprime1hprime2 hprimen) (1)
As a notable variant of GNNs graph convo-lutional networks (GCNs) (Kipf and Welling2017) additionally update node embeddings byaggregating messages from its direct neighborsRGCNs (Schlichtkrull et al 2018) extend GCNs toencode multi-relational graphs by defining relation-
specific weight matrix Wr for each edge type
hprimei = σ
⎛⎜⎝(sumrisinR
∣N ri ∣)
minus1
sumrisinR
sumjisinN r
i
Wrhj⎞⎟⎠ (2)
where N ri denotes neighbors of node i under rela-
tion r2
While GNNs have proved to have good scala-bility their reasoning is done at the node levelmaking them incompatible with modeling pathsThis property also hinders the modelrsquos decisionsfrom being interpretable at the path levelGraph Encoding with Path-Based Models Inaddition to directly modeling the graph with GNNsa graph can also be viewed as a set of relationalpaths connecting pairs of entities
Relation Networks (RNs) (Santoro et al 2017)can be adapted to multi-relational graph encodingunder QA settings RNs use MLPs to encode alltriples (one-hop paths) in G whose head entity isin Q = j ∣ φ(j) = Eq and tail entity is inA = i ∣ φ(i) = Ea It then pools the tripleembeddings to generate a vector for G as follows
RN(G) = Pool(MLP(hj oplus eroplus
hi) ∣ j isin Q i isin A (j r i) isin E) (3)
Here hj and hi are features for nodes j and i er isthe embedding of relation r isin R oplus denotes vectorconcatenation
To further equip RN with the ability to modelnondegenerate paths KagNet (Lin et al 2019)adopts LSTMs to encode all paths connecting ques-tion entities and answer entities with lengths nomore than K It then aggregates all path embed-dings via attention mechanism
KAGNET(G) = Pool(LSTM(j r1 rk i) ∣
(j r1 j1)⋯ (jkminus1 rk i) isin E 1 le k le K)(4)
4 Proposed Method Multi-Hop GraphRelation Network (MHGRN)
This section presents Multi-hop Graph RelationNetwork (MHGRN) a novel GNN architecturethat unifies both GNNs and path-based modelsMHGRN inherits path-level reasoning and inter-pretabilty from path-based models while preserv-ing good scalability of GNNs
2For simplicity we assume a single graph convolutional
AttentivePooling
2 Multi-hop Message Passing
Node Features
Text Encodereg BERT
1 Node Type Specific Transform
3 Nonlinear Activation
timesLayer
PlausibilityScore
MLP
120022 119956
AggregateMessages
120630(119947 119955120783 hellip 119955119948 119946)
Structured Relational Att
Learn
Child
Child CapableOf UsedForminus1 AtLocation
Desires UsedForminus1 Synonym
AtLocation Synonym
Sit AtLocationUsedForminus1
Desk AtLocation
hellipMulti-hop Message Passing
Figure 4 Our proposed MHGRN architecturefor relational reasoning MHGRN takes a multi-relational graph G and a (question-answer) statementvector s as input and outputs a scalar that represent theplausibility score of this statement
41 MHGRN Model ArchitectureWe follow the GNN framework introduced in sect3where node features can be initialized with pre-trained weights (details in Appendix C) Here wefocus on the computation of node embeddings
Type-Specific Transformation To make ourmodel aware of the node type φ we first performnode type specific linear transformation on the in-put node features
xi = Uφ(i)hi + bφ(i) (5)
where the learnable parameters U and b are specificto the type of node i
Multi-Hop Message Passing As mentioned be-fore our motivation is to endow GNNs with thecapability of directly modeling paths To this endwe propose to pass messages directly over all therelational paths of lengths up to K The set of validk-hop relational paths is defined as
Φk = (j r1 rk i) ∣ (j r1 j1)⋯ (jkminus1 rk i) isin E (1 le k le K) (6)
We perform k-hop (1 le k le K) message passingover these paths which is a generalization of the
layer In practice multiple layers are stacked to enable mes-sage passing from multi-hop neighbors
single-hop message passing in RGCNs (see Eq 2)
zki = sum
(jr1rki)isinΦk
α(j r1 rk i)dki sdotWK0
⋯Wk+10 W
krk⋯W
1r1xj (1 le k le K) (7)
where the Wtr (1 le t le K 0 le r le m) matrices
are learnable3 α(j r1 rk i) is an attentionscore elaborated in sect42 and d
ki = sum(j⋯i)isinΦk
α(j⋯i)is the normalization factor The W k
rk⋯W1r1 ∣ 1 le
r1 rk le m matrices can be interpreted as thelow rank approximation of a mtimes⋯timesmktimesdtimesdtensor that assigns a separate transformation foreach k-hop relation where d is the dim of xi
Incoming messages from paths of differentlengths are aggregated via attention mecha-nism (Vaswani et al 2017)
zi =K
sumk=1
softmax(bilinear(s zki )) sdot zki (8)
Non-linear Activation Finally we apply shortcutconnection and nonlinear activation to obtain theoutput node embeddings
hprimei = σ (V hi + V
primezi) (9)
where V and Vprime are learnable model parameters
and σ is a non-linear activation function
42 Structured Relational AttentionHere we work towards effectively parameterizingthe attention score α(j r1 rk i) in Eq 7 for allk-hop paths without introducing O(mk) parametersWe first regard it as the probability of a relationsequence (φ(j) r1 rk φ(i)) conditioned on s
α(j r1 rk i) = p (φ(j) r1 rk φ(i) ∣ s) (10)
which can naturally be modeled by a probabilisticgraphical model such as conditional random field(Lafferty et al 2001)
p (⋯ ∣ s)prop exp(f(φ(j) s) +k
sumt=1
δ(rt s)
+kminus1
sumt=1
τ(rt rt+1) + g(φ(i) s))
∆= β(r1 rk s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIuml
Relation Type Attention
sdot γ(φ(j) φ(i) s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIumlNode Type Attention
(11)
3W
t0(0 le t le K) are introduced as padding matrices
so that K transformations are applied regardless of k thusensuring comparable scale of zki across different k
Model Time Space
G is a dense graph
K-hop KagNet O (mKnK+1
K) O (mKnK+1
K)K-layer RGCN O (mn2
K) O (mnK)MHGRN O (m2
n2K) O (mnK)
G is a sparse graph with maximum node degree ∆ ≪ n
K-hop KagNet O (mKnK∆
K) O (mKnK∆
K)K-layer RGCN O (mnK∆) O (mnK)MHGRN O (m2
nK∆) O (mnK)
Table 2 Computation complexity of different K-hopreasoning models on a densesparse multi-relationalgraph with n nodes and m relation types Despite thequadratic complexity wrt m MHGRNrsquos time cost issimilar to RGCN on GPUs with parallelizable matrixmultiplications (cf Fig 7)
where f(sdot) δ(sdot) and g(sdot) are parameterized bytwo-layer MLPs and τ(sdot) by a transition matrixof shape m timesm Intuitively β(sdot) models the im-portance of a k-hop relation while γ(sdot) models theimportance of messages from node type φ(j) toφ(i) (eg the model can learn to pass messagesonly from question entities to answer entities)
Our model scores a k-hop relation by decom-posing it into both context-aware single-hop re-lations (modeled by δ) and two-hop relations(modeled by τ ) We argue that τ is indis-pensable without which the model may assignhigh importance to illogical multi-hop relations(eg [AtLocation CapableOf]) or noisy re-lations (eg [RelatedTo RelatedTo])
43 Computation Complexity Analysis
Although message passing process in Eq 7 andattention module in Eq11 handles potentially ex-ponential number of paths it can be computed inlinear time with dynamic programming (see Ap-pendix D) As summarized in Table 2 time com-plexity and space complexity of MHGRN on asparse graph are both linear wrt either the maxi-mum path length K or the number of nodes n
44 Expressive Power of MHGRN
In addition to efficiency and scalability we now dis-cuss the modeling capacity of MHGRN With themessage passing formulation and relation-specifictransformations it is by nature the generalizationof RGCN It is also capable of directly modelingpaths making it interpretable as are path-basedmodels like RN and KagNet To show this we
first generalize RN (Eq 3) to the multi-hop set-ting and introduce K-hop RN (formal definition inAppendix E) which models multi-hop relation asthe composition of single-hop relations We showthat MHGRN is capable of representingK-hop RN(proof in Appendix F)
45 Learning Inference and Path DecodingWe now discuss the learning and inference processof MHGRN instantiated for QA tasks Followingthe problem formulation in sect2 we aim to deter-mine the plausibility of an answer option a isin Cgiven the question q with the information fromboth text s and graph G We first obtain the graphrepresentation g by performing attentive poolingover the output node embeddings of answer entitieshprimei ∣ i isin A Next we concatenate it with the textrepresentation s and compute the plausibility scoreby ρ(qa) = MLP(soplus g)
During training we maximize the plausibilityscore of the correct answer a by minimizing thecross-entropy loss
L = EqaC [minus logexp(ρ(q a))
sumaisinC exp(ρ(qa))] (12)
The whole model is trained end-to-end jointly withthe text encoder (eg RoBERTa)
During inference we predict the most plau-sible answer by argmaxaisinC ρ(qa) Addition-ally we can decode a reasoning path as evidencefor model predictions endowing our model withthe interpretability enjoyed by path-based mod-els Specifically we first determine the answerentity ilowast with the highest score in the pooling layerand the path length klowast with the highest score inEq 8 Then the reasoning path is decoded byargmax α(j r1 rklowast i
lowast) which can be com-puted in linear time using dynamic programming
5 Experimental SetupWe introduce how we construct G (sect51) thedatasets (sect52) as well as the baseline methods(sect53) Appendix C shows more implementationand experimental details for reproducibility
51 Extracting G from External KGWe use ConceptNet (Speer et al 2017) a general-domain knowledge graph as our external KG to testmodelsrsquo ability to harness structured knowledgesource Following KagNet (Lin et al 2019) wemerge relation types to increase graph density andadd reverse relations to construct a multi-relational
Methods BERT-Base BERT-Large RoBERTa-Large
IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()
wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)
RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5727 (plusmn070) 5484 (plusmn088) 6317 (plusmn018) 5736 (plusmn090) 7261( plusmn039) 6859 (plusmn096)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 6908 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)
MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)
Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper
Methods Single Ensemble
UnifiedQAdagger (Khashabi et al 2020) 791 -
RoBERTadagger 721 725RoBERTa + KEDGNdagger 725 744RoBERTa + KEdagger 733 -RoBERTa + HyKAS 20dagger (Ma et al 2019) 732 -RoBERTa + FreeLBdagger(Zhu et al 2020) 722 731XLNet + DREAMdagger 669 733XLNet + GRdagger (Lv et al 2019) 753 -ALBERTdagger (Lan et al 2019) - 765
RoBERTa + MHGRN (K = 2) 754 765
Table 4 Performance comparison on official test ofCommonsenseQA with leaderboard SoTAs4 (accuracyin ) dagger indicates reported results on leaderboard Uni-fiedQA uses T5-11B as text encoder whose number ofparameters is about 30 times more than other models
graph with 34 relation types (details in AppendixA) To extract an informative contextualized graphG from KG we recognize entity mentions in s andlink them to entities in ConceptNet with which weinitialize our node set V We then add to V all theentities that appear in any two-hop paths betweenpairs of mentioned entities Unlike KagNet we donot perform any pruning but instead reserve all theedges between nodes in V forming our G
52 Datasets
We evaluate models on two multiple-choice ques-tion answering datasets CommonsenseQA andOpenbookQA Both require world knowledge be-yond textual understanding to perform well
CommonsenseQA (Talmor et al 2019) neces-sitates various commonsense reasoning skills Thequestions are created with entities from ConceptNetand they are designed to probe latent compositionalrelations between entities in ConceptNet
OpenBookQA (Mihaylov et al 2018) provides
4Models based on ConceptNet are no longer shown on theleaderboard and we got our results from the organizers
elementary science questions together with an openbook of science facts This dataset also probesgeneral common sense beyond the provided facts
53 Compared Methods
We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge orextra training data as opposed to external KG Inall our implemented models we use pre-trainedLMs as text encoders for s for fair comparisonWe do compare our models with those (Ma et al2019 Lv et al 2019 Khashabi et al 2020) aug-mented by other text-form external knowledge (egWikipedia) although we stick to our focus of en-coding structured KG
Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in sect3) RN5 (Eq 3 in sect3)KagNet (Eq 4 in sect3) and GconAttn (Wanget al 2019) as baselines GconAttn generalizesmatch-LSTM (Wang and Jiang 2016) and achievessuccess in language inference tasks
6 Results and Discussions
In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics
5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix E)
Methods Dev () Test ()
T5-3Bdagger (Raffel et al 2019) - 8320UnifiedQAdagger (Khashabi et al 2020) - 8720
RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6430 (plusmn099) 6190 (plusmn244)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)
AristoRoBERTaV7dagger 792 778+ MHGRN (K = 3) 786 806
Table 5 Performance comparison on OpenbookQAdagger indicates reported results on leaderboard T5-3B is 8times larger than our models UnifiedQA is 30x larger
61 Main ResultsFor CommonsenseQA (Table 3) we first use thein-house data split (IH) of Lin et al (2019) (cfAppendix B) to compare our models with imple-mented baselines This is different from the offi-cial split used in the leaderboard methods Almostall KG-augmented models achieve performancegain over vanilla pre-trained LMs demonstratingthe value of external knowledge on this dataset Ad-ditionally we evaluate our MHGRN (with text en-coder being ROBERTA-LARGE) on official splitOF (Table 4) for fair comparison with other meth-ods on leaderboard in both single-model settingand ensemble-model setting With backbone be-ing T5 (Raffel et al 2019) UnifiedQA (Khashabiet al 2020) tops the leaderboard Consideringits training cost we do not intend to compare ourROBERTA-based model with it We achieve thebest performances among other models
For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder MHGRN surpasses all implemented base-lines with an absolute increase of sim2 on TestAlso as our approach is naturally compatible withthe methods that utilize textual knowledge or ex-tra data because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) To empirically show MHGRNcan bring gain over textual-knowledge empow-ered systems we replace our text encoder withAristoRoBERTaV76 and fine-tune our MHGRNupon OpenbookQA Empirically MHGRN contin-ues to bring benefit to strong-performing textual-knowledge empowered systems This indicatestextual knowledge and structured knowledge canpotentially be complementary
6httpsleaderboardallenaiorgopen_book_qasubmissionblcp1tu91i4gm0vf484g
Methods IHdev-Acc ()
MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)
Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA
62 Performance Analysis
Ablation Study on Model Components Asshown in Table 6 disabling type-specific trans-formation results in sim 13 drop in performancedemonstrating the need for distinguishing nodetypes for QA tasks Our structured relational at-tention mechanism is also critical with its twosub-components contributing almost equally
Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders
20 40 60 80 100Percentage of Training Data
45
50
55
60
65
Accu
racy
()
MultiGRNKVMGconAttnwo KG
Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))
Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (Fig 6) Theincrease ofK continues to bring benefits untilK =
4 However performance begins to drop whenK gt 3 This might be attributed to exponentialnoise in longer relational paths in knowledge graph
65 70 75Accuracy ()
K=1
K=2
K=3
K=4
K=5
Reasoning Hops
7324
7393
7465
7365
7416
Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops
2 4 6 800
05
10
15
20
sba
tch
K Reasoning Hops
RGCNMHGRN
Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K
63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized
64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL
and the desired answer in the question
7 Related WorkKnowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017
Where is known for a multitude of wedding chapels
A town B texasC city D church building E Nevada
RenoChapel
PartOf
AtLocation
UsedForminus1 Synonym
AtLocation Synonym
Neveda
Baseball
Play_Baseball
HasProperty
UsedForminus1
Fun_to_Play
Why do parents encourage their kids to play baseball
A round B cheap C break window D hardE fun to play Fun_to_Play
HasProperty
Reno
Chapel
AtLocation
Neveda
PartOf
MultiGRNFigure 8 Case study on model interpretability Wepresent two sample questions from CommonsenseQAwith the reasoning paths output by MHGRN
Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding
Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs
Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains
8 Conclusion
We present a principled scalable method MHGRNthat can leverage general knowledge via multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hop
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
AttentivePooling
2 Multi-hop Message Passing
Node Features
Text Encodereg BERT
1 Node Type Specific Transform
3 Nonlinear Activation
timesLayer
PlausibilityScore
MLP
120022 119956
AggregateMessages
120630(119947 119955120783 hellip 119955119948 119946)
Structured Relational Att
Learn
Child
Child CapableOf UsedForminus1 AtLocation
Desires UsedForminus1 Synonym
AtLocation Synonym
Sit AtLocationUsedForminus1
Desk AtLocation
hellipMulti-hop Message Passing
Figure 4 Our proposed MHGRN architecturefor relational reasoning MHGRN takes a multi-relational graph G and a (question-answer) statementvector s as input and outputs a scalar that represent theplausibility score of this statement
41 MHGRN Model ArchitectureWe follow the GNN framework introduced in sect3where node features can be initialized with pre-trained weights (details in Appendix C) Here wefocus on the computation of node embeddings
Type-Specific Transformation To make ourmodel aware of the node type φ we first performnode type specific linear transformation on the in-put node features
xi = Uφ(i)hi + bφ(i) (5)
where the learnable parameters U and b are specificto the type of node i
Multi-Hop Message Passing As mentioned be-fore our motivation is to endow GNNs with thecapability of directly modeling paths To this endwe propose to pass messages directly over all therelational paths of lengths up to K The set of validk-hop relational paths is defined as
Φk = (j r1 rk i) ∣ (j r1 j1)⋯ (jkminus1 rk i) isin E (1 le k le K) (6)
We perform k-hop (1 le k le K) message passingover these paths which is a generalization of the
layer In practice multiple layers are stacked to enable mes-sage passing from multi-hop neighbors
single-hop message passing in RGCNs (see Eq 2)
zki = sum
(jr1rki)isinΦk
α(j r1 rk i)dki sdotWK0
⋯Wk+10 W
krk⋯W
1r1xj (1 le k le K) (7)
where the Wtr (1 le t le K 0 le r le m) matrices
are learnable3 α(j r1 rk i) is an attentionscore elaborated in sect42 and d
ki = sum(j⋯i)isinΦk
α(j⋯i)is the normalization factor The W k
rk⋯W1r1 ∣ 1 le
r1 rk le m matrices can be interpreted as thelow rank approximation of a mtimes⋯timesmktimesdtimesdtensor that assigns a separate transformation foreach k-hop relation where d is the dim of xi
Incoming messages from paths of differentlengths are aggregated via attention mecha-nism (Vaswani et al 2017)
zi =K
sumk=1
softmax(bilinear(s zki )) sdot zki (8)
Non-linear Activation Finally we apply shortcutconnection and nonlinear activation to obtain theoutput node embeddings
hprimei = σ (V hi + V
primezi) (9)
where V and Vprime are learnable model parameters
and σ is a non-linear activation function
42 Structured Relational AttentionHere we work towards effectively parameterizingthe attention score α(j r1 rk i) in Eq 7 for allk-hop paths without introducing O(mk) parametersWe first regard it as the probability of a relationsequence (φ(j) r1 rk φ(i)) conditioned on s
α(j r1 rk i) = p (φ(j) r1 rk φ(i) ∣ s) (10)
which can naturally be modeled by a probabilisticgraphical model such as conditional random field(Lafferty et al 2001)
p (⋯ ∣ s)prop exp(f(φ(j) s) +k
sumt=1
δ(rt s)
+kminus1
sumt=1
τ(rt rt+1) + g(φ(i) s))
∆= β(r1 rk s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIuml
Relation Type Attention
sdot γ(φ(j) φ(i) s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIumlNode Type Attention
(11)
3W
t0(0 le t le K) are introduced as padding matrices
so that K transformations are applied regardless of k thusensuring comparable scale of zki across different k
Model Time Space
G is a dense graph
K-hop KagNet O (mKnK+1
K) O (mKnK+1
K)K-layer RGCN O (mn2
K) O (mnK)MHGRN O (m2
n2K) O (mnK)
G is a sparse graph with maximum node degree ∆ ≪ n
K-hop KagNet O (mKnK∆
K) O (mKnK∆
K)K-layer RGCN O (mnK∆) O (mnK)MHGRN O (m2
nK∆) O (mnK)
Table 2 Computation complexity of different K-hopreasoning models on a densesparse multi-relationalgraph with n nodes and m relation types Despite thequadratic complexity wrt m MHGRNrsquos time cost issimilar to RGCN on GPUs with parallelizable matrixmultiplications (cf Fig 7)
where f(sdot) δ(sdot) and g(sdot) are parameterized bytwo-layer MLPs and τ(sdot) by a transition matrixof shape m timesm Intuitively β(sdot) models the im-portance of a k-hop relation while γ(sdot) models theimportance of messages from node type φ(j) toφ(i) (eg the model can learn to pass messagesonly from question entities to answer entities)
Our model scores a k-hop relation by decom-posing it into both context-aware single-hop re-lations (modeled by δ) and two-hop relations(modeled by τ ) We argue that τ is indis-pensable without which the model may assignhigh importance to illogical multi-hop relations(eg [AtLocation CapableOf]) or noisy re-lations (eg [RelatedTo RelatedTo])
43 Computation Complexity Analysis
Although message passing process in Eq 7 andattention module in Eq11 handles potentially ex-ponential number of paths it can be computed inlinear time with dynamic programming (see Ap-pendix D) As summarized in Table 2 time com-plexity and space complexity of MHGRN on asparse graph are both linear wrt either the maxi-mum path length K or the number of nodes n
44 Expressive Power of MHGRN
In addition to efficiency and scalability we now dis-cuss the modeling capacity of MHGRN With themessage passing formulation and relation-specifictransformations it is by nature the generalizationof RGCN It is also capable of directly modelingpaths making it interpretable as are path-basedmodels like RN and KagNet To show this we
first generalize RN (Eq 3) to the multi-hop set-ting and introduce K-hop RN (formal definition inAppendix E) which models multi-hop relation asthe composition of single-hop relations We showthat MHGRN is capable of representingK-hop RN(proof in Appendix F)
45 Learning Inference and Path DecodingWe now discuss the learning and inference processof MHGRN instantiated for QA tasks Followingthe problem formulation in sect2 we aim to deter-mine the plausibility of an answer option a isin Cgiven the question q with the information fromboth text s and graph G We first obtain the graphrepresentation g by performing attentive poolingover the output node embeddings of answer entitieshprimei ∣ i isin A Next we concatenate it with the textrepresentation s and compute the plausibility scoreby ρ(qa) = MLP(soplus g)
During training we maximize the plausibilityscore of the correct answer a by minimizing thecross-entropy loss
L = EqaC [minus logexp(ρ(q a))
sumaisinC exp(ρ(qa))] (12)
The whole model is trained end-to-end jointly withthe text encoder (eg RoBERTa)
During inference we predict the most plau-sible answer by argmaxaisinC ρ(qa) Addition-ally we can decode a reasoning path as evidencefor model predictions endowing our model withthe interpretability enjoyed by path-based mod-els Specifically we first determine the answerentity ilowast with the highest score in the pooling layerand the path length klowast with the highest score inEq 8 Then the reasoning path is decoded byargmax α(j r1 rklowast i
lowast) which can be com-puted in linear time using dynamic programming
5 Experimental SetupWe introduce how we construct G (sect51) thedatasets (sect52) as well as the baseline methods(sect53) Appendix C shows more implementationand experimental details for reproducibility
51 Extracting G from External KGWe use ConceptNet (Speer et al 2017) a general-domain knowledge graph as our external KG to testmodelsrsquo ability to harness structured knowledgesource Following KagNet (Lin et al 2019) wemerge relation types to increase graph density andadd reverse relations to construct a multi-relational
Methods BERT-Base BERT-Large RoBERTa-Large
IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()
wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)
RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5727 (plusmn070) 5484 (plusmn088) 6317 (plusmn018) 5736 (plusmn090) 7261( plusmn039) 6859 (plusmn096)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 6908 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)
MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)
Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper
Methods Single Ensemble
UnifiedQAdagger (Khashabi et al 2020) 791 -
RoBERTadagger 721 725RoBERTa + KEDGNdagger 725 744RoBERTa + KEdagger 733 -RoBERTa + HyKAS 20dagger (Ma et al 2019) 732 -RoBERTa + FreeLBdagger(Zhu et al 2020) 722 731XLNet + DREAMdagger 669 733XLNet + GRdagger (Lv et al 2019) 753 -ALBERTdagger (Lan et al 2019) - 765
RoBERTa + MHGRN (K = 2) 754 765
Table 4 Performance comparison on official test ofCommonsenseQA with leaderboard SoTAs4 (accuracyin ) dagger indicates reported results on leaderboard Uni-fiedQA uses T5-11B as text encoder whose number ofparameters is about 30 times more than other models
graph with 34 relation types (details in AppendixA) To extract an informative contextualized graphG from KG we recognize entity mentions in s andlink them to entities in ConceptNet with which weinitialize our node set V We then add to V all theentities that appear in any two-hop paths betweenpairs of mentioned entities Unlike KagNet we donot perform any pruning but instead reserve all theedges between nodes in V forming our G
52 Datasets
We evaluate models on two multiple-choice ques-tion answering datasets CommonsenseQA andOpenbookQA Both require world knowledge be-yond textual understanding to perform well
CommonsenseQA (Talmor et al 2019) neces-sitates various commonsense reasoning skills Thequestions are created with entities from ConceptNetand they are designed to probe latent compositionalrelations between entities in ConceptNet
OpenBookQA (Mihaylov et al 2018) provides
4Models based on ConceptNet are no longer shown on theleaderboard and we got our results from the organizers
elementary science questions together with an openbook of science facts This dataset also probesgeneral common sense beyond the provided facts
53 Compared Methods
We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge orextra training data as opposed to external KG Inall our implemented models we use pre-trainedLMs as text encoders for s for fair comparisonWe do compare our models with those (Ma et al2019 Lv et al 2019 Khashabi et al 2020) aug-mented by other text-form external knowledge (egWikipedia) although we stick to our focus of en-coding structured KG
Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in sect3) RN5 (Eq 3 in sect3)KagNet (Eq 4 in sect3) and GconAttn (Wanget al 2019) as baselines GconAttn generalizesmatch-LSTM (Wang and Jiang 2016) and achievessuccess in language inference tasks
6 Results and Discussions
In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics
5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix E)
Methods Dev () Test ()
T5-3Bdagger (Raffel et al 2019) - 8320UnifiedQAdagger (Khashabi et al 2020) - 8720
RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6430 (plusmn099) 6190 (plusmn244)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)
AristoRoBERTaV7dagger 792 778+ MHGRN (K = 3) 786 806
Table 5 Performance comparison on OpenbookQAdagger indicates reported results on leaderboard T5-3B is 8times larger than our models UnifiedQA is 30x larger
61 Main ResultsFor CommonsenseQA (Table 3) we first use thein-house data split (IH) of Lin et al (2019) (cfAppendix B) to compare our models with imple-mented baselines This is different from the offi-cial split used in the leaderboard methods Almostall KG-augmented models achieve performancegain over vanilla pre-trained LMs demonstratingthe value of external knowledge on this dataset Ad-ditionally we evaluate our MHGRN (with text en-coder being ROBERTA-LARGE) on official splitOF (Table 4) for fair comparison with other meth-ods on leaderboard in both single-model settingand ensemble-model setting With backbone be-ing T5 (Raffel et al 2019) UnifiedQA (Khashabiet al 2020) tops the leaderboard Consideringits training cost we do not intend to compare ourROBERTA-based model with it We achieve thebest performances among other models
For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder MHGRN surpasses all implemented base-lines with an absolute increase of sim2 on TestAlso as our approach is naturally compatible withthe methods that utilize textual knowledge or ex-tra data because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) To empirically show MHGRNcan bring gain over textual-knowledge empow-ered systems we replace our text encoder withAristoRoBERTaV76 and fine-tune our MHGRNupon OpenbookQA Empirically MHGRN contin-ues to bring benefit to strong-performing textual-knowledge empowered systems This indicatestextual knowledge and structured knowledge canpotentially be complementary
6httpsleaderboardallenaiorgopen_book_qasubmissionblcp1tu91i4gm0vf484g
Methods IHdev-Acc ()
MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)
Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA
62 Performance Analysis
Ablation Study on Model Components Asshown in Table 6 disabling type-specific trans-formation results in sim 13 drop in performancedemonstrating the need for distinguishing nodetypes for QA tasks Our structured relational at-tention mechanism is also critical with its twosub-components contributing almost equally
Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders
20 40 60 80 100Percentage of Training Data
45
50
55
60
65
Accu
racy
()
MultiGRNKVMGconAttnwo KG
Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))
Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (Fig 6) Theincrease ofK continues to bring benefits untilK =
4 However performance begins to drop whenK gt 3 This might be attributed to exponentialnoise in longer relational paths in knowledge graph
65 70 75Accuracy ()
K=1
K=2
K=3
K=4
K=5
Reasoning Hops
7324
7393
7465
7365
7416
Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops
2 4 6 800
05
10
15
20
sba
tch
K Reasoning Hops
RGCNMHGRN
Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K
63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized
64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL
and the desired answer in the question
7 Related WorkKnowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017
Where is known for a multitude of wedding chapels
A town B texasC city D church building E Nevada
RenoChapel
PartOf
AtLocation
UsedForminus1 Synonym
AtLocation Synonym
Neveda
Baseball
Play_Baseball
HasProperty
UsedForminus1
Fun_to_Play
Why do parents encourage their kids to play baseball
A round B cheap C break window D hardE fun to play Fun_to_Play
HasProperty
Reno
Chapel
AtLocation
Neveda
PartOf
MultiGRNFigure 8 Case study on model interpretability Wepresent two sample questions from CommonsenseQAwith the reasoning paths output by MHGRN
Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding
Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs
Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains
8 Conclusion
We present a principled scalable method MHGRNthat can leverage general knowledge via multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hop
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
Model Time Space
G is a dense graph
K-hop KagNet O (mKnK+1
K) O (mKnK+1
K)K-layer RGCN O (mn2
K) O (mnK)MHGRN O (m2
n2K) O (mnK)
G is a sparse graph with maximum node degree ∆ ≪ n
K-hop KagNet O (mKnK∆
K) O (mKnK∆
K)K-layer RGCN O (mnK∆) O (mnK)MHGRN O (m2
nK∆) O (mnK)
Table 2 Computation complexity of different K-hopreasoning models on a densesparse multi-relationalgraph with n nodes and m relation types Despite thequadratic complexity wrt m MHGRNrsquos time cost issimilar to RGCN on GPUs with parallelizable matrixmultiplications (cf Fig 7)
where f(sdot) δ(sdot) and g(sdot) are parameterized bytwo-layer MLPs and τ(sdot) by a transition matrixof shape m timesm Intuitively β(sdot) models the im-portance of a k-hop relation while γ(sdot) models theimportance of messages from node type φ(j) toφ(i) (eg the model can learn to pass messagesonly from question entities to answer entities)
Our model scores a k-hop relation by decom-posing it into both context-aware single-hop re-lations (modeled by δ) and two-hop relations(modeled by τ ) We argue that τ is indis-pensable without which the model may assignhigh importance to illogical multi-hop relations(eg [AtLocation CapableOf]) or noisy re-lations (eg [RelatedTo RelatedTo])
43 Computation Complexity Analysis
Although message passing process in Eq 7 andattention module in Eq11 handles potentially ex-ponential number of paths it can be computed inlinear time with dynamic programming (see Ap-pendix D) As summarized in Table 2 time com-plexity and space complexity of MHGRN on asparse graph are both linear wrt either the maxi-mum path length K or the number of nodes n
44 Expressive Power of MHGRN
In addition to efficiency and scalability we now dis-cuss the modeling capacity of MHGRN With themessage passing formulation and relation-specifictransformations it is by nature the generalizationof RGCN It is also capable of directly modelingpaths making it interpretable as are path-basedmodels like RN and KagNet To show this we
first generalize RN (Eq 3) to the multi-hop set-ting and introduce K-hop RN (formal definition inAppendix E) which models multi-hop relation asthe composition of single-hop relations We showthat MHGRN is capable of representingK-hop RN(proof in Appendix F)
45 Learning Inference and Path DecodingWe now discuss the learning and inference processof MHGRN instantiated for QA tasks Followingthe problem formulation in sect2 we aim to deter-mine the plausibility of an answer option a isin Cgiven the question q with the information fromboth text s and graph G We first obtain the graphrepresentation g by performing attentive poolingover the output node embeddings of answer entitieshprimei ∣ i isin A Next we concatenate it with the textrepresentation s and compute the plausibility scoreby ρ(qa) = MLP(soplus g)
During training we maximize the plausibilityscore of the correct answer a by minimizing thecross-entropy loss
L = EqaC [minus logexp(ρ(q a))
sumaisinC exp(ρ(qa))] (12)
The whole model is trained end-to-end jointly withthe text encoder (eg RoBERTa)
During inference we predict the most plau-sible answer by argmaxaisinC ρ(qa) Addition-ally we can decode a reasoning path as evidencefor model predictions endowing our model withthe interpretability enjoyed by path-based mod-els Specifically we first determine the answerentity ilowast with the highest score in the pooling layerand the path length klowast with the highest score inEq 8 Then the reasoning path is decoded byargmax α(j r1 rklowast i
lowast) which can be com-puted in linear time using dynamic programming
5 Experimental SetupWe introduce how we construct G (sect51) thedatasets (sect52) as well as the baseline methods(sect53) Appendix C shows more implementationand experimental details for reproducibility
51 Extracting G from External KGWe use ConceptNet (Speer et al 2017) a general-domain knowledge graph as our external KG to testmodelsrsquo ability to harness structured knowledgesource Following KagNet (Lin et al 2019) wemerge relation types to increase graph density andadd reverse relations to construct a multi-relational
Methods BERT-Base BERT-Large RoBERTa-Large
IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()
wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)
RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5727 (plusmn070) 5484 (plusmn088) 6317 (plusmn018) 5736 (plusmn090) 7261( plusmn039) 6859 (plusmn096)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 6908 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)
MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)
Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper
Methods Single Ensemble
UnifiedQAdagger (Khashabi et al 2020) 791 -
RoBERTadagger 721 725RoBERTa + KEDGNdagger 725 744RoBERTa + KEdagger 733 -RoBERTa + HyKAS 20dagger (Ma et al 2019) 732 -RoBERTa + FreeLBdagger(Zhu et al 2020) 722 731XLNet + DREAMdagger 669 733XLNet + GRdagger (Lv et al 2019) 753 -ALBERTdagger (Lan et al 2019) - 765
RoBERTa + MHGRN (K = 2) 754 765
Table 4 Performance comparison on official test ofCommonsenseQA with leaderboard SoTAs4 (accuracyin ) dagger indicates reported results on leaderboard Uni-fiedQA uses T5-11B as text encoder whose number ofparameters is about 30 times more than other models
graph with 34 relation types (details in AppendixA) To extract an informative contextualized graphG from KG we recognize entity mentions in s andlink them to entities in ConceptNet with which weinitialize our node set V We then add to V all theentities that appear in any two-hop paths betweenpairs of mentioned entities Unlike KagNet we donot perform any pruning but instead reserve all theedges between nodes in V forming our G
52 Datasets
We evaluate models on two multiple-choice ques-tion answering datasets CommonsenseQA andOpenbookQA Both require world knowledge be-yond textual understanding to perform well
CommonsenseQA (Talmor et al 2019) neces-sitates various commonsense reasoning skills Thequestions are created with entities from ConceptNetand they are designed to probe latent compositionalrelations between entities in ConceptNet
OpenBookQA (Mihaylov et al 2018) provides
4Models based on ConceptNet are no longer shown on theleaderboard and we got our results from the organizers
elementary science questions together with an openbook of science facts This dataset also probesgeneral common sense beyond the provided facts
53 Compared Methods
We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge orextra training data as opposed to external KG Inall our implemented models we use pre-trainedLMs as text encoders for s for fair comparisonWe do compare our models with those (Ma et al2019 Lv et al 2019 Khashabi et al 2020) aug-mented by other text-form external knowledge (egWikipedia) although we stick to our focus of en-coding structured KG
Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in sect3) RN5 (Eq 3 in sect3)KagNet (Eq 4 in sect3) and GconAttn (Wanget al 2019) as baselines GconAttn generalizesmatch-LSTM (Wang and Jiang 2016) and achievessuccess in language inference tasks
6 Results and Discussions
In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics
5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix E)
Methods Dev () Test ()
T5-3Bdagger (Raffel et al 2019) - 8320UnifiedQAdagger (Khashabi et al 2020) - 8720
RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6430 (plusmn099) 6190 (plusmn244)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)
AristoRoBERTaV7dagger 792 778+ MHGRN (K = 3) 786 806
Table 5 Performance comparison on OpenbookQAdagger indicates reported results on leaderboard T5-3B is 8times larger than our models UnifiedQA is 30x larger
61 Main ResultsFor CommonsenseQA (Table 3) we first use thein-house data split (IH) of Lin et al (2019) (cfAppendix B) to compare our models with imple-mented baselines This is different from the offi-cial split used in the leaderboard methods Almostall KG-augmented models achieve performancegain over vanilla pre-trained LMs demonstratingthe value of external knowledge on this dataset Ad-ditionally we evaluate our MHGRN (with text en-coder being ROBERTA-LARGE) on official splitOF (Table 4) for fair comparison with other meth-ods on leaderboard in both single-model settingand ensemble-model setting With backbone be-ing T5 (Raffel et al 2019) UnifiedQA (Khashabiet al 2020) tops the leaderboard Consideringits training cost we do not intend to compare ourROBERTA-based model with it We achieve thebest performances among other models
For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder MHGRN surpasses all implemented base-lines with an absolute increase of sim2 on TestAlso as our approach is naturally compatible withthe methods that utilize textual knowledge or ex-tra data because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) To empirically show MHGRNcan bring gain over textual-knowledge empow-ered systems we replace our text encoder withAristoRoBERTaV76 and fine-tune our MHGRNupon OpenbookQA Empirically MHGRN contin-ues to bring benefit to strong-performing textual-knowledge empowered systems This indicatestextual knowledge and structured knowledge canpotentially be complementary
6httpsleaderboardallenaiorgopen_book_qasubmissionblcp1tu91i4gm0vf484g
Methods IHdev-Acc ()
MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)
Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA
62 Performance Analysis
Ablation Study on Model Components Asshown in Table 6 disabling type-specific trans-formation results in sim 13 drop in performancedemonstrating the need for distinguishing nodetypes for QA tasks Our structured relational at-tention mechanism is also critical with its twosub-components contributing almost equally
Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders
20 40 60 80 100Percentage of Training Data
45
50
55
60
65
Accu
racy
()
MultiGRNKVMGconAttnwo KG
Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))
Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (Fig 6) Theincrease ofK continues to bring benefits untilK =
4 However performance begins to drop whenK gt 3 This might be attributed to exponentialnoise in longer relational paths in knowledge graph
65 70 75Accuracy ()
K=1
K=2
K=3
K=4
K=5
Reasoning Hops
7324
7393
7465
7365
7416
Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops
2 4 6 800
05
10
15
20
sba
tch
K Reasoning Hops
RGCNMHGRN
Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K
63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized
64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL
and the desired answer in the question
7 Related WorkKnowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017
Where is known for a multitude of wedding chapels
A town B texasC city D church building E Nevada
RenoChapel
PartOf
AtLocation
UsedForminus1 Synonym
AtLocation Synonym
Neveda
Baseball
Play_Baseball
HasProperty
UsedForminus1
Fun_to_Play
Why do parents encourage their kids to play baseball
A round B cheap C break window D hardE fun to play Fun_to_Play
HasProperty
Reno
Chapel
AtLocation
Neveda
PartOf
MultiGRNFigure 8 Case study on model interpretability Wepresent two sample questions from CommonsenseQAwith the reasoning paths output by MHGRN
Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding
Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs
Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains
8 Conclusion
We present a principled scalable method MHGRNthat can leverage general knowledge via multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hop
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
Methods BERT-Base BERT-Large RoBERTa-Large
IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()
wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)
RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5727 (plusmn070) 5484 (plusmn088) 6317 (plusmn018) 5736 (plusmn090) 7261( plusmn039) 6859 (plusmn096)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 6908 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)
MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)
Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper
Methods Single Ensemble
UnifiedQAdagger (Khashabi et al 2020) 791 -
RoBERTadagger 721 725RoBERTa + KEDGNdagger 725 744RoBERTa + KEdagger 733 -RoBERTa + HyKAS 20dagger (Ma et al 2019) 732 -RoBERTa + FreeLBdagger(Zhu et al 2020) 722 731XLNet + DREAMdagger 669 733XLNet + GRdagger (Lv et al 2019) 753 -ALBERTdagger (Lan et al 2019) - 765
RoBERTa + MHGRN (K = 2) 754 765
Table 4 Performance comparison on official test ofCommonsenseQA with leaderboard SoTAs4 (accuracyin ) dagger indicates reported results on leaderboard Uni-fiedQA uses T5-11B as text encoder whose number ofparameters is about 30 times more than other models
graph with 34 relation types (details in AppendixA) To extract an informative contextualized graphG from KG we recognize entity mentions in s andlink them to entities in ConceptNet with which weinitialize our node set V We then add to V all theentities that appear in any two-hop paths betweenpairs of mentioned entities Unlike KagNet we donot perform any pruning but instead reserve all theedges between nodes in V forming our G
52 Datasets
We evaluate models on two multiple-choice ques-tion answering datasets CommonsenseQA andOpenbookQA Both require world knowledge be-yond textual understanding to perform well
CommonsenseQA (Talmor et al 2019) neces-sitates various commonsense reasoning skills Thequestions are created with entities from ConceptNetand they are designed to probe latent compositionalrelations between entities in ConceptNet
OpenBookQA (Mihaylov et al 2018) provides
4Models based on ConceptNet are no longer shown on theleaderboard and we got our results from the organizers
elementary science questions together with an openbook of science facts This dataset also probesgeneral common sense beyond the provided facts
53 Compared Methods
We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge orextra training data as opposed to external KG Inall our implemented models we use pre-trainedLMs as text encoders for s for fair comparisonWe do compare our models with those (Ma et al2019 Lv et al 2019 Khashabi et al 2020) aug-mented by other text-form external knowledge (egWikipedia) although we stick to our focus of en-coding structured KG
Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in sect3) RN5 (Eq 3 in sect3)KagNet (Eq 4 in sect3) and GconAttn (Wanget al 2019) as baselines GconAttn generalizesmatch-LSTM (Wang and Jiang 2016) and achievessuccess in language inference tasks
6 Results and Discussions
In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics
5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix E)
Methods Dev () Test ()
T5-3Bdagger (Raffel et al 2019) - 8320UnifiedQAdagger (Khashabi et al 2020) - 8720
RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6430 (plusmn099) 6190 (plusmn244)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)
AristoRoBERTaV7dagger 792 778+ MHGRN (K = 3) 786 806
Table 5 Performance comparison on OpenbookQAdagger indicates reported results on leaderboard T5-3B is 8times larger than our models UnifiedQA is 30x larger
61 Main ResultsFor CommonsenseQA (Table 3) we first use thein-house data split (IH) of Lin et al (2019) (cfAppendix B) to compare our models with imple-mented baselines This is different from the offi-cial split used in the leaderboard methods Almostall KG-augmented models achieve performancegain over vanilla pre-trained LMs demonstratingthe value of external knowledge on this dataset Ad-ditionally we evaluate our MHGRN (with text en-coder being ROBERTA-LARGE) on official splitOF (Table 4) for fair comparison with other meth-ods on leaderboard in both single-model settingand ensemble-model setting With backbone be-ing T5 (Raffel et al 2019) UnifiedQA (Khashabiet al 2020) tops the leaderboard Consideringits training cost we do not intend to compare ourROBERTA-based model with it We achieve thebest performances among other models
For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder MHGRN surpasses all implemented base-lines with an absolute increase of sim2 on TestAlso as our approach is naturally compatible withthe methods that utilize textual knowledge or ex-tra data because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) To empirically show MHGRNcan bring gain over textual-knowledge empow-ered systems we replace our text encoder withAristoRoBERTaV76 and fine-tune our MHGRNupon OpenbookQA Empirically MHGRN contin-ues to bring benefit to strong-performing textual-knowledge empowered systems This indicatestextual knowledge and structured knowledge canpotentially be complementary
6httpsleaderboardallenaiorgopen_book_qasubmissionblcp1tu91i4gm0vf484g
Methods IHdev-Acc ()
MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)
Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA
62 Performance Analysis
Ablation Study on Model Components Asshown in Table 6 disabling type-specific trans-formation results in sim 13 drop in performancedemonstrating the need for distinguishing nodetypes for QA tasks Our structured relational at-tention mechanism is also critical with its twosub-components contributing almost equally
Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders
20 40 60 80 100Percentage of Training Data
45
50
55
60
65
Accu
racy
()
MultiGRNKVMGconAttnwo KG
Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))
Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (Fig 6) Theincrease ofK continues to bring benefits untilK =
4 However performance begins to drop whenK gt 3 This might be attributed to exponentialnoise in longer relational paths in knowledge graph
65 70 75Accuracy ()
K=1
K=2
K=3
K=4
K=5
Reasoning Hops
7324
7393
7465
7365
7416
Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops
2 4 6 800
05
10
15
20
sba
tch
K Reasoning Hops
RGCNMHGRN
Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K
63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized
64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL
and the desired answer in the question
7 Related WorkKnowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017
Where is known for a multitude of wedding chapels
A town B texasC city D church building E Nevada
RenoChapel
PartOf
AtLocation
UsedForminus1 Synonym
AtLocation Synonym
Neveda
Baseball
Play_Baseball
HasProperty
UsedForminus1
Fun_to_Play
Why do parents encourage their kids to play baseball
A round B cheap C break window D hardE fun to play Fun_to_Play
HasProperty
Reno
Chapel
AtLocation
Neveda
PartOf
MultiGRNFigure 8 Case study on model interpretability Wepresent two sample questions from CommonsenseQAwith the reasoning paths output by MHGRN
Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding
Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs
Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains
8 Conclusion
We present a principled scalable method MHGRNthat can leverage general knowledge via multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hop
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
Methods Dev () Test ()
T5-3Bdagger (Raffel et al 2019) - 8320UnifiedQAdagger (Khashabi et al 2020) - 8720
RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6430 (plusmn099) 6190 (plusmn244)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)
AristoRoBERTaV7dagger 792 778+ MHGRN (K = 3) 786 806
Table 5 Performance comparison on OpenbookQAdagger indicates reported results on leaderboard T5-3B is 8times larger than our models UnifiedQA is 30x larger
61 Main ResultsFor CommonsenseQA (Table 3) we first use thein-house data split (IH) of Lin et al (2019) (cfAppendix B) to compare our models with imple-mented baselines This is different from the offi-cial split used in the leaderboard methods Almostall KG-augmented models achieve performancegain over vanilla pre-trained LMs demonstratingthe value of external knowledge on this dataset Ad-ditionally we evaluate our MHGRN (with text en-coder being ROBERTA-LARGE) on official splitOF (Table 4) for fair comparison with other meth-ods on leaderboard in both single-model settingand ensemble-model setting With backbone be-ing T5 (Raffel et al 2019) UnifiedQA (Khashabiet al 2020) tops the leaderboard Consideringits training cost we do not intend to compare ourROBERTA-based model with it We achieve thebest performances among other models
For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder MHGRN surpasses all implemented base-lines with an absolute increase of sim2 on TestAlso as our approach is naturally compatible withthe methods that utilize textual knowledge or ex-tra data because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) To empirically show MHGRNcan bring gain over textual-knowledge empow-ered systems we replace our text encoder withAristoRoBERTaV76 and fine-tune our MHGRNupon OpenbookQA Empirically MHGRN contin-ues to bring benefit to strong-performing textual-knowledge empowered systems This indicatestextual knowledge and structured knowledge canpotentially be complementary
6httpsleaderboardallenaiorgopen_book_qasubmissionblcp1tu91i4gm0vf484g
Methods IHdev-Acc ()
MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)
Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA
62 Performance Analysis
Ablation Study on Model Components Asshown in Table 6 disabling type-specific trans-formation results in sim 13 drop in performancedemonstrating the need for distinguishing nodetypes for QA tasks Our structured relational at-tention mechanism is also critical with its twosub-components contributing almost equally
Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders
20 40 60 80 100Percentage of Training Data
45
50
55
60
65
Accu
racy
()
MultiGRNKVMGconAttnwo KG
Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))
Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (Fig 6) Theincrease ofK continues to bring benefits untilK =
4 However performance begins to drop whenK gt 3 This might be attributed to exponentialnoise in longer relational paths in knowledge graph
65 70 75Accuracy ()
K=1
K=2
K=3
K=4
K=5
Reasoning Hops
7324
7393
7465
7365
7416
Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops
2 4 6 800
05
10
15
20
sba
tch
K Reasoning Hops
RGCNMHGRN
Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K
63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized
64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL
and the desired answer in the question
7 Related WorkKnowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017
Where is known for a multitude of wedding chapels
A town B texasC city D church building E Nevada
RenoChapel
PartOf
AtLocation
UsedForminus1 Synonym
AtLocation Synonym
Neveda
Baseball
Play_Baseball
HasProperty
UsedForminus1
Fun_to_Play
Why do parents encourage their kids to play baseball
A round B cheap C break window D hardE fun to play Fun_to_Play
HasProperty
Reno
Chapel
AtLocation
Neveda
PartOf
MultiGRNFigure 8 Case study on model interpretability Wepresent two sample questions from CommonsenseQAwith the reasoning paths output by MHGRN
Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding
Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs
Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains
8 Conclusion
We present a principled scalable method MHGRNthat can leverage general knowledge via multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hop
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
65 70 75Accuracy ()
K=1
K=2
K=3
K=4
K=5
Reasoning Hops
7324
7393
7465
7365
7416
Figure 6 Effect ofK in MHGRN We show IHDev ac-curacy of MHGRN on CommonsenseQA wrt hops
2 4 6 800
05
10
15
20
sba
tch
K Reasoning Hops
RGCNMHGRN
Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K
63 Model ScalabilityFig 7 presents the computation cost of MHGRNand RGCN (measured by training time) Both growlinearly wrt K Although the theoretical complex-ity of MultiRGN ism times that of RGCN the ratioof their empirical cost only approaches 2 demon-strating that our model can be better parallelized
64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on the rightprovides a case where our model leverage unmen-tioned entities to bridge the reasoning gap betweenquestion entity and answer entities in a way that iscoherent with the latent relation between CHAPEL
and the desired answer in the question
7 Related WorkKnowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017
Where is known for a multitude of wedding chapels
A town B texasC city D church building E Nevada
RenoChapel
PartOf
AtLocation
UsedForminus1 Synonym
AtLocation Synonym
Neveda
Baseball
Play_Baseball
HasProperty
UsedForminus1
Fun_to_Play
Why do parents encourage their kids to play baseball
A round B cheap C break window D hardE fun to play Fun_to_Play
HasProperty
Reno
Chapel
AtLocation
Neveda
PartOf
MultiGRNFigure 8 Case study on model interpretability Wepresent two sample questions from CommonsenseQAwith the reasoning paths output by MHGRN
Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding
Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs
Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains
8 Conclusion
We present a principled scalable method MHGRNthat can leverage general knowledge via multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hop
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
relational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability
ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor
Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR
Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics
Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics
Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457
Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics
Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR
Daniel Khashabi Tushar Khot Ashish SabharwalOyvind Tafjord Peter Clark and Hannaneh Ha-jishirzi 2020 Unifiedqa Crossing format bound-aries with a single qa system
Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet
Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics
John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann
Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics
Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942
Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743
Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings ofthe 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics
Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics
Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692
Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311
Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics
Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics
Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics
Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051
Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993
Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2019 Exploring the limitsof transfer learning with a unified text-to-text trans-former
Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976
Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics
Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer
Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press
Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics
Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008
Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet
Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics
Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on Educational
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
Advances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press
Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596
An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics
Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics
Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725
Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393
Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
A Merging Types of Relations inConceptNet
Relation Merged Relation
AtLocationAtLocation
LocatedNear
CausesCausesCausesDesire
MotivatedByGoal
AntonymAntonym
DistinctFrom
HasSubevent
HasSubevent
HasFirstSubeventHasLastSubeventHasPrerequisite
EntailsMannerOf
IsAIsAInstanceOf
DefinedAs
PartOfPartOf
HasA
RelatedToRelatedToSimilarTo
Synonym
Table 7 Relations in ConceptNet that are being mergedin pre-processing RelationX indicates the reverse re-lation of RelationX
We merge relations that are close in semanticsas well as in the general usage of triple instancesin ConceptNet
B Dataset Split Specifications
Train Dev Test
CommonsenseQA (OF) 9 741 1 221 1 140CommonsenseQA (IH) 8 500 1 221 1 241OpenbookQA 4 957 500 500
Table 8 Numbers of instances in different datasetsplits
CommonsenseQA 7 and OpenbookQA 8 all havetheir leaderboards with training and development
7httpswwwtau-nlporgcommonsenseqa8httpsleaderboardallenaiorgopen_
book_qasubmissionspublic
set publicly available As the ground truth labelsfor CommonsenseQA are not readily available formodel analysis we take 1241 examples from of-ficial training examples as our in-house test ex-amples and regard the remaining 8500 ones asour in-house training examples (CommonsenseQA(IH))
C Implementation Details
CommonsenseQA OpenbookQA
BERT-BASE 3 times 10minus5 -
BERT-LARGE 2 times 10minus5 -
ROBERTA-LARGE 1 times 10minus5
1 times 10minus5
Table 9 Learning rate for text encoders on differentdatasets
CommonsenseQA OpenbookQA
RN 3 times 10minus4
3 times 10minus4
RGCN 1 times 10minus3
1 times 10minus3
GconAttn 3 times 10minus4
1 times 10minus4
MHGRN 1 times 10minus3
1 times 10minus3
Table 10 Learning rate for graph encoders on differentdatasets
Param
RN 399KRGCN 365KGconAttn 453KMHGRN 544K
Table 11 Numbers of parameters of different graph en-coders
Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE
on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10
minus5 2times10
minus5 3times
10minus5 6 times 10
minus5 1 times 10
minus4 for each text encoderbased on the best performance on development setas listed in Table 9 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involve
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
KG the learning rate of their graph encoders arechosen from 1 times 10
minus4 3 times 10
minus4 1 times 10
minus3 3 times
10minus3 based on their best development set perfor-
mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 10 In training we set the max-imum input sequence length to text encoders to64 batch size to 32 and perform early stoppingAristoRoBERTaV7+MHGRN is the only exceptionIn order to host fair comparison we follow Aris-toRoBERTaV7 and set the batch size to 16 maxinput sequence length to 256 and choose a decoderlearning rate from 1 times 10
minus3 2 times 10
minus5For the input node features we first use tem-
plates to turn knowledge triples in ConceptNet intosentences and feed them into pre-trained BERT-LARGE obtaining a sequence of token embeddingsfrom the last layer of BERT-LARGE for each tripleFor each entity in ConceptNet we perform meanpooling over the tokens of the entityrsquos occurrencesacross all the sentences to form a 1024d vector asits corresponding node feature We use this set offeatures for all our implemented models
We use 2-layer RGCN and single-layer MHGRNacross our experiments
The numbers of parameter for each graph en-coder are listed in Table 11
D Dynamic Programming Algorithm forEq 7
To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form
Zk= (Dk)minus1 sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FXW1r1
⊤⋯W
krk
⊤
sdotWk+10
⊤⋯W
K0
⊤ (1 le k le K) (13)
where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D
k is defined as follows
Dk= diag( sum
(r1rk)isinRk
β(r1 rk s)
sdotGArk⋯Ar1FX1) (1 le k le K) (14)
Using this matrix formulation we can compute Eq7 using dynamic programming
Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t
r (r isin R 1 le t le
k)F G δ τOutput Z
1 WKlarr I
2 for k larr K minus 1 to 1 do3 W
klarrW
k+10 W
k+1
4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then
10 for r isin R do11 M
primer larr e
δ(rs)ArsumrprimeisinR e
τ(rprimers)MrprimeW
kr
⊤
12 end for13 for r isin R do14 Mr larrM
primer
15 end for16 else17 for r isin R do18 Mr larr e
δ(rs) sdotArMrWkr
⊤
19 end for20 end if21 Z
klarr GsumrisinR MrW
k
22 end for23 Replace W
tr (0 le r le m 1 le t le k) with identity
matrices and X with 1 and re-run line 1 - line 19 tocompute d
1 d
K
24 for k larr 1 to K do25 Z
klarr (diag(dk))minus1
Zk
26 end for27 return Z
1Z
2 Z
k
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)
E Formal Definition of K-hop RN
Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector
KHopRN(G W E H) =K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)
where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights
F Expressing K-hop RN with MultiGRN
Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G
Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMHRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =
[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin
R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin
Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-
able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output of
MultiGRN becomes
1
∣A∣ sumiisinA
hprimei
=1
∣A∣ (V hi + Vprimezi)
=1
K∣A∣K
sumk=1
(V hi + Vprimezki )
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(j r1 rk i)(V hi+
VprimeW
krk⋯W
1r1xj)
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ
β(⋯)(V hi + VprimeW
krk
⋯W1r1Uφ(j)hj + V
primeW
krk⋯W
1r1bφ(j))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)(W3hi + W1hj
+ W2(er1 ⋯ erk))
=
K
sumk=1
sum(jr1rki)isinΦk
jisinQ iisinA
β(⋯)W (hjoplus
(er1 ⋯ erk)oplus hi)= RN(G W E H)
(16)