Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Zihao Zhu1,2*, Jing Yu1,2*†, Yujing Wang3, Yajing Sun1,2, Yue Hu1,2 and Qi Wu4

1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

3 Microsoft Research Asia, Beijing, China
4 University of Adelaide, Australia

{zhuzihao, yujing02, sunyajing, huyue}@iie.ac.cn, [email protected], [email protected]

Abstract

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond visible content to answer questions about an image, which is challenging but indispensable to achieve general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning about the final answer. How to capture the question-oriented and information-complementary evidence remains a key challenge in solving the problem. In this paper, we depict an image by a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to visual, semantic and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question. Specifically, intra-modal graph convolution selects evidence from each modality and cross-modal graph convolution aggregates relevant information across different modalities. By stacking this process multiple times, our model performs iterative reasoning and predicts the optimal answer by analyzing all question-oriented evidence. We achieve a new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments. The code is available at https://github.com/astro-zihao/mucko.

1 Introduction

Visual question answering (VQA) [Antol et al., 2015] is an attractive research direction aiming to jointly analyze multi-modal content from images and natural language. Equipped with the capacities of grounding, reasoning and translating, a VQA agent is expected to answer a question in natural language based on an image.

* Equal contribution.
† Corresponding author.

[Figure 1 shows an input image with the question "What is the red cylinder object in the image is used for?", dense captions, the visual, semantic and fact graph layers, the supporting fact <fire hydrant, UsedFor, firefighting> and the answer "firefighting".]

Figure 1: An illustration of our motivation. We represent an image by multi-layer graphs and cross-modal knowledge reasoning is conducted on the graphs to infer the optimal answer.

Recent works [Cadene et al., 2019; Li et al., 2019b; Ben-Younes et al., 2019] have achieved great success on VQA problems that are answerable by solely referring to the visible content of the image. However, such models are incapable of answering questions which require external knowledge beyond what is in the image. Considering the question in Figure 1, the agent not only needs to visually localize ‘the red cylinder’, but also to semantically recognize it as ‘fire hydrant’ and connect the knowledge that ‘fire hydrant is used for firefighting’. Therefore, how to collect question-oriented and information-complementary evidence from the visual, semantic and knowledge perspectives is essential to achieve general VQA.

To advocate research in this direction, [Wang et al., 2018] introduces the ‘Fact-based’ VQA (FVQA) task for answering questions by joint analysis of the image and a knowledge base of facts. The typical solutions for FVQA build a fact graph with fact triplets filtered by the visual concepts in the image and select one entity in the graph as the answer. Existing works [Wang et al., 2017; Wang et al., 2018] parse the question into keywords and retrieve the supporting entity only by keyword matching. This kind of approach is vulnerable when the question does not exactly mention the visual concepts (e.g. synonyms and homographs) or the mentioned information is not captured in the fact graph (e.g. the visual attribute ‘red’ in Figure 1 may be falsely omitted).


To resolve these problems, [Narasimhan et al., 2018] introduces visual information into the fact graph and infers the answer by implicit graph reasoning under the guidance of the question. However, they provide the whole visual information equally to each graph node by concatenating the image, question and entity embeddings. Actually, only part of the visual content is relevant to the question and a certain entity. Moreover, the fact graph here is still homogeneous since each node is represented by a fixed form of image-question-entity embedding, which limits the model's flexibility to adaptively capture evidence from different modalities.

In this work, we depict an image as a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to different modalities. The proposed model focuses on Multi-Layer Cross-Modal Knowledge Reasoning and we name it Mucko for short. Specifically, we encode an image by three layers of graphs: the object appearance and object relationships are kept in the visual layer, the high-level abstraction that bridges the gap between visual and factual information is provided in the semantic layer, and the corresponding knowledge of facts is provided in the fact layer. We propose a modality-aware heterogeneous graph convolutional network to adaptively collect complementary evidence from the multi-layer graphs. It is performed by two procedures. First, the Intra-Modal Knowledge Selection procedure collects question-oriented information from each graph layer under the guidance of the question; then, the Cross-Modal Knowledge Reasoning procedure captures complementary evidence across different layers.

The main contributions of this paper are summarized as follows: (1) We comprehensively depict an image by a heterogeneous graph containing multiple layers of information based on the visual, semantic and knowledge modalities. We consider these three modalities jointly and achieve significant improvement over state-of-the-art solutions. (2) We propose a modality-aware heterogeneous graph convolutional network to capture question-oriented evidence from different modalities. In particular, we leverage an attention operation in each convolution layer to select the evidence most relevant to the given question, while the convolution operation is responsible for adaptive feature aggregation. (3) We demonstrate good interpretability of our approach and provide case studies with in-depth analysis. Our model automatically tells which modality (visual, semantic or factual) and which entities contribute most to answering the question through visualization of attention weights and gate values.

2 Related Work

Visual Question Answering. The typical solutions for VQA are based on the CNN-RNN architecture [Malinowski et al., 2015] and leverage global visual features to represent the image, which may introduce noisy information. Various attention mechanisms [Yang et al., 2016; Lu et al., 2016; Anderson et al., 2018] have been exploited to highlight visual objects that are relevant to the question. However, they treat objects independently and ignore their informative relationships. [Battaglia et al., 2018] demonstrates that the human ability of combinatorial generalization highly depends on mechanisms for reasoning over relationships. Consistent with this proposal, there is an emerging trend to represent the image by a graph structure that depicts objects and relationships in VQA and other vision-language tasks [Hu et al., 2019b; Wang et al., 2019a; Li et al., 2019b]. As an extension, [Jiang et al., 2020] exploits natural language to enrich graph-based visual representations. However, it solely captures the semantics in natural language by an LSTM, which lacks fine-grained correlations with the visual information. To go one step further, we depict an image by multiple layers of graphs from the visual, semantic and factual perspectives to collect fine-grained evidence from different modalities.

Fact-based Visual Question Answering. Humans can easily combine visual observation with external knowledge to answer questions, which remains challenging for algorithms. [Wang et al., 2018] introduces a fact-based VQA task, which provides a knowledge base of facts and associates each question with a supporting fact. Recent works on FVQA generally select one entity from the fact graph as the answer and fall into two categories: query-mapping based methods and learning based methods. [Wang et al., 2017] reduces the question to one of the available query templates, which limits the types of questions that can be asked. [Wang et al., 2018] automatically classifies and maps the question to a query, which does not suffer from the above constraint. In both methods, however, visual information is used to extract facts but is not introduced during the reasoning process. [Narasimhan et al., 2018] applies a GCN on the fact graph where each node is represented by a fixed form of image-question-entity embedding. However, the visual information is provided as a whole, which may introduce redundant information for prediction. In this paper, we depict an image by multi-layer graphs and perform cross-modal heterogeneous graph reasoning on them to capture the complementary evidence from different layers that is most relevant to the question.

Heterogeneous Graph Neural Networks. Graph neural networks have gained fast momentum in the last few years [Wu et al., 2019]. Compared with homogeneous graphs, heterogeneous graphs are more common in the real world. [Schlichtkrull et al., 2018] generalizes the graph convolutional network (GCN) to handle different relationships between entities in a knowledge base, where edges with distinct relationships are encoded independently. [Wang et al., 2019b; Hu et al., 2019a] propose heterogeneous graph attention networks with a dual-level attention mechanism. All of these methods model different types of nodes and edges on a unified graph. In contrast, the heterogeneous graph in this work contains multiple layers of subgraphs, and each layer consists of nodes and edges coming from a different modality. For this specific constraint, we propose intra-modal and cross-modal graph convolutions for reasoning over such multi-modal heterogeneous graphs.

3 Methodology

Given an image $I$ and a question $Q$, the task aims to predict an answer $A$ while leveraging an external knowledge base, which consists of facts in the form of triplets, i.e. $\langle e_1, r, e_2 \rangle$, where $e_1$ is a visual concept in the image, $e_2$ is an attribute or phrase, and $r$ represents the relationship between $e_1$ and $e_2$.


Figure 2: An overview of our model. The model contains two modules: Multi-Modal Heterogeneous Graph Construction aims to depict an image by multiple layers of graphs, and Cross-Modal Heterogeneous Graph Reasoning supports intra-modal and cross-modal evidence selection.

The key is to choose a correct entity, i.e. either $e_1$ or $e_2$, from the supporting fact as the predicted answer. We first introduce a novel scheme that depicts an image by three layers of graphs, i.e. the visual graph, the semantic graph and the fact graph, imitating the understanding of the various properties of an object and its relationships. Then we perform cross-modal heterogeneous graph reasoning, which consists of two parts: Intra-Modal Knowledge Selection chooses question-oriented knowledge from each layer of graphs by intra-modal graph convolutions, and Cross-Modal Knowledge Reasoning adaptively selects complementary evidence across the three layers of graphs by cross-modal graph convolutions. By stacking the above two processes multiple times, our model performs iterative reasoning across all the modalities and derives the optimal answer by jointly analyzing all the entities. Figure 2 gives a detailed illustration of our model.

3.1 Multi-Modal Graph Construction

Visual Graph Construction. Since most of the questions in FVQA are grounded in the visual objects and their relationships, we construct a fully-connected visual graph to represent such evidence at the appearance level. Given an image $I$, we use Faster R-CNN [Ren et al., 2017] to identify a set of objects $O = \{o_i\}_{i=1}^{K}$ ($K = 36$), where each object $o_i$ is associated with a visual feature vector $v_i \in \mathbb{R}^{d_v}$ ($d_v = 2048$), a spatial feature vector $b_i \in \mathbb{R}^{d_b}$ ($d_b = 4$) and a corresponding label. Specifically, $b_i = [x_i, y_i, w_i, h_i]$, where $(x_i, y_i)$, $h_i$ and $w_i$ respectively denote the coordinate of the top-left corner, the height and the width of the bounding box. We construct a visual graph $\mathcal{G}^V = (\mathcal{V}^V, \mathcal{E}^V)$ over $O$, where $\mathcal{V}^V = \{v_i^V\}_{i=1}^{K}$ is the node set and each node $v_i^V$ corresponds to a detected object $o_i$. The feature of node $v_i^V$ is represented by $v_i^V$. Each edge $e_{ij}^V \in \mathcal{E}^V$ denotes the relative spatial relationship between two objects. We encode the edge feature by a 5-dimensional vector, i.e. $r_{ij}^V = [\frac{x_j - x_i}{w_i}, \frac{y_j - y_i}{h_i}, \frac{w_j}{w_i}, \frac{h_j}{h_i}, \frac{w_j h_j}{w_i h_i}]$.
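To make the edge encoding concrete, here is a minimal sketch (PyTorch-style Python; the function name and the plain `(x, y, w, h)` tuple format are our own illustrative choices, not taken from the released code) that computes the 5-dimensional relative spatial feature for a pair of detected boxes.

```python
import torch

def spatial_edge_feature(box_i, box_j):
    """Relative spatial encoding r^V_ij between two detected objects.

    Each box is (x, y, w, h) with (x, y) the top-left corner, matching the
    spatial feature b_i = [x_i, y_i, w_i, h_i] defined above.
    """
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return torch.tensor([
        (xj - xi) / wi,         # horizontal offset, normalized by the width of box i
        (yj - yi) / hi,         # vertical offset, normalized by the height of box i
        wj / wi,                # relative width
        hj / hi,                # relative height
        (wj * hj) / (wi * hi),  # relative area
    ])

# Example: a fully-connected visual graph stores one such vector per ordered pair.
# edge = spatial_edge_feature((20.0, 30.0, 50.0, 80.0), (90.0, 40.0, 40.0, 60.0))
```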

Semantic Graph Construction. In addition to visual information, high-level abstraction of the objects and relationships in natural language provides essential semantic information. Such abstraction is indispensable for associating the visual objects in the image with the concepts mentioned in both questions and facts. In our work, we leverage dense captions [Johnson et al., 2016] to extract a set of local-level semantics from an image, ranging from the properties of a single object (color, shape, emotion, etc.) to the relationships between objects (action, spatial position, comparison, etc.). We depict an image by $D$ dense captions, denoted as $Z = \{z_i\}_{i=1}^{D}$, where $z_i$ is a natural language description of a local region in the image. Instead of using monolithic embeddings to represent the captions, we model them by a graph-based semantic representation, denoted as $\mathcal{G}^S = (\mathcal{V}^S, \mathcal{E}^S)$, which is constructed by a semantic graph parsing model [Anderson et al., 2016]. A node $v_i^S \in \mathcal{V}^S$ represents the name or attribute of an object extracted from the captions, while an edge $e_{ij}^S \in \mathcal{E}^S$ represents the relationship between $v_i^S$ and $v_j^S$. We use averaged GloVe embeddings [Pennington et al., 2014] to represent $v_i^S$ and $e_{ij}^S$, denoted as $v_i^S$ and $r_{ij}^S$, respectively. The graph representation retains the relational information among concepts and unifies the representations in the graph domain, which is better for explicit reasoning across modalities.

Fact Graph Construction. To find the optimal supporting fact, we first retrieve relevant candidate facts from the knowledge base of facts following the score-based approach proposed in [Narasimhan et al., 2018]. We compute the cosine similarity of the embedding of every word in a fact with the words in the question and the words of the visual concepts detected in the image, and then average these values to assign a similarity score to the fact. The facts are sorted by similarity and the 100 highest-scoring facts are retained, denoted as $f_{100}$. A relation type classifier is additionally trained to further filter the retrieved facts. Specifically, we feed the last hidden state of an LSTM to an MLP layer to predict the relation type $r_i$ of a question. We retain the facts in $f_{100}$ only if their relationships agree with $r_i$, i.e. $f_{rel} = \{f \in f_{100} : r(f) \in \{r_i\}\}$

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

1099

Page 4: Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for

($\{r_i\}$ contains the top-3 predicted relationships in our experiments). Then a fact graph $\mathcal{G}^F = (\mathcal{V}^F, \mathcal{E}^F)$ is built upon $f_{rel}$, since the candidate facts can be naturally organized in a graphical structure. Each node $v_i^F \in \mathcal{V}^F$ denotes an entity in $f_{rel}$ and is represented by the GloVe embedding of the entity, denoted as $v_i^F$. Each edge $e_{ij}^F \in \mathcal{E}^F$ denotes the relationship between $v_i^F$ and $v_j^F$ and is represented by its GloVe embedding $r_{ij}$. The topological structure among facts can be effectively exploited by jointly considering all the entities in the fact graph.
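The fact-retrieval scoring above can be sketched as follows. This only illustrates the averaged cosine-similarity scheme of [Narasimhan et al., 2018] as summarized in the text; the data structures (a fact as a dict with a "words" field) and the helper names are hypothetical.

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def score_fact(fact_words, context_words, glove):
    """Average cosine similarity between every (fact word, context word) pair,
    where the context words are the question words plus the detected visual
    concepts. Out-of-vocabulary words are skipped."""
    sims = [cosine(glove[fw], glove[cw])
            for fw in fact_words if fw in glove
            for cw in context_words if cw in glove]
    return float(np.mean(sims)) if sims else 0.0

def retrieve_candidates(facts, question_words, concept_words, glove, k=100):
    """Keep the k highest-scoring facts (f_100 in the text for k = 100)."""
    context = question_words + concept_words
    ranked = sorted(facts, key=lambda f: score_fact(f["words"], context, glove),
                    reverse=True)
    return ranked[:k]
```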

3.2 Intra-Modal Knowledge Selection

Since each layer of graphs contains modality-specific knowledge relevant to the question, we first select valuable evidence independently from the visual graph, semantic graph and fact graph by Visual-to-Visual Convolution, Semantic-to-Semantic Convolution and Fact-to-Fact Convolution, respectively. These three convolutions share the same operations but differ in the node and edge representations corresponding to their graph layers, so we omit the superscript of the node representation $v$ and edge representation $r$ in the rest of this section. We first perform attention operations to highlight the nodes and edges that are most relevant to the question $q$ and consequently update the node representations via intra-modal graph convolution. This process consists of the following three steps:

Question-guided Node Attention. We first evaluate the relevance of each node to the question by an attention mechanism. The attention weight for $v_i$ is computed as:

$\alpha_i = \mathrm{softmax}(w_a^{\top}\tanh(W_1 v_i + W_2 q))$   (1)

where $W_1$, $W_2$ and $w_a$ (as well as $W_3, \dots, W_{11}$, $w_b$, $w_c$ mentioned below) are learned parameters, and $q$ is the question embedding encoded by an LSTM.

Question-guided Edge Attention. Under the guidance of the question, we then evaluate the importance of edge $e_{ji}$, constrained by the neighbor node $v_j$, with respect to $v_i$ as:

$\beta_{ji} = \mathrm{softmax}(w_b^{\top}\tanh(W_3 v'_j + W_4 q'))$   (2)

where $v'_j = W_5[v_j, r_{ji}]$, $q' = W_6[v_i, q]$ and $[\cdot,\cdot]$ denotes the concatenation operation.

Intra-Modal Graph Convolution. Given the node and edge attention weights learned in Eq. 1 and Eq. 2, the node representations of each layer of graphs are updated following the message-passing framework [Gilmer et al., 2017]. We gather the neighborhood information and update the representation of $v_i$ as:

$m_i = \sum_{j \in \mathcal{N}_i} \beta_{ji} v'_j$   (3)

$v_i = \mathrm{ReLU}(W_7[m_i, \alpha_i v_i])$   (4)

where $\mathcal{N}_i$ is the neighborhood set of node $v_i$.

We conduct the above intra-modal knowledge selection on $\mathcal{G}^V$, $\mathcal{G}^S$ and $\mathcal{G}^F$ independently and obtain the updated node representations, denoted as $\{v_i^V\}_{i=1}^{N^V}$, $\{v_i^S\}_{i=1}^{N^S}$ and $\{v_i^F\}_{i=1}^{N^F}$ accordingly.
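A compact PyTorch sketch of one intra-modal knowledge-selection step (Eqs. 1-4) on a single fully-connected graph layer is given below. The module name, hidden sizes and the dense `[N, N]` edge-tensor layout are our own assumptions for illustration; they are not taken from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalConv(nn.Module):
    """One intra-modal knowledge-selection step (Eqs. 1-4) for one graph layer."""

    def __init__(self, d_node, d_edge, d_q, d_hid):
        super().__init__()
        self.W1 = nn.Linear(d_node, d_hid)
        self.W2 = nn.Linear(d_q, d_hid)
        self.w_a = nn.Linear(d_hid, 1, bias=False)
        self.W5 = nn.Linear(d_node + d_edge, d_hid)   # v'_j = W5[v_j, r_ji]
        self.W6 = nn.Linear(d_node + d_q, d_hid)      # q'   = W6[v_i, q]
        self.W3 = nn.Linear(d_hid, d_hid)
        self.W4 = nn.Linear(d_hid, d_hid)
        self.w_b = nn.Linear(d_hid, 1, bias=False)
        self.W7 = nn.Linear(d_hid + d_node, d_node)

    def forward(self, v, r, q):
        # v: [N, d_node] node features, r: [N, N, d_edge] with r[i, j] the edge
        # feature from node j to node i, q: [d_q] question embedding (LSTM).
        N = v.size(0)
        # Eq. 1: question-guided node attention.
        alpha = torch.softmax(
            self.w_a(torch.tanh(self.W1(v) + self.W2(q))).squeeze(-1), dim=0)        # [N]
        # Eq. 2: question-guided edge attention for every ordered pair (j -> i).
        v_prime = self.W5(torch.cat([v.unsqueeze(0).expand(N, N, -1), r], dim=-1))   # [N, N, d_hid]
        q_prime = self.W6(torch.cat([v, q.expand(N, -1)], dim=-1))                   # [N, d_hid]
        scores = self.w_b(
            torch.tanh(self.W3(v_prime) + self.W4(q_prime).unsqueeze(1))).squeeze(-1)  # [N, N]
        beta = torch.softmax(scores, dim=1)
        # Eqs. 3-4: message passing over neighbors, then node update.
        m = torch.bmm(beta.unsqueeze(1), v_prime).squeeze(1)                          # [N, d_hid]
        return F.relu(self.W7(torch.cat([m, alpha.unsqueeze(-1) * v], dim=-1)))       # [N, d_node]
```

The output keeps the input node dimensionality, so the same module can be applied to the visual, semantic and fact layers and stacked across reasoning steps.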

3.3 Cross-Modal Knowledge Reasoning

To answer the question correctly, we fully consider the complementary evidence from visual, semantic and factual information. Since the answer comes from one entity in the fact graph, we gather complementary information from the visual graph and the semantic graph into the fact graph by cross-modal convolutions, including visual-to-fact convolution and semantic-to-fact convolution. Finally, a fact-to-fact aggregation is performed on the fact graph to reason over all the entities and form a global decision.

Visual-to-Fact Convolution. For an entity $v_i^F$ in the fact graph, the attention value of each node $v_j^V$ in the visual graph w.r.t. $v_i^F$ is calculated under the guidance of the question:

$\gamma_{ji}^{V\text{-}F} = \mathrm{softmax}(w_c^{\top}\tanh(W_8 v_j^V + W_9[v_i^F, q]))$   (5)

The complementary information $m_i^{V\text{-}F}$ from the visual graph for $v_i^F$ is computed as:

$m_i^{V\text{-}F} = \sum_{j \in \mathcal{N}^V} \gamma_{ji}^{V\text{-}F} v_j^V$   (6)

Semantic-to-Fact Convolution. The complementary information $m_i^{S\text{-}F}$ from the semantic graph is computed in the same way as in Eq. 5 and Eq. 6.

Then we fuse the complementary knowledge for $v_i^F$ from the three layers of graphs via a gate operation:

$gate_i = \sigma(W_{10}[m_i^{V\text{-}F}, m_i^{S\text{-}F}, v_i^F])$   (7)

$v_i^F = W_{11}(gate_i \circ [m_i^{V\text{-}F}, m_i^{S\text{-}F}, v_i^F])$   (8)

where $\sigma$ is the sigmoid function and "$\circ$" denotes element-wise product.

Fact-to-Fact Aggregation. Given the set of candidate entities in the fact graph $\mathcal{G}^F$, we aim to globally compare all the entities and select the optimal one as the answer. At this point, the representation of each entity in the fact graph has gathered question-oriented information from the three modalities. To jointly evaluate the possibility of each entity, we perform an attention-based graph convolutional network, similar to the Fact-to-Fact Convolution introduced in Section 3.2, to aggregate information in the fact graph and obtain the transformed entity representations.
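The cross-modal convolutions (Eqs. 5-6) and the gated fusion (Eqs. 7-8) can be sketched in the same PyTorch style; the class names, hidden size and batching layout are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class CrossModalConv(nn.Module):
    """Cross-modal convolution from a source layer (visual or semantic) to the
    fact layer (Eqs. 5-6)."""

    def __init__(self, d_src, d_fact, d_q, d_hid):
        super().__init__()
        self.W8 = nn.Linear(d_src, d_hid)
        self.W9 = nn.Linear(d_fact + d_q, d_hid)
        self.w_c = nn.Linear(d_hid, 1, bias=False)

    def forward(self, v_src, v_fact, q):
        # v_src: [Ns, d_src], v_fact: [Nf, d_fact], q: [d_q]
        guide = self.W9(torch.cat([v_fact, q.expand(v_fact.size(0), -1)], dim=-1))  # [Nf, d_hid]
        scores = self.w_c(
            torch.tanh(self.W8(v_src).unsqueeze(0) + guide.unsqueeze(1))).squeeze(-1)  # [Nf, Ns]
        gamma = torch.softmax(scores, dim=1)   # attention over source nodes per fact entity
        return gamma @ v_src                   # [Nf, d_src], complementary message m_i

class GatedFusion(nn.Module):
    """Gated fusion of visual, semantic and factual evidence (Eqs. 7-8)."""

    def __init__(self, d_vis, d_sem, d_fact):
        super().__init__()
        d_cat = d_vis + d_sem + d_fact
        self.W10 = nn.Linear(d_cat, d_cat)
        self.W11 = nn.Linear(d_cat, d_fact)

    def forward(self, m_vf, m_sf, v_fact):
        cat = torch.cat([m_vf, m_sf, v_fact], dim=-1)
        gate = torch.sigmoid(self.W10(cat))
        return self.W11(gate * cat)            # updated fact-entity features
```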

We iteratively perform intra-modal knowledge selection and cross-modal knowledge reasoning in multiple steps to obtain the final entity representations. After $T$ steps, each entity representation $v_i^{F(T)}$ captures the structural information within its $T$-hop neighborhood across the three layers.

3.4 Learning

The concatenation of the entity representation $v_i^{F(T)}$ and the question embedding $q$ is passed to a binary classifier to predict its probability of being the answer, i.e. $\hat{y}_i = p_\theta([v_i^{F(T)}, q])$. We apply the binary cross-entropy loss in the training process:

$\mathcal{L} = -\sum_{i \in \mathcal{N}^F} [\, a \cdot y_i \ln \hat{y}_i + b \cdot (1 - y_i)\ln(1 - \hat{y}_i)\,]$   (9)

where $y_i$ is the ground-truth label for $v_i^F$, and $a$, $b$ are the loss weights for positive and negative samples, respectively. The entity with the largest predicted probability is selected as the final answer.
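A minimal sketch of the weighted binary cross-entropy in Eq. 9, with the positive/negative weights exposed as arguments (the default values 0.7 / 0.3 come from the implementation details in Section 4):

```python
import torch

def weighted_bce(y_hat, y, a=0.7, b=0.3, eps=1e-8):
    """Weighted binary cross-entropy over the fact-graph entities (Eq. 9).

    y_hat: predicted probabilities p_theta([v_i^{F(T)}, q]) for each entity,
    y:     0/1 ground-truth labels (1 for the answer entity),
    a, b:  weights for positive and negative samples.
    """
    return -(a * y * torch.log(y_hat + eps)
             + b * (1.0 - y) * torch.log(1.0 - y_hat + eps)).sum()

# Example: loss = weighted_bce(torch.sigmoid(logits), labels.float())
```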


Method                              top-1    top-3
LSTM-Question+Image+Pre-VQA         24.98    40.40
Hie-Question+Image+Pre-VQA          43.14    59.44
FVQA (top-3-QQmaping)               56.91    64.65
FVQA (Ensemble)                     58.76    -
Straight to the Facts (STTF)        62.20    75.60
Reading Comprehension               62.96    70.08
Out of the Box (OB)                 69.35    80.25
Human                               77.99    -
Mucko                               73.06    85.94

Table 1: State-of-the-art comparison on the FVQA dataset (overall top-1 / top-3 accuracy, %).

Method                                    top-1    top-3
Mucko (full model)                        73.06    85.94
1  w/o Semantic Graph                     71.28    82.76
2  w/o Visual Graph                       69.12    78.05
3  w/o Semantic Graph & Visual Graph      20.43    29.10
4  S-to-F Concat.                         67.82    76.65
5  V-to-F Concat.                         69.93    80.12
6  V-to-F Concat. & S-to-F Concat.        70.68    82.04
7  w/o relationships                      72.10    83.75

Table 2: Ablation study of key components of Mucko (overall top-1 / top-3 accuracy, %).

4 Experiments

Dataset. We evaluate Mucko on the FVQA dataset [Wang et al., 2018]. It consists of 2,190 images, 5,286 questions and a knowledge base of 193,449 facts. The facts are constructed by extracting the top visual concepts in the dataset and querying these concepts in WebChild, ConceptNet and DBpedia.

Evaluation Metrics. We follow the metrics in [Wang et al., 2018] to evaluate performance. The top-1 and top-3 accuracy is calculated for each method. The accuracy averaged over the 5 test splits is reported as the overall accuracy.

Implementation Details. We select the top-10 dense captions according to their confidence. The maximum sentence length of the dense captions and the questions is set to 20. The hidden state size of all LSTM blocks is set to 512. We set $a = 0.7$ and $b = 0.3$ in the binary cross-entropy loss. Our model is trained by the Adam optimizer for 20 epochs, where the mini-batch size is 64 and the dropout ratio is 0.5. A warm-up strategy is applied for 2 epochs with initial learning rate $1 \times 10^{-3}$ and warm-up factor 0.2. Then we use a cosine annealing learning rate schedule with initial learning rate $\eta_{max} = 1 \times 10^{-3}$ and termination learning rate $\eta_{min} = 3.6 \times 10^{-4}$ for the remaining epochs.
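The learning-rate schedule described above (2 warm-up epochs, then cosine annealing from $\eta_{max} = 1\times10^{-3}$ to $\eta_{min} = 3.6\times10^{-4}$) can be sketched per epoch as follows. The linear interpolation used during warm-up is our assumption; the paper only specifies the warm-up factor, the two endpoints and the epoch counts.

```python
import math

def learning_rate(epoch, total_epochs=20, warmup_epochs=2,
                  eta_max=1e-3, eta_min=3.6e-4, warmup_factor=0.2):
    """Warm-up from warmup_factor * eta_max to eta_max over the first epochs,
    then cosine annealing down to eta_min for the remaining epochs."""
    if epoch < warmup_epochs:
        t = epoch / max(1, warmup_epochs)
        return eta_max * (warmup_factor + (1.0 - warmup_factor) * t)
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t))
```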

4.1 Comparison with State-of-the-Art Methods

Table 1 compares Mucko with state-of-the-art models, including CNN-RNN based approaches [Wang et al., 2018], i.e. LSTM-Question+Image+Pre-VQA and Hie-Question+Image+Pre-VQA; semantic parsing based approaches [Wang et al., 2018], i.e. FVQA (top-3-QQmaping) and FVQA (Ensemble); learning-based approaches, i.e. Straight to the Facts (STTF) [Narasimhan and Schwing, 2018] and Out of the Box (OB) [Narasimhan et al., 2018]; and a Reading Comprehension based approach [Li et al., 2019a]. Our model consistently outperforms all of these approaches on all metrics, achieving a 3.71% boost in top-1 accuracy and a 5.69% boost in top-3 accuracy over the previous state-of-the-art model. The model OB is most closely related to Mucko in that it leverages graph convolutional networks to jointly assess all the entities in the fact graph. However, it introduces the global image features equally to all entities without selection. By collecting question-oriented visual and semantic information via modality-aware heterogeneous graph convolutional networks, our model gains a remarkable improvement.

4.2 Ablation Study

Table 2 shows ablation results that verify the contribution of each component of our model. (1) In models ‘1-3’, we evaluate the influence of each graph layer on the performance. We observe that the top-1 accuracy of ‘1’ and ‘2’ respectively decreases by 1.78% and 3.94% compared with the full model, which indicates that both the semantic and visual graphs provide valuable evidence for answer inference; the visual information has a greater impact than the semantic part. When both the semantic and visual graphs are removed, ‘3’ results in a significant decrease. (2) In models ‘4-6’, we assess the effectiveness of the proposed cross-modal graph convolutions. ‘4’, ‘5’ and ‘6’ respectively replace the ‘Semantic-to-Fact Conv.’ in ‘2’, the ‘Visual-to-Fact Conv.’ in ‘1’ and both convolutions in the full model by concatenation, i.e. concatenating the mean pooling of all the semantic/visual node features with each entity feature. The performance decreases when replacing the convolution from either S-to-F or V-to-F, or both simultaneously, which proves the benefit of cross-modal convolution in gathering complementary evidence from different modalities. (3) We evaluate the influence of the relationships in the heterogeneous graph. We omit the relational features $r_{ij}$ in all three layers in ‘7’ and the performance decreases by nearly 1% in top-1 accuracy. This proves the benefit of relational information, though it is less influential than the modality information.

4.3 Interpretability

Our model is interpretable: we can visualize the attention weights and gate values in the reasoning process. From the case study in Figure 3, we draw the following three insights: (1) Mucko is capable of revealing the knowledge selection mode. The first two examples indicate that Mucko captures the most relevant visual, semantic and factual evidence as well as complementary information across the three modalities. In most cases, factual knowledge provides the predominant clues according to the gate values, because FVQA relies on external knowledge to a great extent. Furthermore, more evidence comes from the semantic modality when the question involves complex relationships. For instance, the second question, involving the relationship between ‘hand’ and ‘white round thing’, needs more semantic clues. (2) Mucko has advantages over the state-of-the-art model. The third example compares the answer predicted by OB with that of Mucko. Mucko collects relevant visual and semantic evidence to make each entity discriminative enough for predicting the correct answer, while OB fails to distinguish the representations of ‘laptop’ and ‘keyboard’ without feature selection. (3) Mucko fails when multiple answers are reasonable for the same question. Since both ‘wedding’ and ‘party’ may have cakes, the predicted answer ‘party’ in the last example is reasonable from a human judgement.


[Figure 3 presents four cases, including the questions "Which device in the image can free peoples hand?" (predicted/GT answer: dishwasher), "What is the white round thing held by hand in the image used for?" (predicted/GT answer: hold food), "Which part of the machine in the image can be used for typing?" (Mucko: keyboard, OB: laptop) and "Where can you find the right object on the table shown in the image?" (GT answer: wedding, predicted: party).]

Figure 3: Visualization for Mucko. The visual graph highlights the most relevant subject (red box) according to the attention weights of each object ($\alpha^V$ in Eq. 1) and the objects (orange boxes) with the top-2 attended relationships ($\beta^V$ in Eq. 2). The fact graph shows the predicted entity (center node) and its top-4 attended neighbors ($\alpha^F$ in Eq. 1). The semantic graph shows the most relevant concept (center node) and up to its top-4 attended neighbors ($\alpha^S$ in Eq. 1). Each edge is marked with its attention value ($\beta^{F/S}$ in Eq. 2). Dashed lines represent the visual-to-fact (orange) and semantic-to-fact (blue) convolution weights of the predicted entity ($\gamma^{V\text{-}F}$, $\gamma^{S\text{-}F}$ in Eq. 5). The thermogram on the top visualizes the gate values ($gate_i$ in Eq. 7) of the visual embedding (left), entity embedding (middle) and semantic embedding (right).

#Retrieved facts          @50      @100     @150     @200
Rel@1 (top-1 accuracy)    55.56    70.62    65.94    59.77
Rel@1 (top-3 accuracy)    64.09    81.95    73.41    66.32
Rel@3 (top-1 accuracy)    58.93    73.06    70.12    65.93
Rel@3 (top-3 accuracy)    68.50    85.94    81.43    74.87

Table 3: Overall accuracy with different numbers of retrieved candidate facts and different numbers of relation types.

#Steps              1        2        3
top-1 accuracy      62.05    73.06    70.43
top-3 accuracy      71.87    85.94    81.32

Table 4: Overall accuracy with different numbers of reasoning steps.


4.4 Parameter Analysis

In Table 3, we vary the number of retrieved candidate facts and the number of relation types used for candidate filtering. We achieve the highest downstream accuracy with the top-100 candidate facts and the top-3 relation types. In Table 4, we evaluate the influence of the number of reasoning steps $T$ and find that two reasoning steps achieve the best performance. We use these settings in our full model.

5 Conclusion

In this paper, we propose Mucko for visual question answering that requires external knowledge, focusing on multi-layer cross-modal knowledge reasoning. We depict an image by a novel heterogeneous graph with multiple layers of information corresponding to the visual, semantic and factual modalities, and we propose a modality-aware heterogeneous graph convolutional network to select and gather intra-modal and cross-modal evidence iteratively. Our model outperforms state-of-the-art approaches remarkably and obtains interpretable results on the benchmark dataset.

Acknowledgements

This work is supported by the National Key Research and Development Program (Grant No. 2017YFB0803301).


References

[Anderson et al., 2016] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, pages 382-398, 2016.

[Anderson et al., 2018] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6319-6328, 2018.

[Antol et al., 2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425-2433, 2015.

[Battaglia et al., 2018] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[Ben-Younes et al., 2019] Hedi Ben-Younes, Remi Cadene, Nicolas Thome, and Matthieu Cord. BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In AAAI, pages 8102-8109, 2019.

[Cadene et al., 2019] Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. MUREL: Multimodal relational reasoning for visual question answering. In CVPR, pages 1989-1998, 2019.

[Gilmer et al., 2017] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263-1272, 2017.

[Hu et al., 2019a] Linmei Hu, Tianchi Yang, Chuan Shi, Houye Ji, and Xiaoli Li. Heterogeneous graph attention networks for semi-supervised short text classification. In EMNLP, pages 4823-4832, 2019.

[Hu et al., 2019b] Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned graph networks for relational reasoning. In ICCV, pages 10294-10303, 2019.

[Jiang et al., 2020] Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, and Qi Wu. DualVD: An adaptive dual encoding model for deep visual understanding in visual dialogue. In AAAI, 2020.

[Johnson et al., 2016] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, pages 4565-4574, 2016.

[Li et al., 2019a] Hui Li, Peng Wang, Chunhua Shen, and Anton van den Hengel. Visual question answering as reading comprehension. In CVPR, pages 6319-6328, 2019.

[Li et al., 2019b] Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. Relation-aware graph attention network for visual question answering. In ICCV, pages 10313-10322, 2019.

[Lu et al., 2016] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NeurIPS, pages 289-297, 2016.

[Malinowski et al., 2015] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, pages 1-9, 2015.

[Narasimhan and Schwing, 2018] Medhini Narasimhan and Alexander G. Schwing. Straight to the facts: Learning knowledge base retrieval for factual visual question answering. In ECCV, pages 451-468, 2018.

[Narasimhan et al., 2018] Medhini Narasimhan, Svetlana Lazebnik, and Alexander Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. In NeurIPS, pages 2654-2665, 2018.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543, 2014.

[Ren et al., 2017] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 39(6):1137-1149, 2017.

[Schlichtkrull et al., 2018] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In ESWC, pages 593-607, 2018.

[Wang et al., 2017] Peng Wang, Qi Wu, Chunhua Shen, Anthony R. Dick, and Anton van den Hengel. Explicit knowledge-based reasoning for visual question answering. In IJCAI, pages 1290-1296, 2017.

[Wang et al., 2018] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. FVQA: Fact-based visual question answering. TPAMI, 40(10):2413-2427, 2018.

[Wang et al., 2019a] Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR, pages 1960-1968, 2019.

[Wang et al., 2019b] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S. Yu. Heterogeneous graph attention network. In WWW, pages 2022-2032, 2019.

[Wu et al., 2019] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

[Yang et al., 2016] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, pages 21-29, 2016.
