
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Huijuan Xu and Kate Saenko
Department of Computer Science, UMass Lowell, USA

hxu1@cs.uml.edu, saenko@cs.uml.edu

Abstract

We propose the Spatial Memory Network model to solve the problem of Visual Question Answering (VQA). Our model stores neuron activations from different spatial regions of the image in its memory, and uses question information of different granularity to choose relevant regions for computing the answer with an explicit attention mechanism. We evaluate our model on two published visual question answering datasets, DAQUAR [3] and VQA [1], and obtain improved results.

1. Introduction

Visual Question Answering (VQA) is an emerging interdisciplinary research problem with many real-life applications, such as automatic querying of surveillance video [9] or assisting the visually impaired [2]. VQA is seen as a Turing test proxy in [3], where handcrafted question and image features are combined in a latent-world Bayesian framework. More recently, several end-to-end deep VQA networks [4, 6] have been adapted from captioning models; they utilize a recurrent LSTM network that takes the question and Convolutional Neural Net (CNN) image features as input and outputs the answer. However, these models do not have any explicit notion of object position, and do not support the computation of intermediate results based on spatial attention.

We propose the Spatial Memory Network VQA (SMem-VQA) model, which incorporates explicit spatial attention based on memory networks, recently proposed for text Question Answering (QA) [10, 7]. A text QA memory network stores textual knowledge in its “memory” in the form of sentences, and selects relevant sentences to infer the answer. Our SMem-VQA model instead stores the convolutional network outputs in the memory, which explicitly allows spatial attention over the image guided by the question. Fig. 1 shows the inference process of our two-hop model.

Figure 1. The inference process of our SMem-VQA two-hop model on examples from the VQA dataset [1]. (Examples shown: "What is the child standing on?", answer "skateboard"; "What color is the phone booth?", answer "blue".)

Figure 2. Our proposed Spatial Memory Network for Visual Question Answering (SMem-VQA). (a) Overview; (b) Word-guided attention. (Example question in the figure: "Is there a cat in the basket?", answer "no".)

In the first hop (middle), the attention process captures the correspondence between individual words in the question and image regions. High-attention regions (bright areas) are marked with bounding boxes and the corresponding words are highlighted in the same color. In the second hop (right), the fine-grained evidence gathered in the first hop, as well as an embedding of the entire question, is used to collect more exact evidence to predict the answer.

2. Spatial Memory Network for VQA

The question is represented as $V = \{v_j \mid v_j \in \mathbb{R}^N;\ j = 1, \cdots, T\}$, where $T$ is the maximum number of words in the question and $N$ is the dimensionality of the word vectors. $S = \{s_i \mid s_i \in \mathbb{R}^M;\ i = 1, \cdots, L\}$ represents the spatial CNN features at each of the $L$ grid locations (the last convolutional layer of GoogLeNet, inception_5b/output [8]).
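
To make the notation concrete, here is a minimal numpy sketch that builds random stand-ins for $V$ and $S$ in the shapes assumed by the equations below. The specific dimensions T, N, L and M (and the use of numpy itself) are illustrative assumptions, not values stated in this paper.

```python
import numpy as np

# Illustrative dimensions (assumed for this sketch, not specified in the text):
#   T: words per question, N: word-vector size,
#   L: spatial grid locations, M: CNN feature channels
T, N, L, M = 8, 500, 49, 1024

rng = np.random.default_rng(0)
V = rng.standard_normal((T, N))   # question word vectors, V = {v_j}, v_j in R^N
S = rng.standard_normal((L, M))   # spatial CNN features,  S = {s_i}, s_i in R^M
```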

Fig. 2 (a) gives an overview of the proposed SMem-VQA network.


Table 1. Test-dev and test-standard results on the Open-Ended VQA dataset (in percentage). Models with ∗ use external training data in addition to the VQA dataset.

Model                  test-dev                               test-standard
                       Overall  yes/no  number  others        Overall  yes/no  number  others
LSTM Q+I [1]           53.74    78.94   35.24   36.42         54.06    -       -       -
ACK∗ [11]              55.72    79.23   36.13   40.08         55.98    79.05   36.10   40.61
DPPnet∗ [5]            57.22    80.71   37.24   41.69         57.36    80.28   36.92   42.24
iBOWIMG [12]           55.72    76.55   35.03   42.62         55.89    76.76   34.98   42.62
SMem-VQA One-Hop       56.56    78.98   35.93   42.09         -        -       -       -
SMem-VQA Two-Hop       57.99    80.87   37.32   43.12         58.24    80.8    37.53   43.48

First, the CNN activation vectors $S = \{s_i\}$ at image locations $i$ are projected into the semantic space of the question word vectors $v_j$ using the “attention” visual embedding $W_A$. The results are then used to infer spatial attention weights $W_{att}$ using the word-guided attention process shown in Fig. 2 (b). Word-guided attention predicts attention determined by the question word that has the maximum correlation with the embedded visual features at each location, e.g., choosing the word basket to attend to the location of the basket in the image. The resulting spatial attention weights $W_{att}$ are then used to compute a weighted sum over the visual features embedded via a separate “evidence” transformation $W_E$, e.g., selecting evidence for the cat concept at the basket location. Finally, the weighted evidence vector $S_{att}$ is combined with the full question embedding $Q$ to predict the answer. An additional hop can repeat the process to gather more evidence.

The equations used in the first hop of the SMem-VQA model are as follows:

$$
\begin{aligned}
C &= V \cdot (S \cdot W_A + b_A)^T && (1)\\
W_{att} &= \operatorname{softmax}\Big(\max_{i=1,\cdots,T} C_i\Big), \quad C_i \in \mathbb{R}^L && (2)\\
S_{att} &= W_{att} \cdot (S \cdot W_E + b_E) && (3)\\
Q &= W_Q \cdot V + b_Q && (4)\\
P &= \operatorname{softmax}\big(W_P \cdot f(S_{att} + Q) + b_P\big) && (5)
\end{aligned}
$$
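
Continuing the sketch above, the following walks through Eqs. (1)–(5) at the level of matrix shapes. The random parameter initialization, the choice of tanh for the nonlinearity $f$, the sum pooling used to form $Q$ from the projected word vectors, and the answer-vocabulary size are assumptions made only to keep the example self-contained; they are not details given in this paper.

```python
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Learned parameters (random stand-ins for the sketch)
W_A, b_A = 0.01 * rng.standard_normal((M, N)), np.zeros(N)   # "attention" embedding
W_E, b_E = 0.01 * rng.standard_normal((M, N)), np.zeros(N)   # "evidence" embedding
W_Q, b_Q = 0.01 * rng.standard_normal((N, N)), np.zeros(N)   # question embedding

# Eq. (1): correlation of every word with every location, C has shape (T, L)
C = V @ (S @ W_A + b_A).T
# Eq. (2): per location, take the best-matching word, then softmax over locations
W_att = softmax(C.max(axis=0))                # shape (L,)
# Eq. (3): attention-weighted sum of the "evidence" embeddings
S_att = W_att @ (S @ W_E + b_E)               # shape (N,)
# Eq. (4): whole-question embedding (pooled here by summing projected words)
Q = (V @ W_Q.T + b_Q).sum(axis=0)             # shape (N,)
# Eq. (5): distribution over a vocabulary of n_answers candidate answers
n_answers = 1000
W_P, b_P = 0.01 * rng.standard_normal((n_answers, N)), np.zeros(n_answers)
f = np.tanh                                   # nonlinearity f (assumed)
P = softmax(W_P @ f(S_att + Q) + b_P)         # shape (n_answers,)
```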

The equations used in the second hop of the SMem-VQA model are as follows:

$$
\begin{aligned}
O_{hop1} &= S_{att} + Q && (6)\\
C_{hop2} &= (S \cdot W_E + b_E) \cdot O_{hop1} && (7)\\
W_{att2} &= \operatorname{softmax}(C_{hop2}) && (8)\\
S_{att2} &= W_{att2} \cdot (S \cdot W_{E2} + b_{E2}) && (9)\\
P &= \operatorname{softmax}\big(W_P \cdot f(O_{hop1} + S_{att2}) + b_P\big) && (10)
\end{aligned}
$$
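
The second hop reuses the first-hop outputs; continuing the same sketch (same assumptions as above), Eqs. (6)–(10) become:

```python
# Second hop (continues the variables defined in the first-hop sketch above)

# Eq. (6): combine first-hop visual evidence with the question embedding
O_hop1 = S_att + Q                               # shape (N,)
# Eq. (7): correlate each location's evidence embedding with O_hop1
C_hop2 = (S @ W_E + b_E) @ O_hop1                # shape (L,)
# Eq. (8): second-hop attention weights over locations
W_att2 = softmax(C_hop2)
# Eq. (9): gather evidence through a separate embedding W_E2, b_E2
W_E2, b_E2 = 0.01 * rng.standard_normal((M, N)), np.zeros(N)
S_att2 = W_att2 @ (S @ W_E2 + b_E2)              # shape (N,)
# Eq. (10): final answer distribution
P = softmax(W_P @ f(O_hop1 + S_att2) + b_P)      # shape (n_answers,)
```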

The correlation matrix $C$ in the first hop provides fine-grained local evidence from each word vector in $V$, while the correlation vector $C_{hop2}$ in the second hop considers global evidence from the whole question representation $Q$.

Table 2. Accuracy results on the DAQUAR dataset (in percentage).

Model                    DAQUAR
Multi-World [3]          12.73
Neural-Image-QA [4]      29.27
Question LSTM [4]        32.32
VIS+LSTM [6]             34.41
Question BOW [6]         32.67
IMG+BOW [6]              34.17
SMem-VQA One-Hop         36.03
SMem-VQA Two-Hop         40.07

3. Experiments

Results on DAQUAR Dataset. The 0-1 accuracy results of our SMem-VQA model and other baseline models on the reduced DAQUAR dataset [3] are shown in Tab. 2. Modeling the question alone with either the LSTM model or the Question BOW model does equally well in comparison, indicating that the question text contains important prior information for predicting the answer. Also, on this dataset, the VIS+LSTM model achieves better accuracy than the Neural-Image-QA model; the former shows the image only at the first timestep of the LSTM, while the latter does so at each timestep. In comparison, both our One-Hop and Two-Hop spatial attention models outperform IMG+BOW, which concatenates image and question features to predict the answer, as well as the other baseline models.

Results on VQA Dataset. We use the full release (V1.0) of the open-ended VQA dataset [1] and report the test-dev and test-standard results from the VQA evaluation server in Tab. 1. The SMem-VQA Two-Hop model achieves a clear improvement on test-dev and test-standard over the iBOWIMG model, which uses mean pooling of the convolutional features (inception_5b/output) over locations, demonstrating the value of spatial attention. The SMem-VQA Two-Hop model also achieves a slightly better result than the DPPnet model, which uses a large-scale text corpus to pre-train the Gated Recurrent Unit (GRU) network for question representation. Considering that our model does not use extra data to pre-train the word embeddings, nor an external knowledge base as in the ACK model, its results are very competitive.


References

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. CoRR, abs/1505.00468, 2015.

[2] W. S. Lasecki, Y. Zhong, and J. P. Bigham. Increasing the bandwidth of crowdsourced visual question answering to better support blind users. In Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility, pages 263-264. ACM, 2014.

[3] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. CoRR, abs/1410.0210, 2014.

[4] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. arXiv preprint arXiv:1505.01121, 2015.

[5] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. arXiv preprint arXiv:1511.05756, 2015.

[6] M. Ren, R. Kiros, and R. S. Zemel. Exploring models and data for image question answering. CoRR, abs/1505.02074, 2015.

[7] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. arXiv preprint arXiv:1503.08895, 2015.

[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[9] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2):42-70, 2014.

[10] J. Weston, S. Chopra, and A. Bordes. Memory networks. CoRR, abs/1410.3916, 2014.

[11] Q. Wu, P. Wang, C. Shen, A. v. d. Hengel, and A. Dick. Ask me anything: Free-form visual question answering based on knowledge from external sources. arXiv preprint arXiv:1511.06973, 2015.

[12] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.

