

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Huijuan Xu and Kate Saenko
Department of Computer Science, UMass Lowell, USA

[email protected], [email protected]

Abstract

We propose the Spatial Memory Network model to solve the problem of Visual Question Answering (VQA). Our model stores neuron activations from different spatial regions of the image in its memory, and uses question information at different levels of granularity to choose relevant regions for computing the answer with an explicit attention mechanism. We evaluate our model on two published visual question answering datasets, DAQUAR [3] and VQA [1], and obtain improved results.

1. Introduction

Visual Question Answering (VQA) is an emerging interdisciplinary research problem with many real-life applications, such as automatic querying of surveillance video [9] or assisting the visually impaired [2]. VQA is treated as a proxy for the Turing test in [3], where handcrafted question and image features are combined in a latent-world Bayesian framework. More recently, several end-to-end deep VQA networks [4, 6] have been adapted from captioning models; they use a recurrent LSTM network that takes the question and Convolutional Neural Network (CNN) image features as input and outputs the answer. However, these models have no explicit notion of object position and do not support the computation of intermediate results based on spatial attention.

We propose the Spatial Memory Network VQA (SMem-VQA), which incorporates explicit spatial attention based on memory networks, recently proposed for text Question Answering (QA) [10, 7]. The text QA memory network stores textual knowledge in its "memory" in the form of sentences, and selects relevant sentences to infer the answer. Our SMem-VQA model instead stores the convolutional network outputs in its memory, which explicitly allows spatial attention over the image guided by the question. Fig. 1 shows the inference process of our two-hop model.

Figure 1. The inference process of our SMem-VQA two-hop model on examples from the VQA dataset [1]: "What is the child standing on?" (answer: skateboard) and "What color is the phone booth?" (answer: blue).

Figure 2. Our proposed Spatial Memory Network for Visual Question Answering (SMem-VQA). (a) Overview; (b) word-guided attention.

In the first hop (middle of Fig. 1), the attention process captures the correspondence between individual words in the question and image regions. High-attention regions (bright areas) are marked with bounding boxes and the corresponding words are highlighted in the same color. In the second hop (right), the fine-grained evidence gathered in the first hop, together with an embedding of the entire question, is used to collect more precise evidence to predict the answer.

2. Spatial Memory Network for VQA

The question is represented as $V = \{v_j \mid v_j \in \mathbb{R}^N;\, j = 1, \cdots, T\}$, where $T$ is the maximum number of words in the question and $N$ is the dimensionality of the word vectors. $S = \{s_i \mid s_i \in \mathbb{R}^M;\, i = 1, \cdots, L\}$ represents the spatial CNN features at each of the $L$ grid locations (the last convolutional layer of GoogLeNet, inception_5b/output [8]).
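To make the notation concrete, the following minimal NumPy sketch sets up the two inputs. The sizes are hypothetical choices for illustration only (the text above fixes the symbols $T$, $N$, $L$, $M$, not their values), and the 7x7 grid with 1024 channels for inception_5b/output is an assumption about the feature extractor, not something stated here.

```python
import numpy as np

# Hypothetical sizes for illustration only.
T, N = 20, 300       # maximum words per question, word-vector dimensionality
L, M = 7 * 7, 1024   # assumed 7x7 grid of inception_5b/output activations, 1024 channels each

rng = np.random.default_rng(0)
V = rng.standard_normal((T, N))   # question: one embedding per word (zero-padded to length T)
S = rng.standard_normal((L, M))   # image: one CNN activation vector per grid location
```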

Fig. 2 (a) gives an overview of the proposed SMem-VQA network.


Table 1. Test-dev and test-standard results on the Open-Ended VQA dataset (in percentage). Models with ∗ use external training data in addition to the VQA dataset.

                          test-dev                          test-standard
                          Overall  yes/no  number  others   Overall  yes/no  number  others
LSTM Q+I [1]                53.74   78.94   35.24   36.42     54.06       -       -       -
ACK∗ [11]                   55.72   79.23   36.13   40.08     55.98   79.05   36.10   40.61
DPPnet∗ [5]                 57.22   80.71   37.24   41.69     57.36   80.28   36.92   42.24
iBOWIMG [12]                55.72   76.55   35.03   42.62     55.89   76.76   34.98   42.62
SMem-VQA One-Hop            56.56   78.98   35.93   42.09         -       -       -       -
SMem-VQA Two-Hop            57.99   80.87   37.32   43.12     58.24    80.8   37.53   43.48

First, the CNN activation vectors $S = \{s_i\}$ at image locations $i$ are projected into the semantic space of the question word vectors $v_j$ using the "attention" visual embedding $W_A$. The results are then used to infer spatial attention weights $W_{att}$ through the word-guided attention process shown in Fig. 2 (b). Word-guided attention selects, for each location, the question word that has the maximum correlation with the embedded visual features at that location, e.g., choosing the word basket to attend to the location of the basket in the image. The resulting spatial attention weights $W_{att}$ are then used to compute a weighted sum over the visual features embedded via a separate "evidence" transformation $W_E$, e.g., selecting evidence for the cat concept at the basket location. Finally, the weighted evidence vector $S_{att}$ is combined with the full question embedding $Q$ to predict the answer. An additional hop can repeat the process to gather more evidence.

The equations used in the first hop of the SMem-VQA model are as follows:

$$C = V \cdot (S \cdot W_A + b_A)^T \quad (1)$$

$$W_{att} = \mathrm{softmax}\Big(\max_{i=1,\cdots,T} C_i\Big), \quad C_i \in \mathbb{R}^L \quad (2)$$

$$S_{att} = W_{att} \cdot (S \cdot W_E + b_E) \quad (3)$$

$$Q = W_Q \cdot V + b_Q \quad (4)$$

$$P = \mathrm{softmax}\big(W_P \cdot f(S_{att} + Q) + b_P\big) \quad (5)$$
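The NumPy sketch below is one way to implement Eqs. (1)-(5); it is a minimal illustration, not the authors' code. The pooling of the word vectors into $Q$ in Eq. (4) and the nonlinearity $f$ in Eq. (5) are not specified in this abstract, so the sum over words and the tanh used below are assumptions, and all weight matrices and biases would in practice be learned end to end.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def smem_vqa_first_hop(V, S, W_A, b_A, W_E, b_E, W_Q, b_Q, W_P, b_P):
    """One attention hop of SMem-VQA, following Eqs. (1)-(5).

    V: (T, N) question word embeddings; S: (L, M) spatial CNN features.
    All W_* / b_* arguments are parameters that would be learned end to end.
    """
    # Eq. (1): correlation between every word and every embedded image location.
    C = V @ (S @ W_A + b_A).T                   # (T, L)
    # Eq. (2): keep each location's best-matching word, then softmax over locations.
    W_att = softmax(C.max(axis=0))              # (L,)
    # Eq. (3): attention-weighted sum of the "evidence" embeddings of the locations.
    S_att = W_att @ (S @ W_E + b_E)             # (N,)
    # Eq. (4): whole-question embedding; the pooling of the word vectors is not
    # spelled out in the abstract, so a BOW-style sum over words is assumed here.
    Q = W_Q @ V.sum(axis=0) + b_Q               # (N,)
    # Eq. (5): answer distribution; f is unspecified, tanh assumed.
    P = softmax(W_P @ np.tanh(S_att + Q) + b_P)
    return P, S_att, Q
```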

The equations used in the second hop of the SMem-VQA model are as follows:

$$O_{hop1} = S_{att} + Q \quad (6)$$

$$C_{hop2} = (S \cdot W_E + b_E) \cdot O_{hop1} \quad (7)$$

$$W_{att2} = \mathrm{softmax}(C_{hop2}) \quad (8)$$

$$S_{att2} = W_{att2} \cdot (S \cdot W_{E_2} + b_{E_2}) \quad (9)$$

$$P = \mathrm{softmax}\big(W_P \cdot f(O_{hop1} + S_{att2}) + b_P\big) \quad (10)$$

The correlation matrix $C$ in the first hop provides fine-grained local evidence from each word vector in the question, while the correlation vector $C_{hop2}$ in the next hop captures the global evidence from the whole question representation $Q$.
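Continuing the sketch above (it reuses the softmax helper and the first-hop outputs S_att and Q), a minimal implementation of Eqs. (6)-(10) could look as follows; the tanh nonlinearity is again an assumption for $f$.

```python
import numpy as np  # softmax() and (S_att, Q) come from the first-hop sketch above

def smem_vqa_second_hop(S, S_att, Q, W_E, b_E, W_E2, b_E2, W_P, b_P):
    """Second attention hop of SMem-VQA, following Eqs. (6)-(10)."""
    # Eq. (6): first-hop output, combining attended visual evidence with the question.
    O_hop1 = S_att + Q                                 # (N,)
    # Eq. (7): correlate each location's evidence embedding with the first-hop output.
    C_hop2 = (S @ W_E + b_E) @ O_hop1                  # (L,)
    # Eq. (8): attention over locations, now guided by the whole question.
    W_att2 = softmax(C_hop2)                           # (L,)
    # Eq. (9): gather evidence through a second "evidence" embedding W_E2.
    S_att2 = W_att2 @ (S @ W_E2 + b_E2)                # (N,)
    # Eq. (10): final answer distribution (tanh assumed for f).
    P = softmax(W_P @ np.tanh(O_hop1 + S_att2) + b_P)  # (num_answers,)
    return P
```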

Table 2. Accuracy results on the DAQUAR dataset (in percentage).

                         DAQUAR
Multi-World [3]           12.73
Neural-Image-QA [4]       29.27
Question LSTM [4]         32.32
VIS+LSTM [6]              34.41
Question BOW [6]          32.67
IMG+BOW [6]               34.17
SMem-VQA One-Hop          36.03
SMem-VQA Two-Hop          40.07

3. Experiments

Results on DAQUAR Dataset. The 0-1 accuracy results of our SMem-VQA model and other baseline models on the reduced DAQUAR dataset [3] are shown in Tab. 2. Modeling the question alone with either the LSTM model or the Question BOW model does roughly equally well, indicating that the question text contains important prior information for predicting the answer. Also, on this dataset, the VIS+LSTM model achieves better accuracy than the Neural-Image-QA model; the former shows the image only at the first timestep of the LSTM, while the latter does so at each timestep. In comparison, both our One-Hop and Two-Hop spatial attention models outperform IMG+BOW, which concatenates image and question features to predict the answer, as well as the other baseline models.

Results on VQA Dataset. We use the full release (V1.0) of the open-ended VQA dataset [1] and report the test-dev and test-standard results from the VQA evaluation server in Tab. 1. The SMem-VQA Two-Hop model achieves a clear improvement on test-dev and test-standard over the iBOWIMG model, which uses mean pooling of the convolutional features (inception_5b/output) over locations, demonstrating the value of spatial attention. The SMem-VQA Two-Hop model also has a slightly better result than the DPPnet model, which uses a large-scale text corpus to pre-train a Gated Recurrent Unit (GRU) network for question representation. Considering that our model does not use extra data to pre-train the word embeddings or an external knowledge base as in the ACK model, its results are very competitive.


References

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. arXiv preprint arXiv:1505.00468, 2015.

[2] W. S. Lasecki, Y. Zhong, and J. P. Bigham. Increasing the bandwidth of crowdsourced visual question answering to better support blind users. In Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility, pages 263–264. ACM, 2014.

[3] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. arXiv preprint arXiv:1410.0210, 2014.

[4] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. arXiv preprint arXiv:1505.01121, 2015.

[5] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. arXiv preprint arXiv:1511.05756, 2015.

[6] M. Ren, R. Kiros, and R. S. Zemel. Exploring models and data for image question answering. arXiv preprint arXiv:1505.02074, 2015.

[7] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. arXiv preprint arXiv:1503.08895, 2015.

[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[9] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2):42–70, 2014.

[10] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

[11] Q. Wu, P. Wang, C. Shen, A. v. d. Hengel, and A. Dick. Ask me anything: Free-form visual question answering based on knowledge from external sources. arXiv preprint arXiv:1511.06973, 2015.

[12] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.
