
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Huijuan Xu and Kate Saenko
Department of Computer Science, UMass Lowell, USA

hxu1@cs.uml.edu, saenko@cs.uml.edu

Abstract

We propose the Spatial Memory Network model to solve the problem of Visual Question Answering (VQA). Our model stores neuron activations from different spatial regions of the image in its memory, and uses question information of different granularity to choose relevant regions for computing the answer with an explicit attention mechanism. We evaluate our model on two published visual question answering datasets, DAQUAR [3] and VQA [1], and obtain improved results.

1. Introduction

Visual Question Answering (VQA) is an emerging interdisciplinary research problem with many real-life applications, such as automatic querying of surveillance video [9] or assisting the visually impaired [2]. VQA is seen as a Turing test proxy in [3], where handcrafted question and image features are combined in a latent-world Bayesian framework. More recently, several end-to-end deep VQA networks [4, 6] have been adapted from captioning models; they utilize a recurrent LSTM network that takes the question and Convolutional Neural Net (CNN) image features as input and outputs the answer. However, these models do not have any explicit notion of object position, and do not support the computation of intermediate results based on spatial attention.

We propose the Spatial Memory Network VQA (SMem-VQA) model, which incorporates explicit spatial attention based on memory networks, recently proposed for text Question Answering (QA) [10, 7]. A text QA memory network stores textual knowledge in its “memory” in the form of sentences, and selects relevant sentences to infer the answer. Our SMem-VQA model instead stores the convolutional network outputs in the memory, which explicitly allows spatial attention over the image guided by the question. Fig. 1 shows the inference process of our two-hop model.

Figure 1. The inference process of our SMem-VQA two-hop model on examples from the VQA dataset [1]. (Examples shown: "What is the child standing on?", answer "skateboard"; "What color is the phone booth?", answer "blue".)

Figure 2. Our proposed Spatial Memory Network for Visual Question Answering (SMem-VQA). (a) Overview; (b) Word-guided attention. (Example question in the figure: "Is there a cat in the basket?", answer "no".)

In the first hop (middle), the attention process captures the correspondence between individual words in the question and image regions. High-attention regions (bright areas) are marked with bounding boxes and the corresponding words are highlighted in the same color. In the second hop (right), the fine-grained evidence gathered in the first hop, as well as an embedding of the entire question, is used to collect more exact evidence to predict the answer.

2. Spatial Memory Network for VQA

The question is represented as $V = \{v_j \mid v_j \in \mathbb{R}^N;\ j = 1, \cdots, T\}$, where $T$ is the maximum number of words in the question and $N$ is the dimensionality of the word vectors. $S = \{s_i \mid s_i \in \mathbb{R}^M;\ i = 1, \cdots, L\}$ represents the spatial CNN features at each of the $L$ grid locations (the last convolutional layer of GoogLeNet, inception_5b/output [8]).
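
To make the notation concrete, here is a minimal numpy sketch that builds random stand-ins for $V$ and $S$ in the shapes assumed by the equations below. The specific dimensions T, N, L and M (and the use of numpy itself) are illustrative assumptions, not values stated in this paper.

```python
import numpy as np

# Illustrative dimensions (assumed for this sketch, not specified in the text):
#   T: words per question, N: word-vector size,
#   L: spatial grid locations, M: CNN feature channels
T, N, L, M = 8, 500, 49, 1024

rng = np.random.default_rng(0)
V = rng.standard_normal((T, N))   # question word vectors, V = {v_j}, v_j in R^N
S = rng.standard_normal((L, M))   # spatial CNN features,  S = {s_i}, s_i in R^M
```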

Fig. 2 (a) gives an overview of the proposed SMem-VQA network.


Table 1. Test-dev and test-standard results on the Open-Ended VQA dataset (in percentage). Models with ∗ use external training data in addition to the VQA dataset.

Model                  test-dev                               test-standard
                       Overall  yes/no  number  others        Overall  yes/no  number  others
LSTM Q+I [1]           53.74    78.94   35.24   36.42         54.06    -       -       -
ACK∗ [11]              55.72    79.23   36.13   40.08         55.98    79.05   36.10   40.61
DPPnet∗ [5]            57.22    80.71   37.24   41.69         57.36    80.28   36.92   42.24
iBOWIMG [12]           55.72    76.55   35.03   42.62         55.89    76.76   34.98   42.62
SMem-VQA One-Hop       56.56    78.98   35.93   42.09         -        -       -       -
SMem-VQA Two-Hop       57.99    80.87   37.32   43.12         58.24    80.8    37.53   43.48

First, the CNN activation vectors $S = \{s_i\}$ at image locations $i$ are projected into the semantic space of the question word vectors $v_j$ using the “attention” visual embedding $W_A$. The results are then used to infer spatial attention weights $W_{att}$ using the word-guided attention process shown in Fig. 2 (b). Word-guided attention predicts attention determined by the question word that has the maximum correlation with the embedded visual features at each location, e.g., choosing the word basket to attend to the location of the basket in the image. The resulting spatial attention weights $W_{att}$ are then used to compute a weighted sum over the visual features embedded via a separate “evidence” transformation $W_E$, e.g., selecting evidence for the cat concept at the basket location. Finally, the weighted evidence vector $S_{att}$ is combined with the full question embedding $Q$ to predict the answer. An additional hop can repeat the process to gather more evidence.

The equations used in the first hop of the SMem-VQA model are as follows:

$$
\begin{aligned}
C &= V \cdot (S \cdot W_A + b_A)^T && (1)\\
W_{att} &= \operatorname{softmax}\Big(\max_{i=1,\cdots,T} C_i\Big), \quad C_i \in \mathbb{R}^L && (2)\\
S_{att} &= W_{att} \cdot (S \cdot W_E + b_E) && (3)\\
Q &= W_Q \cdot V + b_Q && (4)\\
P &= \operatorname{softmax}\big(W_P \cdot f(S_{att} + Q) + b_P\big) && (5)
\end{aligned}
$$
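
Continuing the sketch above, the following walks through Eqs. (1)–(5) at the level of matrix shapes. The random parameter initialization, the choice of tanh for the nonlinearity $f$, the sum pooling used to form $Q$ from the projected word vectors, and the answer-vocabulary size are assumptions made only to keep the example self-contained; they are not details given in this paper.

```python
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Learned parameters (random stand-ins for the sketch)
W_A, b_A = 0.01 * rng.standard_normal((M, N)), np.zeros(N)   # "attention" embedding
W_E, b_E = 0.01 * rng.standard_normal((M, N)), np.zeros(N)   # "evidence" embedding
W_Q, b_Q = 0.01 * rng.standard_normal((N, N)), np.zeros(N)   # question embedding

# Eq. (1): correlation of every word with every location, C has shape (T, L)
C = V @ (S @ W_A + b_A).T
# Eq. (2): per location, take the best-matching word, then softmax over locations
W_att = softmax(C.max(axis=0))                # shape (L,)
# Eq. (3): attention-weighted sum of the "evidence" embeddings
S_att = W_att @ (S @ W_E + b_E)               # shape (N,)
# Eq. (4): whole-question embedding (pooled here by summing projected words)
Q = (V @ W_Q.T + b_Q).sum(axis=0)             # shape (N,)
# Eq. (5): distribution over a vocabulary of n_answers candidate answers
n_answers = 1000
W_P, b_P = 0.01 * rng.standard_normal((n_answers, N)), np.zeros(n_answers)
f = np.tanh                                   # nonlinearity f (assumed)
P = softmax(W_P @ f(S_att + Q) + b_P)         # shape (n_answers,)
```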

The equations used in the second hop of the SMem-VQA model are as follows:

$$
\begin{aligned}
O_{hop1} &= S_{att} + Q && (6)\\
C_{hop2} &= (S \cdot W_E + b_E) \cdot O_{hop1} && (7)\\
W_{att2} &= \operatorname{softmax}(C_{hop2}) && (8)\\
S_{att2} &= W_{att2} \cdot (S \cdot W_{E2} + b_{E2}) && (9)\\
P &= \operatorname{softmax}\big(W_P \cdot f(O_{hop1} + S_{att2}) + b_P\big) && (10)
\end{aligned}
$$
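
The second hop reuses the first-hop outputs; continuing the same sketch (same assumptions as above), Eqs. (6)–(10) become:

```python
# Second hop (continues the variables defined in the first-hop sketch above)

# Eq. (6): combine first-hop visual evidence with the question embedding
O_hop1 = S_att + Q                               # shape (N,)
# Eq. (7): correlate each location's evidence embedding with O_hop1
C_hop2 = (S @ W_E + b_E) @ O_hop1                # shape (L,)
# Eq. (8): second-hop attention weights over locations
W_att2 = softmax(C_hop2)
# Eq. (9): gather evidence through a separate embedding W_E2, b_E2
W_E2, b_E2 = 0.01 * rng.standard_normal((M, N)), np.zeros(N)
S_att2 = W_att2 @ (S @ W_E2 + b_E2)              # shape (N,)
# Eq. (10): final answer distribution
P = softmax(W_P @ f(O_hop1 + S_att2) + b_P)      # shape (n_answers,)
```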

The correlation matrix $C$ in the first hop provides fine-grained local evidence from each word vector in $V$, while the correlation vector $C_{hop2}$ in the second hop considers global evidence from the whole question representation $Q$.

Table 2. Accuracy results on the DAQUAR dataset (in percentage).

Model                    DAQUAR
Multi-World [3]          12.73
Neural-Image-QA [4]      29.27
Question LSTM [4]        32.32
VIS+LSTM [6]             34.41
Question BOW [6]         32.67
IMG+BOW [6]              34.17
SMem-VQA One-Hop         36.03
SMem-VQA Two-Hop         40.07

3. Experiments

Results on DAQUAR Dataset. The 0-1 accuracy results of our SMem-VQA model and other baseline models on the reduced DAQUAR dataset [3] are shown in Tab. 2. Modeling the question alone with either the LSTM model or the Question BOW model does equally well in comparison, indicating that the question text contains important prior information for predicting the answer. Also, on this dataset, the VIS+LSTM model achieves better accuracy than the Neural-Image-QA model; the former shows the image only at the first timestep of the LSTM, while the latter does so at each timestep. In comparison, both our One-Hop and Two-Hop spatial attention models outperform IMG+BOW, which concatenates image and question features to predict the answer, as well as the other baseline models.

Results on VQA Dataset. We use the full release (V1.0) of the open-ended VQA dataset [1] and report the test-dev and test-standard results from the VQA evaluation server in Tab. 1. The SMem-VQA Two-Hop model achieves a clear improvement on test-dev and test-standard over the iBOWIMG model, which uses mean pooling of the convolutional features (inception_5b/output) over locations, demonstrating the value of spatial attention. The SMem-VQA Two-Hop model also achieves a slightly better result than the DPPnet model, which uses a large-scale text corpus to pre-train the Gated Recurrent Unit (GRU) network for question representation. Considering that our model does not use extra data to pre-train the word embeddings, nor an external knowledge base as in the ACK model, its results are very competitive.


References

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. CoRR, abs/1505.00468, 2015.

[2] W. S. Lasecki, Y. Zhong, and J. P. Bigham. Increasing the bandwidth of crowdsourced visual question answering to better support blind users. In Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility, pages 263-264. ACM, 2014.

[3] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. CoRR, abs/1410.0210, 2014.

[4] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. arXiv preprint arXiv:1505.01121, 2015.

[5] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. arXiv preprint arXiv:1511.05756, 2015.

[6] M. Ren, R. Kiros, and R. S. Zemel. Exploring models and data for image question answering. CoRR, abs/1505.02074, 2015.

[7] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. arXiv preprint arXiv:1503.08895, 2015.

[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[9] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2):42-70, 2014.

[10] J. Weston, S. Chopra, and A. Bordes. Memory networks. CoRR, abs/1410.3916, 2014.

[11] Q. Wu, P. Wang, C. Shen, A. v. d. Hengel, and A. Dick. Ask me anything: Free-form visual question answering based on knowledge from external sources. arXiv preprint arXiv:1511.06973, 2015.

[12] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.

