22
ATTENTIONRNN: A Structured Spatial Attention Mechanism Siddhesh Khandelwal University of British Columbia [email protected] Leonid Sigal University of British Columbia [email protected] Abstract Visual attention mechanisms have proven to be integrally important constituent components of many modern deep neural architectures. They provide an efficient and effec- tive way to utilize visual information selectively, which has shown to be especially valuable in multi-modal learning tasks. However, all prior attention frameworks lack the ability to explicitly model structural dependencies among attention variables, making it difficult to predict consistent attention masks. In this paper we develop a novel structured spatial attention mechanism which is end-to-end trainable and can be integrated with any feed-forward convolutional neural network. This proposed AttentionRNN layer explic- itly enforces structure over the spatial attention variables by sequentially predicting attention values in the spatial mask in a bi-directional raster-scan and inverse raster-scan or- der. As a result, each attention value depends not only on local image or contextual information, but also on the pre- viously predicted attention values. Our experiments show consistent quantitative and qualitative improvements on a variety of recognition tasks and datasets; including image categorization, question answering and image generation. 1. Introduction In recent years, computer vision has made tremendous progress across many complex recognition tasks, including image classification [16, 41], image captioning [4, 13, 35, 37], image generation [27, 38, 40] and visual question an- swering (VQA) [2, 5, 12, 21, 24, 28, 34, 36]. Arguably, much of this success can be attributed to the use of visual attention mechanisms which, similar to the human percep- tion, identify the important regions of an image. Attention mechanisms typically produce a spatial mask for the given image feature tensor. In an ideal scenario, the mask is ex- pected to have higher activation values over the features cor- responding to the regions of interest, and lower activation values everywhere else. For tasks that are multi-modal in nature, like VQA, a query (e.g., a question) can additionally be used as an input to generate the mask. In such cases, the Figure 1: AttentionRNN. Illustration of the proposed struc- tured attention network as a module for down stream task. attention activation is usually a function of similarity be- tween the corresponding encoding of the image region and the question in a pre-defined or learned embedding space. Existing visual attention mechanisms can be broadly characterized into two categories: global or local; see Fig- ure 2a and 2b respectively for illustration. Global mech- anisms predict all the attention variables jointly, typically based on a dense representation of the image feature map. Such mechanisms are prone to overfitting and are only computationally feasible for low-resolution image features. Therefore, typically, these are only applied at the last con- volutional layer of a CNN [21, 42]. The local mechanisms generate attention values for each spatial attention vari- able independently based on corresponding image region [5, 24, 25](i.e., feature column) or with the help of local context [25, 32, 40]. As such, local attention mechanisms can be applied at arbitrary resolution and can be used at var- ious places within a CNN network (e.g., in [25] authors use them before each sub-sampling layer and in [32] as part of each residual block). However, all the aforementioned mod- els lack explicit structure in the generated attention masks. This is often exhibited by lack of coherence or sharp discon- tinuities in the generated attention activation values [25]. Consider a VQA model attending to regions required to answer the question, “Do the two spheres next to each other have the same color?”. Intuitively, attention mecha- nisms should focus on the two spheres in question. Further- more, attention region corresponding to one sphere should inform the estimates for attention region for the other, both in terms of shape and size. However, most traditional atten- 1 arXiv:1905.09400v1 [cs.CV] 22 May 2019

Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia [email protected] Leonid Sigal University of British Columbia [email protected] Abstract

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

ATTENTIONRNN: A Structured Spatial Attention Mechanism

Siddhesh KhandelwalUniversity of British Columbia

[email protected]

Leonid SigalUniversity of British Columbia

[email protected]

Abstract

Visual attention mechanisms have proven to be integrallyimportant constituent components of many modern deepneural architectures. They provide an efficient and effec-tive way to utilize visual information selectively, which hasshown to be especially valuable in multi-modal learningtasks. However, all prior attention frameworks lack theability to explicitly model structural dependencies amongattention variables, making it difficult to predict consistentattention masks. In this paper we develop a novel structuredspatial attention mechanism which is end-to-end trainableand can be integrated with any feed-forward convolutionalneural network. This proposed AttentionRNN layer explic-itly enforces structure over the spatial attention variables bysequentially predicting attention values in the spatial maskin a bi-directional raster-scan and inverse raster-scan or-der. As a result, each attention value depends not only onlocal image or contextual information, but also on the pre-viously predicted attention values. Our experiments showconsistent quantitative and qualitative improvements on avariety of recognition tasks and datasets; including imagecategorization, question answering and image generation.

1. IntroductionIn recent years, computer vision has made tremendous

progress across many complex recognition tasks, includingimage classification [16, 41], image captioning [4, 13, 35,37], image generation [27, 38, 40] and visual question an-swering (VQA) [2, 5, 12, 21, 24, 28, 34, 36]. Arguably,much of this success can be attributed to the use of visualattention mechanisms which, similar to the human percep-tion, identify the important regions of an image. Attentionmechanisms typically produce a spatial mask for the givenimage feature tensor. In an ideal scenario, the mask is ex-pected to have higher activation values over the features cor-responding to the regions of interest, and lower activationvalues everywhere else. For tasks that are multi-modal innature, like VQA, a query (e.g., a question) can additionallybe used as an input to generate the mask. In such cases, the

Figure 1: AttentionRNN. Illustration of the proposed struc-tured attention network as a module for down stream task.

attention activation is usually a function of similarity be-tween the corresponding encoding of the image region andthe question in a pre-defined or learned embedding space.

Existing visual attention mechanisms can be broadlycharacterized into two categories: global or local; see Fig-ure 2a and 2b respectively for illustration. Global mech-anisms predict all the attention variables jointly, typicallybased on a dense representation of the image feature map.Such mechanisms are prone to overfitting and are onlycomputationally feasible for low-resolution image features.Therefore, typically, these are only applied at the last con-volutional layer of a CNN [21, 42]. The local mechanismsgenerate attention values for each spatial attention vari-able independently based on corresponding image region[5, 24, 25] (i.e., feature column) or with the help of localcontext [25, 32, 40]. As such, local attention mechanismscan be applied at arbitrary resolution and can be used at var-ious places within a CNN network (e.g., in [25] authors usethem before each sub-sampling layer and in [32] as part ofeach residual block). However, all the aforementioned mod-els lack explicit structure in the generated attention masks.This is often exhibited by lack of coherence or sharp discon-tinuities in the generated attention activation values [25].

Consider a VQA model attending to regions requiredto answer the question, “Do the two spheres next to eachother have the same color?”. Intuitively, attention mecha-nisms should focus on the two spheres in question. Further-more, attention region corresponding to one sphere shouldinform the estimates for attention region for the other, bothin terms of shape and size. However, most traditional atten-

1

arX

iv:1

905.

0940

0v1

[cs

.CV

] 2

2 M

ay 2

019

Page 2: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

(a) Global Attention (b) Local Attention (c) Structured AttentionFigure 2: Different types of attention mechanisms. Compared are (a) global and (b) local attention mechanisms exploredin prior works and proposed structured AttentionRNN architecture in (c).

tion mechanisms have no ability to encode such dependen-cies. Recent modularized architectures [1, 10] are able toaddress some of these issues with attentive reasoning, butthey are relevant only for a narrow class of VQA problems.Such models are inapplicable to scenarios involving self-attention [32] or generative architectures, where granularshape-coherent attention masks are typically needed [40].

In this paper, we argue that these challenges can be ad-dressed by structured spatial attention. Such class of atten-tion models can potentially encode arbitrary constraints be-tween attention variables, be it top-down structured knowl-edge or local/global consistency and dependencies. To en-force this structure, we propose a novel attention mecha-nism which we refer to as AttentionRNN (see Figure 2c forillustration). We draw inspiration from the Diagonal BiL-STM architecture proposed in [30]. As such, AttentionRNNgenerates a spatial attention mask by traversing the imagediagonally, starting from a corner at the top and going to theopposite corner at the bottom. When predicting the atten-tion value for a particular image feature location, structureis enforced by taking into account: (i) local image contextaround the corresponding image feature location and, moreimportantly, (ii) information about previously generated at-tention values.

One of the key benefits of our model is that it can be usedagnostically in any existing feed-forward neural network atone or multiple convolutional feature levels (see Figure 1).To support this claim, we evaluate our method on differenttasks and with different backbone architectures. For VQA,we consider the Progressive Attention Network (PAN) [25]and Multimodal Compact Bilinear Pooling (MCB) [5]. Forimage generation, we consider the Modular Generative Ad-versarial Networks (MGAN) [40]. For image categoriza-tion, we consider the Convolutional Block Attention Mod-ule (CBAM) [32]. When we replace the existing attentionmechanisms in these models with our proposed Attention-RNN, we observe higher overall performance along withbetter spatial attention masks.

Contributions: Our contributions can be summarized as

follows: (1) We propose a novel spatial attention mecha-nism which explicitly encodes structure over the spatial at-tention variables by sequentially predicting values. As aconsequence, each attention value in a spatial mask dependsnot only on local image or contextual information, but alsoon previously predicted attention values. (2) We illustratethat this general attention mechanism can work with anyexisting model that relies on, or can benefit from, spatialattention; showing its effectiveness on a variety of differenttasks and datasets. (3) Through experimental evaluation, weobserve improved performance and better attention maskson VQA, image generation and image categorization tasks.

2. Related Work

2.1. Visual Attention

Visual attention mechanisms have been widely adoptedin the computer vision community owing to their ability tofocus on important regions in an image. Even though thereis a large variety of methods that deploy visual attention,they can be categorized based on key properties of the un-derlying attention mechanisms. For ease of understanding,we segregate related research using these properties.Placement of attention in a network. Visual attentionmechanisms are typically applied on features extracted bya convolutional neural network (CNN). Visual attention caneither be applied: (1) at the end of a CNN network, or (2)iteratively at different layers within a CNN network.

Applying visual attention at the end of a CNN networkis the most straightforward way of incorporating visual at-tention in deep models. This has led to an improvementin model performance across a variety of computer visiontasks, including image captioning [4, 35, 37], image recog-nition [41], VQA [21, 34, 36, 42], and visual dialog [24].

On the other hand, there have been several approachesthat iteratively apply visual attention, operating over mul-tiple CNN feature layers [11, 25, 32]. Seo et al. [25] pro-gressively apply attention after each pooling layer of a CNNnetwork to accurately attend over target objects of various

Page 3: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

scales and shape. Woo et al. [32] use a similar approach,but instead apply two different types of attention - one thatattends over feature channels and the other that attends overthe spatial domain.Context used to compute attention. Attention mecha-nisms differ on how much information they use to computethe attention mask. They can be global, that is use all theavailable image context to jointly predict the values in anattention mask [21, 35]. As an example, [21] propose anattention mechanism for VQA where the attention mask iscomputed by projecting the image features into some latentspace and then computing its similarity with the question.

Attention mechanisms can also be local, where-in atten-tion for each variable is generated independently or using acorresponding local image region [5, 24, 25, 32, 40]. Forexample, [25, 32, 40] use a k × k convolutional kernel tocompute a particular attention value, allowing them to cap-ture local information around the corresponding location.None of the aforementioned works enforce structure overthe generated attention masks. Structure over the values ofan image, however, has been exploited in many autoregres-sive models trained to generate images. The next sectionbriefly describes the relevant work in this area.

2.2. Autoregressive Models for Image Generation

Generative image modelling is a key problem in com-puter vision. In recent years, there has been significant workin this area [6, 14, 22, 23, 30, 39, 40]. Although most workuses stochastic latent variable models like VAEs [14, 22]or GANs [6, 39, 40], autoregressive models [23, 29, 30]provide a more tractable approach to model the joint distri-bution over the pixels. These models leverage the inherentstructure over the image, which enables them to express thejoint distribution as a product of conditional distributions- where the value of the next pixel is dependent on all thepreviously generated pixels.

Van et al. [30] propose a PixelRNN network that usesLSTMs [9] to model this sequential dependency betweenthe pixels. They also introduce a variant, called PixelCNN,that uses CNNs instead of LSTMs to allow for faster com-putations. They later extend PixelCNN to allow the modelto be conditioned on some query [29]. Finally, [23] pro-pose further simplifications to the PixelCNN architecture toimprove performance.

Our work draws inspiration from the PixelRNN archi-tecture proposed in [30]. We extend PixelRNN to modelstructural dependencies within attention masks.

3. Approach

Given an input image feature X ∈ Rh×m×n, our goalis to predict a spatial attention mask A ∈ Rm×n, whereh represents the number of channels, and m and n are the

number of rows and the columns of X respectively. LetX = {x1,1, . . . ,xm,n}, where xi,j ∈ Rh be a column fea-ture corresponding to the spatial location (i, j). Similarly,let A = {a1,1, . . . , am,n}, where ai,j ∈ R be the attentionvalue corresponding to xi,j . Formally, we want to modelthe conditional distribution p(A | X). In certain problems,we may also want to condition on other auxiliary informa-tion in addition to X, e.g. in VQA on a question. Whilein this paper we formulate and model attention probabilisti-cally, most traditional attention models directly predict theattention values, which can be regarded as a point estimate(or expected value) of A under our formulation.

Global attention mechanisms [21, 42] predict A directlyfrom X using a fully connected layer. Although this makesno assumptions on the factorization of p(A |X), it becomesintractable as the size of X increases. This is mainly due tothe large number of parameters in the fully connected layer.

Local attention mechanisms [24, 25, 32, 40], on the otherhand, make strong independence assumptions on the inter-actions between the attention variables ai,j . Particularly,they assume each attention variable ai,j to be independentof other variables given some local spatial context δ(xi,j).More formally, for local attention mechanisms,

p (A |X) ≈i=m,j=n∏i=1,j=1

p (ai,j | δ(xi,j)) (1)

Even though such a factorization improves tractability, thestrong independence assumption often leads to attentionmasks that lack coherence and contain sharp discontinuities.

Contrary to local attention mechanisms, our proposedAttentionRNN tries to capture some of the structural depen-dencies between the attention variables ai,j . We assume

p(A |X) =

i=m,j=n∏i=1,j=1

p (ai,j | a<i,j , X)) (2)

≈i=m,j=n∏i=1,j=1

p (ai,j | a<i,j , δ(xi,j)) (3)

where a<i,j = {a1,1, . . . , ai−1,j} (The blue and green re-gion in Figure 3). That is, each attention variable ai,j is nowdependent on : (i) the local spatial context δ(xi,j), and (ii)all the previous attention variables a<i,j . Note that Equa-tion 2 is just a direct application of the chain rule. Similar tolocal attention mechanisms, and to reduce the computationoverhead, we assume that a local spatial context δ(xi,j) is asufficient proxy for the image features X when computingai,j . Equation 3 describes the final factorization we assume.

One of the key challenges in estimating A based onEquation 3 is to efficiently compute the term a<i,j . Astraightforward solution is to use a recurrent neural network(e.g. LSTMs) to summarize the previously predicted atten-tion values a<i,j into its hidden state. This is a common

Page 4: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

approach employed in many sequence prediction methods[3, 26, 31]. However, while sequences have a well definedordering, image features can be traversed in multiple waysdue to their spatial nature. Naively parsing the image alongits rows using an LSTM, though provides an estimate fora<i,j , fails to correctly encode the necessary informationrequired to predict ai,j . As an example, the LSTM will tendto forget information from the neighbouring variable ai−1,j

as it was processed n time steps ago.To alleviate this issue, AttentionRNN instead parses the

image in a diagonal fashion, starting from a corner at thetop and going to the opposite corner in the bottom. It buildsupon the Diagonal BiLSTM layer proposed by [30] to ef-ficiently perform this traversal. The next section describesour proposed AttentionRNN mechanism in detail.

3.1. AttentionRNN

Our proposed structured attention mechanism buildsupon the Diagonal BiLSTM layer proposed by [30]. Weemploy two LSTMs, one going from the top-left to bottom-right corner (Ll) and the other from the top-right to thebottom-left corner (Lr).

As mentioned in Equation 3, for each ai,j , our objectiveis to estimate p (ai,j | a<i,j , δ(xi,j)). We assume that thiscan be approximated via a combination of two distributions.

p (ai,j |a<i,j) = Γ 〈p (ai,j |a<i,<j) , p (ai,j |a<i,>j)〉 (4)

where a<i,<j is the set of attention variables to the top andleft (blue region in Figure 3) of ai,j , a<i,>j is the set of at-tention variables to the top and right of ai,j (green region inFigure 3), and Γ is some combination function. For brevity,we omit explicitly writing δ(xi,j). Equation 4 is furthersimplified by assuming that all distributions are Gaussian.

p (ai,j |a<i,<j) ≈ N(µli,j , σ

li,j

)p (ai,j |a<i,>j) ≈ N

(µri,j , σ

ri,j

)p (ai,j |a<i,j) ≈ N (µi,j , σi,j)

(5)

where,

(µli,j , σli,j) = fl (a<i,<j) ; (µri,j , σ

ri,j) = fr (a<i,>j)

(µi,j , σi,j) = Γ(µli,j , σ

li,j , µ

ri,j , σ

ri,j

) (6)

fl and fr are fully connected layers. Our choice for thecombination function Γ is explained in Section 3.2. Foreach ai,j , Ll is trained to estimate (µli,j , σ

li,j), and Lr is

trained to estimate (µri,j , σri,j). We now explain the compu-

tation forLl. Lr is analogous and has the same formulation.Ll needs to correctly approximate a<i,<j in order to ob-

tain a good estimate of (µli,j , σli,j). As we are parsing the

image diagonally, from Figure 3 it can be seen that the fol-lowing recursive relation holds,

a<i,<j = f(a<i−1,<j , a<i,<j−1) (7)

Figure 3: Skewing operation. This makes it easier to com-pute convolutions along the diagonal. The arrows indicatedependencies between attention values. To obtain the im-age on the right, each row of the left image is offset by oneposition with respect to its previous row.

That is, for each location (i, j), Ll only needs to considertwo attention variables- one above and the other to the left;[30] show that this is sufficient for it to be able to obtaininformation from all the previous attention variables.

To make computations along the diagonal easier, similarto [30], we first skew X into a new image feature X. Figure3 illustrates the skewing procedure. Each row of X is offsetby one position with respect to the previous row. X is nowan image feature of size h ×m × (2n − 1). Traversing Xin a diagonal fashion from top left to bottom right is nowequivalent to traversing X along its columns from left toright. As spatial locations (i − 1, j) and (i, j − 1) in X

are now in the same column in X, we can implement therecursion described in Equation 7 efficiently by performingcomputations on an entire column of X at once.

Let Xj denote the jth column of X. Also, let hlj−1 andclj−1 respectively denote the hidden and memory state of Ll

before processing Xj . Both hlj−1 and clj−1 are tensors ofsize t × m, where t is the number of latent features. Thenew hidden and memory state is computed as follows.

[oj , fj , ij ,gj ] = σ(Kh ~ hlj−1 + Kx ~ Xc

j

)clj = fj � clj−1 + ij � gj

hlj = oj � tanh(clj)

(8)

Here ~ represents the convolution operation and � repre-sents element-wise multiplication. Kh is a 2 × 1 convolu-tion kernel which effectively implements the recursive rela-tion described in Equation 7, and Kx is a 1× 1 convolutionkernel. Both Kh and Ku produce a tensor of size 4t ×m.Xcj is the jth column of the skewed local context Xc, which

is obtained as follows.

Xc = skew (Kc ~X) (9)

where Kc is a convolutional kernel that captures a δ-sizecontext. For tasks that are multi-modal in nature, a query Qcan additionally be used to condition the generation of ai,j .This allows the model to generate different attention maskfor the same image features depending on Q. For example,

Page 5: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Figure 4: Block AttentionRNN for γ = 2. The input is firstdown-sized using a γ×γ convolutional kernel. Attention iscomputed on this smaller map.

in tasks like VQA, the relevant regions of an image willdepend on the question asked. The nature of Q will alsodictate the encoding procedure. As an example, if Q is anatural language question, it can be encoded using a LSTMlayer. Q can be easily incorporated into AttentionRNN byconcatenating it with Xc before passing it to Equation 8.

Let hl = {hl1, . . . , hl2n−1} be the set of all hidden statesobtained from Ll, and hl be the set obtained by applyingthe reverse skewing operation on hl. For each ai,j , a<i,<jis then simply the (i, j) spatial element of hl. a<i,>j canbe obtained by repeating the aforementioned process forLr,which traverses X from top-right to bottom-left. Note thatthis is equivalent to running Lr from top-left to bottom-right after mirroring X along the column dimension, andthen mirroring the output hidden states hr again. Similar to[30], hr is shifted down by one row to prevent a<i,>j fromincorporating future attention values.

Once a<i,<j and a<i,>j are computed (as discussedabove), we can obtain the Gaussian distribution for the at-tention variableN (µi,j , σi,j) by following Equation 6. Theattention ai,j could then be obtained by either sampling avalue from N (µi,j , σi,j) or simply by taking the expecta-tion and setting ai,j = µi,j . For most problems, as we willsee in the experiment section, taking the expectation is go-ing to be most efficient and effective. However, samplingmaybe useful in cases where attention is inherently multi-modal. Focusing on different modes using coherent masksmight be more beneficial in such situations.

3.2. Combination Function

The choice of the combination function Γ implicitly im-poses some constraints on the interaction between the dis-tributions N

(µli,j , σ

li,j

)and N

(µri,j , σ

ri,j

). For example,

assumption of independence would dictate a simple prod-uct for Γ, with resulting operations to produce (µi,j , σi,j)being expressed in closed form. However, it is clear thatindependence is unlikely to hold due to image correlations.To allow for a more flexible interaction between variablesand combination function, we instead use a fully connectedlayer to learn the appropriate Γ for a particular task.

µi,j , σi,j = fcomb(µli,j , σ

li,j , µ

ri,j , σ

ri,j

)(10)

(a) MREF (b) MDIST (c) MBG

Figure 5: Synthetic Dataset Samples. Example imagestaken from the three synthetic datasets proposed in [24].

3.3. Block AttentionRNN

Due to the poor performance of LSTMs over large se-quences, the AttentionRNN layer doesn’t scale well to largeimage feature maps. We introduce a simple modification tothe method described in Section 3.1 to alleviate this prob-lem, which we refer to as Block AttentionRNN (BRNN).

BRNN reduces the size of the input feature map X be-fore computing the attention mask. This is done by splittingX into smaller blocks, each of size γ×γ. This is equivalentto down-sampling the original image X to Xds as follows.

Xds = Kds ~X (11)

where Kds is a convolution kernel of size γ×γ applied withstride γ. In essence, each value in Xds now corresponds toa γ × γ region in X.

Instead of predicting a different attention probability foreach individual spatial location (i, j) in X, BRNN predictsa single probability for each γ × γ region. This is doneby first computing the attention mask Ads for the down-sampled image Xds using AttentionRNN (Section 3.1), andthen Ads is then scaled up using a transposed convolutionallayer to obtain the attention mask A for the original imagefeature X. Figure 4 illustrates the BRNN procedure.

BRNN essentially computes a coarse attention mask forX. Intuitively, this coarse attention can be used in the firstfew layers of a deep CNN network to identify the key regionblocks in the image. The later layers can use this coarseinformation to generate a more granular attention mask.

4. ExperimentsTo show the efficacy and generality of our approach, we

conduct experiments over four different tasks: visual at-tribute prediction, image classification, visual question an-swering (VQA) and image generation. We highlight thatour goal is not to necessarily obtain the absolute highest rawperformance (although we do in many of our experiments),but to show improvements from integrating AttentionRNNinto existing state-of-the-art models across a variety of tasksand architectures. Due to space limitations, all model archi-tectures and additional visualizations are described in thesupplementary material.

Page 6: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Attention MREF MDIST MBGRel.

RuntimeSAN [35] 83.42 80.06 58.07 1x¬CTX [25] 95.69 89.92 69.33 1.08xCTX [25] 98.00 95.37 79.00 1.10xARNN∼

ind 98.72 96.70 83.68

4.73xARNNind 98.58 96.29 84.27ARNN∼ 98.65 96.82 83.74ARNN 98.93 96.91 85.84

Table 1: Color prediction accuracy. Results are in % onMREF, MDIST and MBG datasets. Our AttentionRNN-based model, CNN+ARNN, outperforms all the baselines.

4.1. Visual Attribute Prediction

Datasets. We experiment on the synthetic MREF, MDISTand MBG datasets proposed in [25]. Figure 5 shows exam-ple images from the datasets. The images in the datasets arecreated from MNIST [17] by sampling five to nine distinctdigits with different colors (green, yellow, white, red, orblue) and varying scales (between 0.5 and 3.0). The datasetshave images of size 100 x 100 and only differ in how thebackground is generated. MREF has a black background,MDIST has a black background with some Gaussian noise,and MBG has real images sampled from the SUN Database[33] as background. The training, validation and test setscontain 30,000, 10,000 and 10,000 images respectively.Experimental Setup. The performance of AttentionRNN(ARNN) is compared against two local attention mecha-nisms proposed in [25], which are referred as ¬CTX andCTX. ARNN assumes ai,j = µi,j , δ = 3, where µi,j isdefined in Equation 10. To compute the attention for a par-ticular spatial location (i, j), CTX uses a δ = 3 local con-text around (i, j), whereas ¬CTX only uses the informa-tion from location (i, j). We additionally define three vari-ants of ARNN: i) ARNN∼ where each ai,j is sampled fromN (µi,j , σi,j), ii) ARNNind where the combination func-tion Γ assumes the input distributions are independent, andiii) ARNN∼

ind where Γ assumes independence and ai,j issampled. The soft attention mechanism (SAN) proposed by[35] is used as an additional baseline. The same base CNNarchitecture is used for all the attention mechanisms for faircomparison. The CNN is composed of four stacks of 3× 3convolutions with 32 channels followed by 2× 2 max pool-ing layer. SAN computes attention only on the output ofthe last convolution layer, while ¬CTX, CTX and all vari-ants of ARNN are applied after each pooling layer. Givenan image, the models are trained to predict the color of thenumber specified by a query. Chance performance is 20%.Results. Table 1 shows the color prediction accuracy of var-ious models on MREF, MDIST and MBG datasets. It canbe seen that ARNN and all its variants clearly outperformthe other baseline methods. The difference in performance

Attention Corr. Scale0.5-1.0 1.0-1.5 1.5-2.0 2.0-2.5 2.5-3.0

SAN [35] 0.15 53.05 74.85 72.18 59.52 54.91¬CTX [25] 0.28 68.20 76.37 73.30 61.60 57.28CTX [25] 0.31 77.39 87.13 84.96 75.59 63.72ARNN∼

ind 0.36 82.23 89.41 86.46 84.52 81.35ARNNind 0.34 82.89 89.47 88.34 84.22 80.00ARNN∼ 0.39 82.23 89.41 86.46 84.52 81.35ARNN 0.42 84.45 91.40 86.84 88.39 82.37

Table 2: Mask Correctness and Scale experiment onMBG. The “Corr.” column lists the mask correctness met-ric proposed by [19]. The “Scale” column shows the colorprediction accuracy in % for different scales.

is amplified for the more noisy MBG dataset, where ARNNis 6.8% better than the closest baseline. ARNNind performspoorly compared to ARNN, which furthers the reasoning ofusing a neural network to model Γ instead of assuming in-dependence. Similar to [25], we also evaluate the modelson their sensitivity to the size of the target. The test set isdivided into five uniform scale intervals for which model ac-curacy is computed. Table 2 shows the results on the MBGdataset. ARNN is robust to scale variations and performsconsistently well on small and large targets. We also testthe correctness of the mask generated using the metric pro-posed by [19], which computes the percentage of attentionvalues in the region of interest. For models that apply at-tention after each pooling layer, the masks from differentlayers are combined by upsampling and taking a productover corresponding pixel values. The results are shown forthe MBG dataset in Table 2. ARNN is able to more accu-rately attend to the correct regions, which is evident fromthe high correctness score.

From Tables 1 and 2, it can be seen that ARNN∼ pro-vides no significant advantage over its deterministic coun-terpart. This can be attributed to the datasets encouragingpoint estimates, as each input query can only have one cor-rect answer. As a consequence, for each ai,j , σi,j was ob-served to underestimate variance. However, in situationswhere an input query can have multiple correct answers,ARNN∼ can be used to generate diverse attention masks.To corroborate this claim, we test the pre-trained ARNN∼

on images that are similar to the MBG dataset but have thesame digit in multiple colors. Figure 6a shows the individ-ual layer attended feature maps for three different samplesfrom ARNN∼ for a fixed image and query. For the query“9”, ARNN∼ is able to identify the three modes. Note thatsince σi,j’s were underestimated due to aforementioned rea-sons, they were scaled up before generating the samples.Despite being underestimated σi,j’s still capture crucial in-formation.Inverse Attribute Prediction. Figure 6a leads to an inter-esting observation regarding the nature of the task. Even

Page 7: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Figure 6: Qualitative analysis of the attention masks. (a) Layer-wise attended feature maps sampled from ARNN∼. Thesamples span all the modes in the image. (b) Layer-wise attended feature maps generated by different mechanisms visualizedon images from MBGinv dataset. Additional visualizations are shown in the supplementary material.

ScaleTotal 0.5-1.0 1.0-1.5 1.5-2.0 2.0-2.5 2.5-3.0

NONE 91.43 85.63 92.57 94.96 94.77 93.59ARNN 91.09 84.89 92.25 94.24 94.70 94.52

BRNNγ=3 91.67 85.97 93.46 94.81 94.35 93.68BRNNγ=2 92.68 88.10 94.23 95.32 94.80 94.01

Table 3: Block AttentionRNN. Ablation results on MBGb

dataset. AttentionRNN (ARNN) and Block AttentionRNN(BRNN) with block sizes of 2 and 3 are compared.

though ARNN∼ is able to identify the correct number, itonly needs to focus on a tiny part of the target region tobe able to accurately classify the color. To further demon-strate the ARNN’s ability to model longer dependencies,we test the performance of ARNN, CTX and ¬CTX on theMBGinv dataset, which defines the inverse attribute pre-diction problem - given a color identify the number corre-sponding to that color. The base CNN architecture is iden-tical to the one used in the previous experiment. ARNN,CTX and ¬CTX achieve an accuracy of 72.77%, 66.37%and 40.15% and a correctness score[19] of 0.39, 0.24 and0.20 respectively. Figure 6b shows layer-wise attended fea-ture maps for the three models. ARNN is able to capturethe entire number structure, whereas the other two methodsonly focus on a part of the target region. Even though CTXuses some local context to compute the attention masks, itfails to identify the complete structure for the number “0”.A plausible reason for this is that a 3×3 local context is toosmall to capture the entire target region. As a consequence,the attention mask is computed in patches. CTX maintainsno information about the previously computed attention val-ues, and therefore is unable to assign correlated attentionscores for all the different target region patches. ARNN, onthe other hand, captures constraints between attention vari-ables, making it much more effective in this situation.Scalability of ARNN. The results shown in Table 1 corre-spond to models trained on 100× 100 input images, wherethe first attention layer is applied on an image feature of size50 × 50. To analyze the performance of ARNN on com-

paratively larger image features, we create a new dataset of224×224 images which we refer to as MBGb. The data gen-eration process for MBGb is identical to MBG. We performan ablation study analyzing the effect of using Block Atten-tionRNN (BRNN) (Section 3.3) instead of ARNN on largerimage features. For the base architecture, the ARNN modelfrom the previous experiment is augmented with an addi-tional stack of convolutional and max pooling layer. Thedetailed architecture is mentioned in the supplementary ma-terial. Table 3 shows the color prediction accuracy on differ-ent scale intervals for the MBGb dataset. As the first atten-tion layer is now applied on a feature map of size 112×112,ARNN performs worse than the case when no attention(NONE) is applied due to the poor tractability of LSTMsover large sequences. BRNNγ=2, on the other hand, is ableto perform better as it reduces the image feature size beforeapplying attention. However, there is a considerable differ-ence in the performance of BRNN when γ = 2 and γ = 3.When γ = 3, BRNN applies a 3×3 convolution with stride3. This aggressive size reduction causes loss of information.

4.2. Image ClassificationDataset. We use the CIFAR-100 [15] to verify the perfor-mance of AttentionRNN on the task of image classification.The dataset consists of 60,000 32 × 32 images from 100classes. The training/test set contain 50,000/10,000 images.Experimental Setup. We augment ARNN to the convo-lution block attention module (CBAM) proposed by [32].For a given feature map, CBAM computes two differenttypes of attentions: 1) channel attention that exploits theinter-channel dependencies in a feature map, and 2) spatialattention that uses local context to identify relationships inthe spatial domain. We replace only the spatial attention inCBAM with ARNN. This modified module is referred toas CBAM+ARNN. ResNet18 [8] is used as the base modelfor our experiments. ResNet18+CBAM is the model ob-tained by using CBAM in the Resnet18 model, as describedin [32]. Resnet18+CBAM+ARNN is defined analogously.We use a local context of 3×3 to compute the spatial atten-tion for both CBAM and CBAM+ARNN.

Page 8: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Top-1Error (%)

Top-5Error (%)

Rel.Runtime

ResNet18 [8] 25.56 6.87 1xResNet18 + CBAM [32] 25.07 6.57 1.43xResNet18 + CBAM + ARNN 24.18 6.42 4.81x

Table 4: Performance on Image Classification. The Top-1and Top-5 error % are shown for all the models. The ARNNbased model outperforms all other baselines.

Yes/No Number Other TotalRel.

RuntimeMCB [5] 76.06 35.32 43.87 54.84 1xMCB+ATT [5] 76.12 35.84 47.84 56.89 1.66xMCB+ARNN 77.13 36.75 48.23 57.58 2.46x

Table 5: Performance on VQA. In % accuracy.

Results. Top-1 and top-5 error is used to evaluate the per-formance of the models. The results are summarized in Ta-ble 4. CBAM+ARNN provides an improvement of 0.89%on top-1 error over the closest baseline. Note that this gain,though seems marginal, is larger than what CBAM obtainsover ResNet18 with no attention (0.49% on top-1 error).

4.3. Visual Question Answering

Dataset. We evaluate the performance of ARNN on thetask of VQA [2]. The experiments are done on the VQA2.0 dataset [7], which contains images from MSCOCO [18]and corresponding questions. As the test set is not publiclyavailable, we evaluate performance on the validation set.Experimental Setup. We augment ARNN to the Multi-modal Compact Bilinear Pooling (MCB) architecture pro-posed by [5]. This is referred to as MCB+ARNN. Note thateven though MCB doesn’t give state-of-the-art performanceon this task, it is a competitive baseline that allows for easyARNN integration. MCB+ATT is a variant to MCB thatuses a local attention mechanism with δ = 1 from [5]. Forfair comparison, MCB+ARNN also uses a δ = 1 context.Results. The models are evaluated using the accuracy mea-sure defined in [2]. The results are summarized in Table5. MCB+ARNN achieves a 0.69% improvement over theclosest baseline. We believe this marginal improvement isbecause all the models, for each spatial location (i, j), useno context from neighbouring locations (as δ = 1).

4.4. Image Generation

Dataset. We analyze the effect of using ARNN on thetask of image generation. Experiments are performed onthe CelebA dataset [20], which contains 202,599 face im-ages of celebrities, with 40 binary attributes. The data pre-processing is identical to [40]. The models are evaluated onthree attributes: hair color = {black, blond, brown}, gender= {male, female}, and smile = {smile, nosmile}.

Figure 7: Qualitative Results on ModularGAN. Atten-tion masks generated by original ModularGAN [40] andModularGAN augmented with ARNN are shown. Noticethat the hair mask is more uniform for MGAN+ARNN asit is able to encode structural dependencies in the attentionmask. Additional results shown in supplementary material.

Hair Gender SmileRel.

RuntimeMGAN [40] 2.5 3.2 12.6 1xMGAN+ARNN 3.0 1.4 11.4 1.96x

Table 6: Performance on Image Generation. ResNet18[8] Classification Errors (%) for each attribute transforma-tion. ARNN achieves better performance on two tasks.

Experimental Setup. We compare ARNN to a local atten-tion mechanism used in the ModularGAN (MGAN) frame-work [40]. MGAN uses a 3 × 3 local context to obtainattention values. We define MGAN+ARNN as the networkobtained by replacing the local attention with ARNN. Themodels are trained to transform an image given an attribute.Results. To evaluate the performance of the models, sim-ilar to [40], we train a ResNet18 [8] model that classifiesthe hair color, facial expression and gender on the CelebAdataset. The trained classifier achieves an accuracy of93.9%, 99.0% and 93.7% on hair color, gender and smile re-spectively. For each transformation, we pass the generatedimages through this classifier and compute the classificationerror (shown in Table 6). MGAN+ARNN outperforms thebaseline on all categories except hair color. To analyse thisfurther, we look at the attention masks generated for the haircolor transformation by both models. As shown in Figure7, we observe that the attention masks generated by MGANlack coherence over the target region due to discontinuities.MGAN+ARNN, though has a slightly higher classificationerror, generates uniform activation values over the target re-gion by encoding structural dependencies.

5. ConclusionIn this paper, we developed a novel structured spa-

tial attention mechanism which is end-to-end trainable andcan be integrated with any feed-forward convolutional neu-ral network. The proposed AttentionRNN layer explic-itly enforces structure over the spatial attention variablesby sequentially predicting attention values in the spatialmask. Experiments show consistent quantitative and quali-tative improvements on a large variety of recognition tasks,datasets and backbone architectures.

Page 9: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Supplementary Material

Section 1 explains the architectures for the models used in the experiments (Section 4 in the main paper). Section 2provides additional visualizations for the task of Visual Attribute Prediction (Section 4.1 in the main paper) and Image Gen-eration (Section 4.4 in the main paper). These further show the effectiveness of our proposed structured attention mechanism.

1. Model Architectures1.1. Visual Attribute Prediction

Please refer to Section 4.1 of the main paper for the task definition. Similar to [25], the base CNN architecture is composedof four stacks of 3×3 convolutions with 32 channels followed by 2×2 max pooling layer. SAN computes attention only on theoutput of the last convolution layer, while ¬CTX, CTX and all variants of ARNN are applied after each pooling layer. Table7 illustrates the model architectures for each network. {¬CTX, CTX, ARNN}sigmoid refers to using sigmoid non-linearityon the generated attention mask before applying it to the image features. Similarly, {¬CTX, CTX, ARNN}softmax refers tousing softmax non-linearity on the generated attention mask. We use the same hyper-parameters and training procedure forall models, which is identical to [25].

For the scalability experiment described in Section 4.1, we add an additional stack of 3 × 3 convolution layer followedby a 2 × 2 max pooling layer to the ARNN architecture described in Table 7. This is used as the base architecture. Table 8illustrates the differences between the models used to obtain results mentioned in Table 3 of the main paper.

SAN ¬CTX CTX ARNN

conv1 (3x3@32)

pool1 (2x2)

↓ ¬CTXsigmoid CTXsigmoid ARNNsigmoid

conv2 (3x3@32)

pool2 (2x2)

↓ ¬CTXsigmoid CTXsigmoid ARNNsigmoid

conv3 (3x3@32)

pool3 (2x2)

↓ ¬CTXsigmoid CTXsigmoid ARNNsigmoid

conv4 (3x3@32)

pool4 (2x2)

SAN ¬CTXsoftmax CTXsoftmax ARNNsoftmax

Table 7: Architectures for the models used in Section 4.1 of the main paper. ↓ implies that the previous and the next layer aredirectly connected. The input is passed to the top-most layer. The computation proceeds from top to bottom.

1.2. Image Classification

Please refer to Section 4.2 of the main paper for the task definition. We augment ARNN to the convolution block attentionmodule (CBAM) proposed by [32]. For a given feature map, CBAM computes two different types of attentions: 1) channel

Page 10: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

NONE ARNN BRNN

conv1 (3x3@32)

pool1 (2x2)

↓ ARNNsigmoid BRNNδsigmoidARNN (described in Table 7)

Table 8: Model architectures for the scalability study described in Section 4.1 of the main paper. ↓ implies that the previousand the next layer are directly connected. ARRN in defined in Table 7.

attention that exploits the inter channel dependencies in a feature map, and 2) spatial attention that uses local context toidentify relationships in the spatial domain. Figure 8a shows the CBAM module integrated with a ResNet [8] block. Wereplace only the spatial attention in CBAM with ARNN. This modified module is referred to as CBAM+ARNN. Figure 8bbetter illustrates this modification. Both CBAM and CBAM+ARNN use a local context of 3 × 3 to compute attention. Weuse the same hyper-parameters and training procedure for both CBAM and CBAM+ARNN, which is identical to [32].

(a) CBAM module

(b) CBAM+ARNN moduleFigure 8: Difference between CBAM and CBAM+ARNN. (a) CBAM[32] module integrated with a ResNet[8] block. (b)CBAM+ARNN replaces the spatial attention in CBAM with ARNN. It is applied similar to (a) after each ResNet[8] block.Refer to Section 4.2 of the main paper for more details.

1.3. Visual Question Answering

Please refer to Section 4.3 of the main paper for task definition. We use the Multimodal Compact Bilinear Pooling withAttention (MCB+ATT) architecture proposed by [5] as a baseline for our experiment. To compute attention, MCB+ATT usestwo 1× 1 convolutions over the features obtained after using the compact bilinear pooling operation. Figure 9a illustrates thearchitecture for MCB+ATT. We replace this attention with ARNN to obtain MCB+ARNN. MCB+ARNN also uses a 1 × 1

Page 11: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

local context to compute attention. Figure 9b better illustrates this modification. We use the same hyper-parameters andtraining procedure for MCB, MCB+ATT and MCB+ARNN, which is identical to [5].

(a) MCB+ATT

(b) MCB+ARNN

Figure 9: Difference between MCB+ATT and MCB+ARNN. (a) MCB+ATT model architecture proposed by [5]. It uses a1×1 context to compute attention over the image features. (b) MCB+ARNN replaces the attention mechanism in MCB+ATTwith ARNN. It is applied in the same location as (a) with 1 × 1 context. Refer to Section 4.3 of the main paper for moredetails.

1.4. Image Generation

Please refer to Section 4.4 of the main paper for task definitions. We compare ARNN to a local attention mechanism usedin the ModularGAN (MGAN) framework [40]. MGAN consists of three modules: 1) encoder module that encodes an inputimage into an intermediate feature representation, 2) generator module that generates an image given an intermediate featurerepresentation as input, and 3) transformer module that transforms a given intermediate representation to a new intermediaterepresentation according to some input condition. The transformer module uses a 33 local context to compute attention overthe feature representations. Figure 10a illustrates the transformer module proposed by [40]. We define MGAN+ARNN as the

Page 12: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

network obtained by replacing this local attention mechanism in the transformer module with ARNN. Note that the generatorand encoder modules are unchanged. MGAN+ARNN also uses a 3× 3 local context to compute attention. Figure 10b betterillustrates this modification to the transformer module. We use the same hyper-parameters and training procedure for bothMGAN and MGAN+ARNN, which is identical to [40].

(a) Transformer module for MGAN

(b) Transformer module for MGAN+ARNN

Figure 10: Difference between MGAN and MGAN+ARNN. (a) The transformer module for the ModularGAN (MGAN)architecture proposed by [40]. It uses a 3 × 3 local context to compute attention over the intermediate features. (b)MGAN+ARNN replaces the attention mechanism in MGAN with ARNN. It is applied in the same location as (a) with3 × 3 local context. Note that the generator and encoder modules in MGAN and MGAN+ARNN are identical. Refer toSection 4.4 of the main paper for more details.

Page 13: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

2. Additional Visualizations2.1. Visual Attribute Prediction

Please refer to Section 4.1 of the main paper for task definition. Figures 11 - 13 show the individual layer attended featuremaps for three different samples from ARNN∼ for a fixed image and query. It can be seen that ARNN∼ is able to identifythe different modes in each of the images.

Figure 11: Qualitative Analysis of Attention Masks sampled from ARNN∼. Layer-wise attended feature maps sampledfrom ARNN∼ for a fixed image and query. The masks are able to span the different modes in the image. For detailedexplanation see Section 4.1 of the main paper.

Page 14: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Figure 12: Qualitative Analysis of Attention Masks sampled from ARNN∼. Layer-wise attended feature maps sampledfrom ARNN∼ for a fixed image and query. The masks are able to span the different modes in the image. For detailedexplanation see Section 4.1 of the main paper.

Page 15: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Figure 13: Qualitative Analysis of Attention Masks sampled from ARNN∼. Layer-wise attended feature maps sampledfrom ARNN∼ for a fixed image and query. The masks are able to span the different modes in the image. For detailedexplanation see Section 4.1 of the main paper.

Page 16: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

2.2. Inverse Attribute Prediction

Please refer to Section 4.1 of the main paper for task definition. Figures 14 - 16 show the individual layer attended featuremaps comparing the different attention mechanisms on the MBGinv dataset. It can be seen that ARNN captures the entirenumber structure, whereas the other two methods only focus on a part of the target region or on some background region withthe same color as the number, leading to incorrect predictions.

Figure 14: Qualitative Analysis of Attention Masks on MBGinv . Layer-wise attended feature maps generated by differentmechanisms visualized on images from MBGinv dataset. ARNN is able to capture the entire number structure, whereas theother two methods only focus on a part of the target region or on some background region with the same color as the targetnumber. For detailed explanation see Section 4.1 of the main paper.

Page 17: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Figure 15: Qualitative Analysis of Attention Masks on MBGinv . Layer-wise attended feature maps generated by differentmechanisms visualized on images from MBGinv dataset. ARNN is able to capture the entire number structure, whereas theother two methods only focus on a part of the target region or on some background region with the same color as the targetnumber. For detailed explanation see Section 4.1 of the main paper.

Page 18: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Figure 16: Qualitative Analysis of Attention Masks on MBGinv . Layer-wise attended feature maps generated by differentmechanisms visualized on images from MBGinv dataset. ARNN is able to capture the entire number structure, whereas theother two methods only focus on a part of the target region or on some background region with the same color as the targetnumber. For detailed explanation see Section 4.1 of the main paper.

Page 19: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

2.3. Image Generation

Please refer to Section 4.4 of the main paper for task definition. Figures 17 and 18 show the attention masks generated byMGAN and MGAN+ARNN for the task of hair color transformation. MGAN+ARNN encodes structural dependencies inthe attention values, which is evident from the more uniform and continuous attention masks. MGAN, on the other hand, hassharp discontinuities which, in some cases, leads to less accurate hair color transformations.

Figure 17: Qualitative Results for Image Generation. Attention masks generated by MGAN and MGAN+ARNN areshown. Notice that the hair mask is more uniform for MGAN+ARNN as it is able to encode structural dependencies in theattention mask. For detailed explanation see Section 4.4 of the main paper.

Page 20: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

Figure 18: Qualitative Results for Image Generation. Attention masks generated by MGAN and MGAN+ARNN areshown. Notice that the hair mask is more uniform for MGAN+ARNN as it is able to encode structural dependencies in theattention mask. For detailed explanation see Section 4.4 of the main paper.

Page 21: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

References[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural

module networks. In IEEE Conference on Computer Visionand Pattern Recognition, 2016. 2

[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L.Zitnick, and D. Parikh. Vqa: Visual question answering. InIEEE International Conference on Computer Vision, pages2425–2433, 2015. 1, 8

[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine trans-lation by jointly learning to align and translate. In Interna-tional Conference on Learning Representations, 2015. 4

[4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, andT.-S. Chua. Sca-cnn: Spatial and channel-wise attention inconvolutional networks for image captioning. In IEEE Con-ference on Computer Vision and Pattern Recognition, pages6298–6306. IEEE, 2017. 1, 2

[5] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, andM. Rohrbach. Multimodal compact bilinear pooling for vi-sual question answering and visual grounding. In Confer-ence on Empirical Methods in Natural Language Processing,pages 457–468. ACL, 2016. 1, 2, 3, 8, 10, 11

[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Gen-erative adversarial nets. In Advances in Neural InformationProcessing Systems, pages 2672–2680, 2014. 3

[7] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, andD. Parikh. Making the v in vqa matter: Elevating the roleof image understanding in visual question answering. InIEEE Conference on Computer Vision and Pattern Recog-nition, pages 6325–6334, 2017. 8

[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learningfor image recognition. In IEEE Conference on ComputerVision and Pattern Recognition, pages 770–778, 2016. 7, 8,10

[9] S. Hochreiter and J. Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997. 3

[10] R. Hu, J. Andreas, K. Saenko, and T. Darrell. Explainableneural computation via stack neural module networks. InEuropean Conference on Computer Vision, 2018. 2

[11] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatialtransformer networks. In Advances in Neural InformationProcessing Systems, pages 2017–2025, 2015. 2

[12] J. Johnson, B. Hariharan, L. van der Maaten, F.-F. Li, L. Zit-nick, and R. Girshick. Clevr: A diagnostic dataset for com-positional language and elementary visual reasoning. InIEEE Conference on Computer Vision and Pattern Recog-nition, 2017. 1

[13] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fullyconvolutional localization networks for dense captioning. InIEEE Conference on Computer Vision and Pattern Recogni-tion, 2016. 1

[14] D. P. Kingma and M. Welling. Auto-encoding variationalbayes. stat, 1050:10, 2014. 3

[15] A. Krizhevsky. Learning multiple layers of features fromtiny images. Technical report, Citeseer, 2009. 7

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. In

Advances in Neural Information Processing Systems, pages1097–1105, 2012. 1

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceed-ings of the IEEE, 86(11):2278–2324, 1998. 6

[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-manan, P. Dollar, and C. L. Zitnick. Microsoft coco: Com-mon objects in context. In European conference on computervision, pages 740–755, 2014. 8

[19] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention correct-ness in neural image captioning. In AAAI, pages 4176–4182,2017. 6, 7

[20] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning faceattributes in the wild. In IEEE International Conference onComputer Vision, 2015. 8

[21] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchicalquestion-image co-attention for visual question answering.In Advances In Neural Information Processing Systems,pages 289–297, 2016. 1, 2, 3

[22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochasticbackpropagation and approximate inference in deep genera-tive models. International Conference on Machine Learning,2014. 3

[23] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixel-cnn++: Improving the pixelcnn with discretized logistic mix-ture likelihood and other modifications. ICLR, 2017. 3

[24] P. H. Seo, A. Lehrmann, B. Han, and L. Sigal. Visual refer-ence resolution using attention memory for visual dialog. InAdvances in Neural Information Processing Systems, pages3719–3729, 2017. 1, 2, 3, 5

[25] P. H. Seo, Z. Lin, S. Cohen, X. Shen, and B. Han. Progres-sive attention networks for visual attribute prediction. BritishMachine Vision Conference, 2018. 1, 2, 3, 6, 9

[26] S. Shankar and S. Sarawagi. Posterior attention models forsequence to sequence learning. In International Conferenceon Learning Representations, 2019. 4

[27] Y. Tang, N. Srivastava, and R. R. Salakhutdinov. Learninggenerative models with visual attention. In Advances in Neu-ral Information Processing Systems, 2014. 1

[28] T. Tommasi, A. Mallya, B. Plummer, S. Lazebnik, A. Berg,and T. Berg. Solving visual madlibs with multiple cues. InBritish Machine Vision Conference, 2016. 1

[29] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals,A. Graves, et al. Conditional image generation with pixel-cnn decoders. In Advances in Neural Information ProcessingSystems, pages 4790–4798, 2016. 3

[30] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixelrecurrent neural networks. In IEEE International Conferenceon Machine Learning, pages 1747–1756, 2016. 2, 3, 4, 5

[31] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney,T. Darrell, and K. Saenko. Sequence to sequence-video totext. In Proceedings of the IEEE international conference oncomputer vision, pages 4534–4542, 2015. 4

[32] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon. Cbam: Convo-lutional block attention module. In European Conference onComputer Vision, pages 3–19, 2018. 1, 2, 3, 7, 8, 9, 10

Page 22: Siddhesh Khandelwal Leonid Sigal - arXivSiddhesh Khandelwal University of British Columbia skhandel@cs.ubc.ca Leonid Sigal University of British Columbia lsigal@cs.ubc.ca Abstract

[33] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva.Sun database: Exploring a large collection of scene cate-gories. International Journal of Computer Vision, 119(1):3–22, 2016. 6

[34] H. Xu and K. Saenko. Ask, attend and answer: Exploringquestion-guided spatial attention for visual question answer-ing. In European Conference on Computer Vision, pages451–466, 2016. 1, 2

[35] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudi-nov, R. Zemel, and Y. Bengio. Show, attend and tell: Neuralimage caption generation with visual attention. In Interna-tional Conference on Machine Learning, pages 2048–2057,2015. 1, 2, 3, 6

[36] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stackedattention networks for image question answering. In IEEEConference on Computer Vision and Pattern Recognition,pages 21–29, 2016. 1, 2

[37] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image caption-ing with semantic attention. In IEEE conference on computervision and pattern recognition, pages 4651–4659, 2016. 1, 2

[38] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In arXiv preprintarXiv:1805.08318, 2018. 1

[39] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, andD. Metaxas. Stackgan: Text to photo-realistic image syn-thesis with stacked generative adversarial networks. IEEEInternational Conference on Computer Vision, 2017. 3

[40] B. Zhao, B. Chang, Z. Jie, and L. Sigal. Modular generativeadversarial networks. European Conference on ComputerVision, 2018. 1, 2, 3, 8, 11, 12

[41] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attentionconvolutional neural network for fine-grained image recogni-tion. In IEEE International Conference on Computer Vision,volume 6, 2017. 1, 2

[42] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7w:Grounded question answering in images. In IEEE Confer-ence on Computer Vision and Pattern Recognition, pages4995–5004, 2016. 1, 2, 3