
Dual Attention on Pyramid Feature Maps for Image Captioning



Litao Yu, Jian Zhang, Qiang Wu
University of Technology Sydney

[email protected], [email protected], [email protected]

Abstract

Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, with full consideration of the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. On the other hand, the contextual information can help re-calibrate the importance of feature components by learning the channel-wise dependencies, to improve the discriminative power of visual features for better content description. We conducted comprehensive experiments on three well-known datasets, Flickr8K, Flickr30K and MS COCO, and achieved impressive results in generating descriptive and smooth natural sentences from images. Using either convolutional visual features or more informative bottom-up attention features, our composite captioning model achieves very promising performance in a single-model mode. The proposed pyramid attention and dual attention methods are highly modular and can be inserted into various image captioning modules to further improve performance.

1 Introduction

The image captioning task is to apply machine learning methods to reveal the image contents and generate descriptive natural sentences, which combines visual understanding in computer vision and machine translation in natural language processing. The target of image captioning is to bridge the semantic gap between visual feature descriptors and human languages. Usually, an image captioning framework consists of two sub-models: an encoder that extracts the visual features from images, and a decoder that translates the visual properties into natural sentences. The encoder is essentially a pre-trained convolutional neural network (CNN), and the decoder is mainly a recurrent neural network (RNN)

Figure 1: Single-scale vs. multi-scale attention maps. (a) Single-scale attention; (b) multi-scale attention. Caption: a man is riding on a motorcycle.

that controls the word flow in the sentences. Thus, the challenges of image captioning are two-fold: first, as a visual understanding task, the feature representation should be discriminative enough to determine what visual properties to describe; second, the language model should be able to generate sentences that accurately describe the semantics in images. The second requirement differs from cross-modal retrieval tasks [Wang et al., 2017; Xu et al., 2017b; Li et al., 2016], which do not need to assemble words and form smooth sentences.

From the perspective of instance-based visual understanding, image captioning is a many-to-many learning task, i.e., both image inputs and sentence outputs contain multiple instances, and they should be well correlated for sentence generation. Inferring from a single global visual feature vector is not a good option for describing multiple instances in an image, because the many-to-many correlations between the two modalities are usually lost; aggregating local visual descriptors is therefore more suitable for generating the word for a salient region at each time step of the sentence. With the help of the partial contextual information, such aggregation is feasible via the control of the RNN module in the decoder, and these salient regions can be localized in accordance with the temporal order of the natural sentence. This mechanism for visual feature selection is called attention. The temporal attention model was originally designed for the task of machine translation [Zhang et al., 2018a; Cheng, 2019]. As a kind of image-to-sentence translation, attention models have also been used in learning-based image captioning. In [Xu et al., 2015], the authors introduced



two attention-based image captioning models under a common framework: a "soft" deterministic attention and a "hard" stochastic attention. The former model is trained by the standard back-propagation method while the latter is trained with a reinforcement learning approach. Following this work, various image captioning models have been proposed to further boost the captioning performance [You et al., 2016; Chen et al., 2017; Yao et al., 2017; Anderson et al., 2018; Qin et al., 2019]. While encouraging results on public datasets have been reported, the performance gains obtained from the latest image captioning models mainly rely on additional deep learning models such as attribute learning [Yao et al., 2017] and region proposal networks [Anderson et al., 2018] to obtain fine-tuned semantic information. Such conditions are not always satisfied, either because of the lack of label annotation or because of computational resource limitations.

We rethink the importance of visual feature representation for better attention in a single inference model, to improve the performance of image captioning. In the decoder of an attention-based image captioning model, the RNN controls which small region of the input image to gaze at at each time step. Such a region is represented by a feature vector aggregated by a softmax function or a hard mask function over a salient area of the input image. However, objects usually appear at multiple scales, so it is inaccurate if the attention module just looks at one position at a single scale. Consequently, the single-scale spatial attention may lead to mismatched visual-semantic relationships because of inconspicuous classes. As illustrated in Figure 1, the motorcycle should be represented in a larger receptive field of the image, and such 2D areas usually cannot be well described via a softmax function over the whole convolutional feature maps. In [Anderson et al., 2018], the authors proposed to use a pre-trained object detector to estimate the bounding boxes of the entities, forming a bottom-up feature representation. Such an attention mechanism has become a very good starting point for improving the semantically correct mapping for image captioning [Huang et al., 2019a; Yao et al., 2019; Qin et al., 2019].

In this paper, we propose to apply a pyramid attention module on either convolutional or bottom-up image features to form multi-scale feature representations and thus generate richer attention. When using the convolutional features, without the help of an auxiliary object detector the attention can still capture the visual properties accurately to facilitate the image description. When using the bottom-up image features, the pyramid attention yields semantic hierarchies. In either case, the pyramid feature representations can outperform the single-scale feature maps. Furthermore, we employ a dual attention module in the decoder that operates from both the spatial and the channel perspective. The spatial attention is conducted over local image areas given the contextual information of the RNN hidden state, while the channel attention re-calibrates the feature components from a different view of the feature maps. With such settings, both the spatial attention on pyramid feature maps and the dual attention on the visual feature representations can improve the image feature representations separately. When combining the two modules to form a unified image captioning framework, our proposed

method is able to achieve very competitive performance as a single image captioning model in terms of BLEU [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], ROUGE-L [Lin, 2004], and CIDEr-D [Vedantam et al., 2015], on three publicly available datasets: Flickr8K [Hodosh et al., 2013], Flickr30K [Young et al., 2014] and MS COCO [Lin et al., 2014].

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 elaborates the proposed dual attention on pyramid feature maps for image captioning. Experimental results and analysis are presented in Section 4. Finally, Section 5 concludes the paper.

2 Related work

2.1 Image captioning

Inspired by the successful use of convolutional neural networks (CNNs) in computer vision and recurrent neural networks (RNNs) in natural language processing, a large number of deep learning-based visual captioning methods have been proposed in recent years. The pioneering work using the encoder-decoder structure to generate natural sentences was proposed by Vinyals et al. [Vinyals et al., 2015], and this is still the most widely used captioning structure. In [Aneja et al., 2018], Aneja et al. proposed an alternative method using a 1D convolutional decoder to generate sentences, which has performance comparable to RNN-based decoders while training faster per number of parameters. The research targets of image captioning can be categorized into the following three aspects: (1) how to accurately describe the image contents [Qin et al., 2019; Yao et al., 2017], (2) how to improve the training strategy [Liu et al., 2017b], and (3) how to evaluate the captioning model [Cui et al., 2018]. In our work, we mainly focus on how to prepare better visual feature representations to improve captioning quality. A related visual understanding task is video captioning, which also aims to generate natural sentences but additionally considers dynamic properties [Xu et al., 2020; Yan et al., 2019].

2.2 Feature pyramid

In image processing, the feature pyramid was heavily used in the era of hand-engineered features, and it remains a common scheme in deep learning. The advantage of pyramid features is that the model can "observe" receptive fields at multiple scales to localize and recognize the target patterns in a hierarchical manner. For example, in semantic segmentation, pyramid features can assist the dense prediction at pixel level by considering the contextual visual cues [Zhao et al., 2017]. In [Lin et al., 2017], a pyramid strategy combines low-resolution but semantically strong features with high-resolution but semantically weak features for object detection. Inspired by these successful uses of pyramid features, the model proposed in this paper also benefits from this strategy to prepare better visual feature representations and thus generate accurate image descriptions.

2.3 Attention models

Attention models have been widely used in many machine learning tasks including machine translation [Zhang et al., 2018a; Cheng, 2019],


image classification [Hu et al., 2018] and semantic segmentation [Fu et al., 2019]. The general idea of the attention mechanism is to apply gated weights to enhance or suppress feature components. In [Hu et al., 2018], the authors proposed a squeeze-and-excitation module to enhance the network from the perspective of channel-wise relationships. The motivation is to explicitly model channel inter-dependencies within the module by selectively enhancing useful features and suppressing useless ones. In image captioning, various attention methods have been proposed to improve sentence generation [Chen et al., 2017; Anderson et al., 2018; You et al., 2016]. Specifically, [Anderson et al., 2018] uses a pre-trained object detector to naturally form a bottom-up attention, so the 2D visual patterns in images can be accurately localized. In machine translation, the recently proposed transformer model [Vaswani et al., 2017] can well capture the contextual dependencies in natural languages, and it was then applied to the attention on attention (AoA) model for image captioning [Huang et al., 2019a]. Inspired by the above attention models, our proposed image captioning model applies the attention mechanism mainly from the perspective of visual feature representations, to fully explore the visual-semantic correlations between image feature maps and natural sentences.

3 Method

In this section, we first review the general framework of the spatial attention-based image captioning model, then introduce the proposed learning methods. Specifically, we detail two improvements, one in the encoder and one in the decoder, and explain why they can further boost the captioning performance.

3.1 Review of spatial attention for image captioning

The basic deep learning-based image captioning approach generally feeds the image into a CNN for encoding, then runs this encoding through an RNN decoder to generate an output sentence. The model backpropagates based on the error of the output sentence compared with the ground-truth sentence, calculated by a loss function such as cross-entropy/maximum likelihood [Vinyals et al., 2015]. The performance of sentence generation can be further optimized via policy gradient when given a reward (e.g., the CIDEr score [Vedantam et al., 2015]) of the generated sentence [Liu et al., 2017b]. However, translating from a single image feature vector confuses the multiple visual patterns that correspond to the words in the predicted sentences, because the spatial dependencies in the 2D feature maps are totally lost. To solve this problem, the spatial attention method tries to build the correlation between a specific 2D region in an image and a word (or phrase) in a sentence to implement more accurate mappings.

The image captioning model takes an image $I$ and generates a caption $Y$, encoded as a sequence of vectors over a vocabulary of size $K$:

$$Y = \{\mathbf{y}_t \in \mathbb{R}^K \mid t = 1, \dots, T\}, \tag{1}$$

where $\mathbf{y}_t$ is the word probability vector at time step $t$, $K$ is the vocabulary size and $T$ is the length of the generated sentence.

A pre-trained CNN model is usually used as the encoder for image $I$. For spatial attention models, we would like to keep the spatial information of the image, so the final convolutional output has the shape $w \times h \times d$, where $w$, $h$ and $d$ are the width, the height and the number of channels of the feature map, respectively. This tensor can be conveniently reshaped into a feature map $\mathbf{V} \in \mathbb{R}^{L \times d}$, where $L = w \times h$. $\mathbf{V}$ can be considered as a set of feature vectors $\{\mathbf{V}_i \mid i = 1, \dots, L\}$, i.e., each row of $\mathbf{V}$ is a $d$-dimensional feature vector describing a small 2D region of image $I$.
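For concreteness, the reshaping from the convolutional output to the region feature matrix $\mathbf{V}$ can be sketched as below. This is a minimal PyTorch-style illustration; the batch dimension and the ResNet-101 shapes (from Section 4.2) are assumptions for the example.

```python
import torch

# Final convolutional feature map of the encoder, shape (B, d, w, h).
conv_out = torch.randn(1, 2048, 7, 7)
B, d, w, h = conv_out.shape

# Reshape to V with L = w * h region vectors, each of dimension d.
V = conv_out.flatten(2).transpose(1, 2)   # (B, L, d), here (1, 49, 2048)
assert V.shape == (B, w * h, d)
```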

The spatial attention in the decoder aims to obtain a context vector $\mathbf{v}^{(s)}_t \in \mathbb{R}^d$, which is weighted by a spatial attention vector $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_L] \in \mathbb{R}^L$ as follows:

$$\mathbf{v}^{(s)}_t = \frac{1}{L}\sum_{i=1}^{L} \alpha_i \mathbf{V}_i. \tag{2}$$

In image captioning, the spatial attention vector $\boldsymbol{\alpha}$ is determined by the contextual semantics of the sentence. Specifically, the temporal dependencies are mainly sketched by the hidden state of an RNN model. Following common practice, we use Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] as the RNN controller in the decoder. The LSTM learns the sequential state of a certain cell by using several non-linear mappings in the state-to-state and input-to-state transitions. For image captioning, the spatial attention vector $\boldsymbol{\alpha}$ is computed from a hidden RNN state $\mathbf{h}_{t-1}$ at time step $t-1$ and the feature map $\mathbf{V}$:

$$\mathbf{a} = \tanh\big((\mathbf{V}\mathbf{W}_s + \mathbf{b}_s) \oplus \mathbf{h}_{t-1}\mathbf{W}_{hs}\big), \qquad \boldsymbol{\alpha} = \mathrm{softmax}(\mathbf{a}\mathbf{W}_a + \mathbf{b}_a), \tag{3}$$

where $\mathbf{a}$ is the spatial score vector, which controls the location of semantically salient areas. $\mathbf{W}_s$, $\mathbf{W}_{hs}$ and $\mathbf{W}_a$ are mapping matrices that project the feature maps and the hidden state to the same dimension, $\mathbf{b}_s$ and $\mathbf{b}_a$ are bias vectors, and $\oplus$ is the broadcast adding operator.
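To make Eqs. (2)-(3) concrete, the following is a minimal PyTorch-style sketch of the spatial attention step. The class name, layer dimensions and batch layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Soft spatial attention over L region features, following Eqs. (2)-(3)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, att_dim=512):
        super().__init__()
        self.w_s = nn.Linear(feat_dim, att_dim)                  # V W_s + b_s
        self.w_hs = nn.Linear(hidden_dim, att_dim, bias=False)   # h_{t-1} W_hs
        self.w_a = nn.Linear(att_dim, 1)                         # a W_a + b_a

    def forward(self, V, h_prev):
        # V: (B, L, d) region features; h_prev: (B, hidden_dim) RNN hidden state
        a = torch.tanh(self.w_s(V) + self.w_hs(h_prev).unsqueeze(1))   # broadcast add, Eq. (3)
        alpha = F.softmax(self.w_a(a).squeeze(-1), dim=1)              # (B, L) spatial weights
        v_s = (alpha.unsqueeze(-1) * V).mean(dim=1)                    # (1/L) sum_i alpha_i V_i, Eq. (2)
        return v_s, alpha
```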

From the workflow of spatial attention-based image captioning, we make two important observations:

• At each time step, the hidden state $\mathbf{h}_{t-1}$ of the LSTM controls the gaze at only a specific grid cell, which approximates a one-hot mask, rather than a larger receptive field of the input image $I$ (see the softmax operation in Eq. (3)). Such a setting essentially hurts the mapping from the visual properties to the word in the generated sentence, because a small grid cell is usually insufficient to describe the object or stuff identity, and the softmax rarely produces less peaky weights that attend to larger 2D receptive fields. Furthermore, if the spatial attention gives a "bad" position of the image region, the decoder is very likely to mismatch the visual-semantic relations.

• The spatial attention is unable to prepare the best feature representation beforehand because it only focuses on "where to gaze" in the whole image at each time step. In the channel dimension, it fails to select the most useful components of the feature vector and suppress the less useful ones when given the contextual information of the sentence, i.e., the spatial attention is unable to better "control" the current prediction in a partial sentence.


To summarize the above observations, the visual-semantic inconsistencies are related to the scales of attention. At the same time, the feature map in the channel dimension can be better represented to meet the requirements of word generation. In the following three subsections, we first give the details of the design of pyramid feature maps that generate better attention regions, then describe a dual attention mechanism that re-calibrates the importance of the feature vector in the channel dimension by considering the hidden state of the LSTM, and finally illustrate the overview of the proposed learning framework.

3.2 Pyramid attention

We introduce the pyramid attention module to capture multi-scale spatial properties in images. In a deep neural network, the size of the receptive field roughly indicates how much information can be used to identify an object (e.g., a car or a person), some stuff (e.g., grass or river), or a composite visual pattern. The empirical receptive field of the convolutional output of a CNN is much smaller on high-level layers because the information is distilled, i.e., most of the visual properties are filtered out by the convolution operations. As a result, attention over a very small receptive field cannot sufficiently incorporate the important prior information carried by a larger receptive field. We address this issue by using the pyramid representation prior with multi-scale feature maps.

In a convolutional feature representation, an image is equally segmented into $w \times h$ grid cells, each of which describes a small region of the image. To produce multiple vectorized features at different scales of the receptive field, we add extra average pooling modules that form hierarchical feature maps by adopting pooling kernels of varying sizes. The coarsest level of average pooling generates the largest sub-regions of the image to attend to, and each following pyramid level separates the feature map into smaller areas. The pyramid feature maps thus describe all possible sub-regions with varied sizes to attend to for a better description. The number of pyramid levels and the size of each level in the encoder can be modified, and they are related to the size of the feature map that is fed into the decoder. Different from the settings in [Zhao et al., 2017], which concatenates all pyramid feature maps in the channel dimension and keeps the resolution unchanged, we simply increase the number of 2D regions at multiple scales in the encoder to generate rich feature maps. The multi-scale pooling kernels should maintain a reasonable gap in feature representation. In our setting, the pyramid attention has three levels with bin sizes $1 \times 1$ (original), $2 \times 2$ and $4 \times 4$, respectively. In the extreme case, global average pooling describes an image as a single visual feature vector, which is equivalent to the "no attention" model for image captioning [Vinyals et al., 2015]. Extending the spatial attention to multi-scale pyramid attention can enrich the visual feature representation and sketch the hierarchical semantic pattern structures. Even if some synthesized patterns are completely useless for sentence generation, they can still be well suppressed by the RNN controller through very low feature weights.
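A minimal sketch of how the pyramid feature maps can be built from a convolutional output is given below. The stride-1, no-padding average pooling with $2 \times 2$ and $4 \times 4$ kernels follows the implementation details in Section 4.2; the function name and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pyramid_feature_maps(conv_feat, kernel_sizes=(1, 2, 4)):
    """Stack region features pooled at multiple scales.

    conv_feat: (B, d, w, h) final convolutional output, e.g. (B, 2048, 7, 7).
    Returns a (B, L_total, d) matrix of multi-scale region vectors.
    """
    levels = []
    for k in kernel_sizes:
        if k == 1:
            pooled = conv_feat                                      # original 7x7 grid
        else:
            # stride 1, no padding: 2x2 -> 6x6 regions, 4x4 -> 4x4 regions
            pooled = F.avg_pool2d(conv_feat, kernel_size=k, stride=1)
        levels.append(pooled.flatten(2).transpose(1, 2))            # (B, L_k, d)
    return torch.cat(levels, dim=1)                                 # concatenate along regions
```

Because the attention weights in Eq. (3) depend only on the RNN hidden state and each region vector, simply concatenating the extra pooled regions along the region axis is enough for the decoder to attend at multiple scales.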

Recently, the bottom-up attention implemented by a pre-trained object detector has been shown to boost captioning performance thanks to the more accurate bounding boxes compared with equal-sized grid cells [Anderson et al., 2018]. For general image captioning purposes, such bottom-up attention with the help of auxiliary models can effectively improve the descriptive power of the features. However, each visual feature vector in the bottom-up attention representation is no longer spatially indicative. Even when the spatial properties collapse, the pyramid attention still applies. Note that the order of the local feature maps in the 2D feature representation can be arbitrarily changed without affecting the spatial attention, because the attention weights depend only on the RNN hidden state and the local feature representations. In this case, the pyramid attention becomes a visual feature synthesis that describes composite visual patterns, which is similar to the scenario of the spatial-semantic search method proposed in [Mai et al., 2017]. In the remainder of this article, we simply refer to the attention on pyramid feature maps as pyramid attention (P-attention).

3.3 Dual attention

The spatial attention in Eq. (2) requires the visual feature map $\mathbf{V}$ and the hidden state $\mathbf{h}_{t-1}$ to calculate the weights in the width and height dimensions. However, the context vector $\mathbf{v}^{(s)}_t$ is just a linear combination of $\mathbf{V}_i$ for $i = 1, \dots, L$, in which the channel dimension is unchanged. In the convolutional feature representation, each channel vector in a 2D grid cell or an object bounding box can be regarded as a word response, and different semantic responses are mutually associated with each other. Modelling the inter-dependencies among the 2D feature maps can improve the feature representation of specific semantics. Hence, to better represent the context vector and generate more accurate and smooth natural sentences, we enhance the feature map by introducing a dual attention module, which not only focuses on "where to gaze" from the spatial perspective, but also re-calibrates the importance of the feature components to improve the discriminative power.

To discover the channel-wise dependency of the feature representation, we aim to learn a channel weight vector $\boldsymbol{\beta} = [\beta_1, \dots, \beta_d] \in \mathbb{R}^d$ to re-calibrate the feature map $\mathbf{V}$ and thus form a channel context vector:

$$\mathbf{v}^{(c)}_t = \boldsymbol{\beta} \odot \frac{1}{L}\sum_{i=1}^{L} \mathbf{V}_i. \tag{4}$$

Given a hidden state $\mathbf{h}_{t-1}$ of the RNN at time step $t-1$, the channel-wise score $\mathbf{c}$ and the channel weight vector $\boldsymbol{\beta}$ are computed in a similar way to the spatial attention:

$$\mathbf{c} = \tanh\big((\mathbf{W}_c\mathbf{V} + \mathbf{b}_c) \oplus \mathbf{W}_{hc}\mathbf{h}_{t-1}\big), \qquad \boldsymbol{\beta} = \mathrm{sigmoid}(\mathbf{W}_b\mathbf{c} + \mathbf{b}_b), \tag{5}$$

where $\mathbf{c}$ is the channel attention score vector, $\mathbf{W}_c$, $\mathbf{W}_{hc}$ and $\mathbf{W}_b$ are learnable mapping matrices, and $\mathbf{b}_c$, $\mathbf{b}_b$ are bias vectors, respectively. Note that we use the sigmoid activation function to re-weight the visual feature maps, rather than a softmax over the channel dimension. This is mainly because the channel dimensions work collaboratively with each other as a joint feature representation, unlike the "one-hot"-style weights used to pick a small 2D region of the image. Such a setting is similar to the self-attention used in [Shen et al., 2018; Hu et al., 2018].


By applying the two attention mechanisms, the final context vector $\mathbf{v}_t$ at time step $t$, used to predict the word probability vector $\mathbf{y}_t$, is simply the summation of the spatial context vector $\mathbf{v}^{(s)}_t$ and the channel context vector $\mathbf{v}^{(c)}_t$, i.e.,

$$\mathbf{v}_t = \mathbf{v}^{(s)}_t + \mathbf{v}^{(c)}_t. \tag{6}$$

When both the spatial and channel attentions are applied to the image feature representations, we refer to this attention module as dual attention (D-attention).
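The channel branch of Eqs. (4)-(5) can be sketched as follows; combined with the SpatialAttention sketch above, the two context vectors are summed as in Eq. (6). For simplicity the feature map is mean-pooled over regions before scoring each channel, which is a simplifying assumption about how $\mathbf{W}_c\mathbf{V}$ is aggregated; the class name and dimensions are also illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel re-calibration conditioned on the RNN state, in the spirit of Eqs. (4)-(5)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, att_dim=512):
        super().__init__()
        self.w_c = nn.Linear(feat_dim, att_dim)                   # W_c (.) + b_c
        self.w_hc = nn.Linear(hidden_dim, att_dim, bias=False)    # W_hc h_{t-1}
        self.w_b = nn.Linear(att_dim, feat_dim)                   # W_b c + b_b

    def forward(self, V, h_prev):
        # V: (B, L, d) region features; h_prev: (B, hidden_dim)
        v_bar = V.mean(dim=1)                                     # (1/L) sum_i V_i
        c = torch.tanh(self.w_c(v_bar) + self.w_hc(h_prev))       # channel score
        beta = torch.sigmoid(self.w_b(c))                         # sigmoid, not softmax
        return beta * v_bar                                       # v_t^(c), Eq. (4)
```

In use, the dual-attention context vector of Eq. (6) is simply the sum of the two branch outputs, e.g., `v_t = spatial_att(V, h)[0] + channel_att(V, h)`.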

Note that the dual attention scheme in our model differs from SCA-CNN [Chen et al., 2017] in two respects: (1) the channel-wise attention and spatial attention in SCA-CNN are computed sequentially, which causes information loss, in a way analogous to the difference between VGGNet and ResNet; our model computes the two attentions in parallel, which better preserves the most discriminative feature components; (2) the channel-wise attention in SCA-CNN is activated by a softmax function, the same as the spatial attention in their paper. Such a setting is inappropriate because softmax behaves much like a one-hot encoding, which negatively affects the feature representation, so we apply the sigmoid function in the channel-wise attention. In the experimental part of this paper, we empirically show that our dual attention achieves much better results than SCA-CNN.

3.4 System overview

We now introduce how to equip a single image captioning model with the pyramid attention and dual attention modules. The performance improvement benefits from the two modules, while only a limited computational overhead is required.

The attention model for image captioning is illustrated in Figure 2. Given an input image (a) and a sentence description (b), we use a pre-trained CNN (e.g., ResNet-101 [He et al., 2016]) or an object detector (e.g., the bottom-up visual features [Anderson et al., 2018]) to extract the convolutional feature maps (c), and a word embedding layer for text representation. To better learn the word-sequence prediction, the visual feature maps are augmented by pyramid pooling (d). With the help of the LSTM hidden state, the spatial and channel attentions are computed in parallel, forming a spatial context vector (e) and a channel context vector (f). By summing up the two vectors, they are fused into a final context vector for the word prediction in a partial sentence (g).

The overall image captioning architecture in our work is based on the top-down attention model [Anderson et al., 2018], which contains two separate LSTMs: an attention LSTM and a language LSTM. In the sentence prediction, the input vector to the attention LSTM at each time step consists of the previous output of the language LSTM, concatenated with the global average-pooled image feature $\bar{\mathbf{V}} = \frac{1}{L}\sum_{i=1}^{L}\mathbf{V}_i$ and the encoding of the previous word:

$$\mathbf{x}^1_t = [\mathbf{h}^2_{t-1}, \bar{\mathbf{V}}, \mathbf{W}_e \mathbf{q}_t], \tag{7}$$

where $\mathbf{W}_e$ is a word embedding matrix and $\mathbf{q}_t$ is the one-hot encoding of the input word at time step $t$.

These inputs provide the attention LSTM with the maximal contextual information regarding the state of the language LSTM, the global feature of the image, and the partial sentence, respectively. The hidden state of the attention LSTM at the current time step $t$ is then computed as:

$$\mathbf{h}^1_t = \mathrm{LSTM}(\mathbf{x}^1_t, \mathbf{h}^2_{t-1}). \tag{8}$$

Given the hidden state $\mathbf{h}^1_t$, the spatial context vector $\mathbf{v}^{(s)}_t$ or the dual-attention context vector $\mathbf{v}_t$ can be computed by Eq. (2) or Eq. (6), respectively (replacing $\mathbf{h}_{t-1}$ with $\mathbf{h}^1_t$ in Eq. (3) and Eq. (5)).

The input of the language LSTM is the concatenation of the context image feature and the output of the attention LSTM:

$$\mathbf{x}^2_t = [\mathbf{v}_t, \mathbf{h}^1_t], \tag{9}$$

and the hidden state output of the language LSTM becomes:

$$\mathbf{h}^2_t = \mathrm{LSTM}(\mathbf{x}^2_t, \mathbf{h}^1_t). \tag{10}$$

Using the notation $Y$ in Eq. (1), at each time step $t$ the conditional distribution over possible words is computed by:

$$p(\mathbf{y}_t \mid \mathbf{h}^2_t) = \mathrm{softmax}(\mathbf{W}_p \mathbf{h}^2_t + \mathbf{b}_p), \tag{11}$$

where $\mathbf{W}_p$ and $\mathbf{b}_p$ are trainable weights and biases, respectively. The word distribution over the complete sentence is calculated as the product of conditional distributions:

$$p(Y) = \prod_{t=1}^{T} p(\mathbf{y}_t \mid \mathbf{h}^2_t). \tag{12}$$
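Putting Eqs. (7)-(11) together, one decoding step of the two-LSTM top-down decoder with dual attention can be sketched as below. It reuses the SpatialAttention and ChannelAttention classes from the earlier sketches; the class name, hidden sizes and state handling are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TopDownDualAttentionStep(nn.Module):
    """One decoding step: Eqs. (7)-(11) with dual-attention fusion (Eq. (6))."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=1024, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.spatial_att = SpatialAttention(feat_dim, hidden_dim)   # from the earlier sketch
        self.channel_att = ChannelAttention(feat_dim, hidden_dim)   # from the earlier sketch
        self.fc_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, V, word_ids, state1, state2):
        # V: (B, L, d) pyramid region features; word_ids: (B,) previous word indices
        # state1 / state2: (h, c) tuples of the attention LSTM and the language LSTM
        v_mean = V.mean(dim=1)                                            # global average-pooled feature
        x1 = torch.cat([state2[0], v_mean, self.embed(word_ids)], dim=1)  # Eq. (7)
        h1, c1 = self.att_lstm(x1, state1)                                # Eq. (8)
        v_s, _ = self.spatial_att(V, h1)                                  # spatial context vector
        v_c = self.channel_att(V, h1)                                     # channel context vector
        v_t = v_s + v_c                                                   # Eq. (6)
        x2 = torch.cat([v_t, h1], dim=1)                                  # Eq. (9)
        h2, c2 = self.lang_lstm(x2, state2)                               # Eq. (10)
        logits = self.fc_out(h2)                                          # Eq. (11), before softmax
        return logits, (h1, c1), (h2, c2)
```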

Assume the whole trainable parameter set is $\theta$. When given a reference sentence represented by $y^*_{1:T}$, the most straightforward way to optimize the captioning model is to minimize the cross-entropy loss of each individual word:

$$L_{CE}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(y^*_t \mid y^*_{1:t-1}). \tag{13}$$

Though training with the cross-entropy loss enables fully differentiable optimization by backpropagation, the training objective is inconsistent with the language evaluation metrics (e.g., the CIDEr score). Furthermore, it creates a schism between training and evaluation, because during inference the model has no access to the previous ground-truth token, which leads to cascading errors and biased semantics. So after the cross-entropy optimization, we can use reinforcement learning to minimize the negative expected reward:

$$L_{CD}(\theta) = -\mathbb{E}_{y_{1:T}\sim p_\theta}[r(y_{1:T})], \tag{14}$$

where $r(\cdot)$ is the reward function. It has been shown that using the CIDEr score in Self-Critical Sequence Training (SCST) [Rennie et al., 2017] can effectively generate sentences of higher quality. The approximate gradient is computed as follows:

$$\nabla_\theta L_{CD}(\theta) \approx -\big(r(y^s_{1:T}) - r(\hat{y}_{1:T})\big)\nabla_\theta \log p_\theta(y^s_{1:T}), \tag{15}$$

where $y^s_{1:T}$ is a caption sampled from the beam search and $r(\hat{y}_{1:T})$ is the baseline reward calculated by greedy decoding.
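A minimal sketch of the self-critical loss of Eqs. (14)-(15) is given below; the function name and argument layout are assumptions for illustration.

```python
import torch

def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical sequence training loss (Eqs. (14)-(15)).

    sample_log_probs: (B,) sum of log p_theta over the tokens of each sampled caption.
    sample_reward / greedy_reward: (B,) CIDEr rewards of the sampled and greedy captions.
    """
    advantage = (sample_reward - greedy_reward).detach()   # baseline-subtracted reward
    return -(advantage * sample_log_probs).mean()          # gradient matches Eq. (15)
```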


Figure 2: The learning framework of the dual attention on pyramid feature maps for image captioning.

Fine-tuning the captioning model with reinforcement learning focuses each prediction step on achieving the best score for the overall sentence and explores the space of captions by sampling from the policy, whose gradient tends to increase the probability of sampled captions with a higher CIDEr score. Thus, after a few epochs of fine-tuning, the CIDEr score can be significantly increased, which also benefits all other evaluation metrics.

4 Experiments

We describe the experimental settings and results, and qualitatively and quantitatively validate the effectiveness of the proposed dual attention on pyramid feature maps for image captioning, by answering the following two questions. Q1: Are pyramid attention and dual attention effective in preparing better visual feature representations for content description in images? Q2: How does the performance of our method compare to other state-of-the-art image captioning methods in a single model?

4.1 Datasets

We report the experimental results on three widely-used benchmarks. (1) Flickr8K [Hodosh et al., 2013]: this dataset comprises 8,000 images in total and is split into 6,000, 1,000, and 1,000 images for training, validation and testing, respectively. (2) Flickr30K [Young et al., 2014]: this is a larger dataset with 25,381, 3,000 and 3,000 images for training, validation and testing, respectively. In both Flickr datasets, each image is manually annotated with 5 sentences, and we report the results according to the standard splits. (3) MS COCO [Lin et al., 2014]: this is the largest image captioning dataset as far as we know, containing 164,042 images in total. As the ground truth of the test set is withheld by the organizers, we followed [Karpathy and Fei-Fei, 2015] and used the 82,783 training images to learn the model, 5,000 images for validation and another 5,000 images for testing, respectively.

4.2 Implementation details and computational complexities

In our implementation, we used two types of visual features as image representations: the final convolutional output of ResNet-101 [He et al., 2016] and the bottom-up features computed by a pre-trained object detector [Anderson et al., 2018] on the MS COCO dataset. On the Flickr8K and Flickr30K datasets, we only used the convolutional feature outputs. For the feature output of ResNet-101, when the input RGB image has resolution $224 \times 224$, the feature map output has size $7 \times 7 \times 2048$, i.e., the image is equally segmented into 49 grid cells, each of which describes a specific region of the image. To produce the multiple vectorized feature representations at different scales of the receptive field, we added a $2 \times 2$ and a $4 \times 4$ average pooling, both conducted on the original convolutional output with stride 1 and no padding. Thus, the two extra pooling operations give a $6 \times 6 \times 2048$ and a $4 \times 4 \times 2048$ feature map, respectively. On the pyramid representation of convolutional features, we applied separate dense mappings with ReLU activations to reduce the feature dimensionality to 1,024. On the MS COCO dataset, we also used the bottom-up features provided by [Anderson et al., 2018], with the fixed $36 \times 2048$ feature map size. For language processing, we used a word embedding layer to map each word in a description sentence to a 1,024-d vector.


In both the top-down attention LSTM and the language LSTM, the sizes of the hidden units were set to 512 without any change.

To observe the performance gains of P-attention and D-attention, we conducted the experiments using the two proposed methods separately. After that, we combined them into a single model (P+D attention) to test its effectiveness in improving the quality of generated sentences. All three attention models were trained with stochastic gradient descent using the AdamW optimizer [Loshchilov and Hutter, 2019]. We followed [Luo et al., 2018] and added a discriminability loss when training the image captioning models to improve the quality of the resulting sentences. The batch size was set to 96, which fits well within the memory of a single Titan Xp GPU card. On the Flickr8K and Flickr30K datasets, we only conducted the cross-entropy optimization with convolutional features to test the effectiveness of the proposed methods. On the MS COCO dataset, we used a two-stage training procedure. At the first stage, we used the BLEU-4 score on the validation set for model selection while minimizing the cross-entropy loss, which finished within 60 epochs. At the second stage, we fine-tuned the captioning model by optimizing Eq. (14) for 30 epochs. In the inference procedure, we used beam search (with beam size 5) to generate the best natural sentences.

The visual feature maps of the images were all pre-computed and cached to accelerate the training procedure. Assuming we use the ResNet-101 feature map of size $7 \times 7 \times 2048$, a fixed sentence length of 50, and a vocabulary size of 8,000, imposing the P-attention only leads to a very marginal overhead in terms of parameters and FLOPs, while the D-attention mainly increases the FLOPs. The numbers of trainable parameters and FLOPs are summarized in Table 1. Although applying the two attentions separately or building a unified model (P+D attention) incurs a higher computational complexity in the training process, it still fits most GPUs well and can be optimized efficiently.

Table 1: Computational complexities.

Model                Params   FLOPs
Top-down attention   29.5M    1.16G
P-attention          29.6M    1.99G
D-attention          31.8M    2.68G
P+D attention        31.9M    5.32G

4.3 Evaluation metrics

We use BLEU scores (B@1, B@2, B@3 and B@4) [Papineni et al., 2002] without the brevity penalty to evaluate image caption generation. Due to the criticism of BLEU, we also report the METEOR score (MT) [Banerjee and Lavie, 2005]. All of the above metrics were evaluated with the NLTK toolkit¹. For the evaluation on the MS COCO dataset, we also use ROUGE-L (RG-L) [Lin, 2004] and CIDEr [Vedantam et al., 2015], because they can better measure the consistency between n-gram occurrences in generated sentences and references.

¹ https://www.nltk.org
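For reference, corpus-level BLEU can be computed with NLTK as sketched below; the tokenized caption and reference are made-up examples. Note that NLTK's corpus_bleu applies the brevity penalty by default, whereas the scores above are reported without it.

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis caption and its reference set (tokenized); values are illustrative.
references = [[["a", "man", "is", "riding", "on", "a", "motorcycle"]]]
hypotheses = [["a", "man", "riding", "a", "motorcycle"]]

b1 = corpus_bleu(references, hypotheses, weights=(1.0, 0.0, 0.0, 0.0))          # B@1
b4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))      # B@4
print(f"B@1 = {b1:.3f}, B@4 = {b4:.3f}")
```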

4.4 Qualitative results

We first show a P-attention result in Figure 3. Although the saliency regions only show rough attention on the original image, they provide a broader range of feature maps to attend to. In this example, attention on multi-scale feature maps not only sketches the most salient object, a brown dog, but also describes the details: a tennis ball in its mouth.

In Table 2, we give some example captioning results of the proposed attention models. Comparing the sentences generated by the proposed methods with the ground-truth sentences, all three of our attention methods can well describe the most important spatial and semantic information of the images, consistent with human cognition. The pyramid attention can generally sketch the hierarchical visual properties and identify the important entities, while the dual attention can give more detailed and accurate descriptions of the entities. The two types of attention are complementary to each other, so the joint learning framework always outperforms either of the two single attention models.

4.5 Quantitative results

We summarize the statistical results of the proposed P-attention, D-attention and P+D attention methods, and compare them with state-of-the-art image captioning models in Table 3 (Flickr8K and Flickr30K) and Table 4 (MS COCO). In the two tables, we use bold font to emphasize the best results.

Observing the results in Table 3 obtained by the two separate attentions introduced in Sections 3.2 and 3.3, both the pyramid attention and the dual attention achieve much better performance on the Flickr8K and Flickr30K datasets. Specifically, on the Flickr8K dataset, the joint attention model (P+D attention) outperforms the second-best method by 1.8 / 9.6 / 13.5 / 15.6 / 8.5 percent in terms of BLEU-1, BLEU-2, BLEU-3, BLEU-4 and METEOR, respectively. On the larger Flickr30K dataset, the joint attention model (P+D attention) outperforms the second-best method by 0.6 / 8.5 / 13.6 / 17.1 / 5.7 percent in terms of the five evaluation metrics, respectively. On this dataset, simply applying the dual attention already obtains very high BLEU and METEOR scores. The composite model with P+D attention generally performs better than either of the two single attentions.

On the large-scale MS COCO dataset, we separately report the model performance comparisons using convolutional features and bottom-up features. We also report the evaluation results of both cross-entropy optimization and CIDEr fine-tuning. When using convolutional features, both pyramid attention and dual attention effectively improve the baseline (Top-down). After fine-tuning, the performance is further boosted.

The most recent image captioning models [Anderson et al., 2018; Qin et al., 2019; Yao et al., 2019; Huang et al., 2019a] adopt the bottom-up attention instead of traditional convolutional features to accurately implement the visual-semantic mapping. With the help of a pre-trained object detector that better attends to the visual feature maps within the bounding boxes, these models achieve a noteworthy improvement over the captioning models that use convolutional feature maps.


Figure 3: A visualization of a pyramid attention result.

Table 2: Examples of image captioning with the proposed methods (using convolutional features).

Image 1
  P-attention: two giraffes standing next to each other near rocks.
  D-attention: two giraffes standing next to each other near rocks.
  P+D attention: a group of giraffe standing next to each other.
  GT: four giraffes walking along a path in their enclosure.

Image 2
  P-attention: two people riding on the back of an elephant.
  D-attention: a group of people riding on the backs of elephants.
  P+D attention: three people riding on the back of an elephant.
  GT: three people are sitting on elephants with a little chair.

Image 3
  P-attention: a small airplane sitting on top of an airport runway.
  D-attention: a small red plane sitting on top of an airport runway.
  P+D attention: a small red airplane sitting on top of an airport tarmac.
  GT: a large red air plane is parked with a man standing near.

Image 4
  P-attention: a person riding skis down a snow covered slope.
  D-attention: a couple of people riding skis down a snow covered slope.
  P+D attention: a man riding skis down a snow covered slope.
  GT: young skiers enjoying a run on a gentle slope.

However, this requires that the training sources for detection and description be homogeneous. In specific cases where the two data sources are heterogeneous, e.g., when the dataset used to train the object detector contains no person while the target captioning dataset is mainly about human action description, the bottom-up attention will fail.

In our experiments, the pyramid attention on bottom-up features can be considered as multi-scale semantic aggregation, which slightly improves the performance compared to the up-down method proposed in [Anderson et al., 2018]. The dual attention, however, is more effective at preparing better visual feature representations for image captioning. Our composite attention method, i.e., the P+D attention model trained with bottom-up features, achieved very competitive results. Note that our methods P-attention, D-attention and P+D attention are highly modular and can be integrated into a variety of image captioning frameworks. In this work, our models are based on the up-down model, while all other comparison methods improve attention-based image captioning from different perspectives. Thus, the performance of the proposed image captioning model could be further improved with more advanced settings, for example, the "look back and predict forward" strategy [Qin et al., 2019]. In our experiments, we have mainly shown that the proposed two attention variants, P-attention and D-attention, as well as the joint learning approach P+D attention, are able to augment the model capacity by fully exploring the visual-semantic relationships between visual image features and natural languages.

5 Conclusion

We have presented a novel learning framework that applies dual attention on pyramid feature maps for image captioning. Different from the attention model in [Huang et al., 2019a], which mainly focuses on language generation, we explore visual feature representations for image captioning. The P-attention model attends to multi-scale receptive fields in an image, which is a reasonable way to improve the visual feature representation by enriching the feature maps.


Table 3: Performance comparison with the state of the art on the Flickr8K and Flickr30K datasets. All image captioning models are trained on convolutional features.

Flickr8K
Method                                              B@1   B@2   B@3   B@4   MT
Soft-attention [Xu et al., 2015]                    67.0  44.8  29.9  19.5  18.9
Hard-attention [Xu et al., 2015]                    67.0  45.7  31.4  21.3  20.3
SCA-CNN [Chen et al., 2017]                         68.2  49.6  35.9  25.8  22.4
Bi-LSTM [Wang et al., 2016]                         65.5  46.8  32.0  21.5  -
P-attention (Ours)                                  69.1  57.5  47.5  39.5  29.4
D-attention (Ours)                                  69.8  58.5  48.7  40.7  30.2
P+D attention (Ours)                                70.0  59.2  49.4  41.4  30.9

Flickr30K
Method                                              B@1   B@2   B@3   B@4   MT
Soft-attention [Xu et al., 2015]                    66.7  43.4  28.8  19.1  18.5
Hard-attention [Xu et al., 2015]                    66.9  43.9  29.6  19.9  18.5
ATT-FCN [You et al., 2016]                          64.7  46.0  32.4  23.0  18.9
SCA-CNN [Chen et al., 2017]                         66.2  46.8  32.5  22.3  19.5
Bi-LSTM [Wang et al., 2016]                         62.1  42.6  28.1  19.3  -
Saliency+Context Attention [Cornia et al., 2018]    61.5  43.8  30.5  21.3  20.0
Attention Correctness [Liu et al., 2017a]           -     -     38.0  28.1  23.0
Language CNN [Gu et al., 2017]                      73.8  56.3  41.9  30.7  22.6
Adaptive attention [Lu et al., 2017]                67.7  49.4  35.4  25.1  20.4
Att2in+RD [Guo et al., 2019]                        -     -     -     -     26.0
hLSTMat [Gao et al., 2019]                          73.8  55.1  40.3  29.4  23.0
P-attention (Ours)                                  72.3  62.7  53.3  26.0  28.1
D-attention (Ours)                                  76.5  64.2  53.3  44.6  28.2
P+D attention (Ours)                                74.4  64.8  55.5  47.8  28.7

Table 4: Performance comparison with the state of the art on the MS COCO dataset. All methods are evaluated in the single-model mode. The methods marked with † are trained with the bottom-up attention, i.e., the visual features are obtained by an auxiliary object detection model.

                                                          Cross-entropy optimization       CIDEr fine-tuning
Method                                                    B@1   B@4   MT    RG-L  CIDEr   B@1   B@4   MT    RG-L  CIDEr
SCA-CNN [Chen et al., 2017]                               71.9  31.1  25.0  -     -       -     -     -     -     -
Adaptive attention [Lu et al., 2017]                      74.2  33.2  26.6  -     108.5   -     -     -     -     -
Semantic guidance [Zhang et al., 2018b]                   71.2  26.5  24.7  -     88.2    -     -     -     -     -
Att2in+RD [Guo et al., 2019]                              -     34.3  26.4  55.2  106.1   -     35.2  27.0  56.7  115.8
SCST [Rennie et al., 2017]                                -     -     -     -     -       -     31.3  26.0  54.3  101.3
Top-down [Anderson et al., 2018]                          74.5  33.4  26.1  54.4  105.4   76.6  34.0  26.5  54.9  111.1
P-attention (Ours)                                        75.1  33.8  26.0  54.2  105.1   77.7  35.6  27.2  56.5  113.3
D-attention (Ours)                                        75.7  34.7  25.8  54.7  106.7   78.6  35.9  27.7  57.6  114.0
P+D attention (Ours)                                      76.4  35.5  26.2  54.8  108.4   78.3  36.5  27.3  57.8  114.4
Up-down [Anderson et al., 2018]†                          77.2  36.2  27.0  56.4  113.5   79.8  36.3  27.7  56.9  120.1
Object relationship [Li and Jiang, 2019]†                 76.7  33.8  26.2  54.9  96.5    79.2  36.3  27.6  56.8  120.2
hLSTMat [Gao et al., 2019]†                               -     -     -     -     -       -     37.5  28.5  58.2  125.6
GCN-LSTM [Yao et al., 2018]†                              77.4  37.1  28.1  57.2  117.1   80.9  38.3  28.6  58.5  128.7
Attention on Attention [Huang et al., 2019a]†             77.4  37.2  28.4  57.5  119.8   80.2  38.9  29.2  58.8  129.8
Up-down + HIP [Yao et al., 2019]†                         -     37.0  28.1  57.1  116.6   -     39.1  28.9  59.2  130.6
Up-down + RD [Guo et al., 2019]†                          -     36.7  27.8  56.8  114.5   -     37.8  28.2  57.9  125.3
Adaptive attention time [Huang et al., 2019b]†            -     37.0  28.1  57.3  117.2   -     38.7  28.6  58.5  128.6
Global-Local Discriminative Objective [Wu et al., 2020]†  -     -     -     -     -       78.8  36.1  27.8  57.1  121.1
LBPF [Qin et al., 2019]†                                  77.8  37.4  28.1  57.5  116.4   80.5  38.3  28.5  58.4  127.6
P-attention (Ours)†                                       77.1  36.5  27.4  56.6  114.3   79.3  38.5  28.8  58.1  124.0
D-attention (Ours)†                                       77.3  36.9  28.5  57.2  116.4   80.7  39.8  29.4  58.8  129.8
P+D attention (Ours)†                                     77.8  37.3  28.6  57.6  117.3   79.8  39.4  30.1  59.4  129.5

The proposed D-attention model, on the other hand, can better leverage the channel-wise and spatial-wise features from two different perspectives. Both the P-attention and D-attention models can boost the image captioning performance, and by jointly applying the two proposed modules to form a unified learning framework, our P+D attention model achieves state-of-the-art performance as a single captioning model. We believe that our work can also benefit the research community of visual-semantic understanding in other learning tasks such as visual question answering (VQA) [Gao et al., 2018] and video captioning [Xu et al., 2017a], and we will further explore the potential of the proposed method in terms of both network structure and other application fields.


References

[Anderson et al., 2018] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.

[Aneja et al., 2018] Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing. Convolutional image captioning. In CVPR, pages 5561–5570, 2018.

[Banerjee and Lavie, 2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, pages 65–72, 2005.

[Chen et al., 2017] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, pages 5659–5667, 2017.

[Cheng, 2019] Yong Cheng. Agreement-based joint training for bidirectional attention-based neural machine translation. In Joint Training for Neural Machine Translation, pages 11–23. 2019.

[Cornia et al., 2018] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Paying more attention to saliency: Image captioning with saliency and context attention. ACM Trans. on Multimedia Computing, Communications, and Applications, 14(2):1–21, 2018.

[Cui et al., 2018] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. Learning to evaluate image captioning. In CVPR, pages 5804–5812, 2018.

[Fu et al., 2019] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, pages 3146–3154, 2019.

[Gao et al., 2018] Lianli Gao, Pengpeng Zeng, Jingkuan Song, Xianglong Liu, and Heng Tao Shen. Examine before you answer: Multi-task learning with adaptive-attentions for multiple-choice VQA. In MM, pages 1742–1750, 2018.

[Gao et al., 2019] Lianli Gao, Xiangpeng Li, Jingkuan Song, and Heng Tao Shen. Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2019.

[Gu et al., 2017] Jiuxiang Gu, Gang Wang, Jianfei Cai, and Tsuhan Chen. An empirical study of language CNN for image captioning. In ICCV, pages 1222–1231, 2017.

[Guo et al., 2019] Longteng Guo, Jing Liu, Shichen Lu, and Hanqing Lu. Show, tell and polish: Ruminant decoding for image captioning. IEEE Trans. on Multimedia, 22(8):2149–2162, 2019.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Hodosh et al., 2013] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.

[Hu et al., 2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.

[Huang et al., 2019a] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In ICCV, 2019.

[Huang et al., 2019b] Lun Huang, Wenmin Wang, Yaxian Xia, and Jie Chen. Adaptively aligned image captioning via adaptive attention time. In NIPS, pages 8942–8951, 2019.

[Karpathy and Fei-Fei, 2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015.

[Li and Jiang, 2019] Xiangyang Li and Shuqiang Jiang. Know more say less: Image captioning based on scene graphs. IEEE Trans. on Multimedia, 21(8):2117–2130, 2019.

[Li et al., 2016] Kai Li, Guo-Jun Qi, Jun Ye, and Kien A. Hua. Linear subspace ranking hashing for cross-modal retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence, 39(9):1825–1838, 2016.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.

[Lin et al., 2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.

[Lin, 2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In ACL Workshop, pages 74–81, 2004.

[Liu et al., 2017a] Chenxi Liu, Junhua Mao, Fei Sha, and Alan Yuille. Attention correctness in neural image captioning. In AAAI, 2017.

[Liu et al., 2017b] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In CVPR, pages 873–881, 2017.

[Loshchilov and Hutter, 2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

[Lu et al., 2017] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, pages 375–383, 2017.

[Luo et al., 2018] Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. Discriminability objective for training descriptive captions. In CVPR, pages 6964–6974, 2018.

[Mai et al., 2017] Long Mai, Hailin Jin, Zhe Lin, Chen Fang, Jonathan Brandt, and Feng Liu. Spatial-semantic image search by visual feature synthesis. In CVPR, pages 4718–4727, 2017.

[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.

[Qin et al., 2019] Yu Qin, Jiajun Du, Yonghua Zhang, and Hongtao Lu. Look back and predict forward in image captioning. In CVPR, pages 8367–8375, 2019.

[Rennie et al., 2017] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, pages 7008–7024, 2017.

[Shen et al., 2018] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In AAAI, 2018.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.

[Vedantam et al., 2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, pages 4566–4575, 2015.

[Vinyals et al., 2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.

[Wang et al., 2016] Cheng Wang, Haojin Yang, Christian Bartz, and Christoph Meinel. Image captioning with deep bidirectional LSTMs. In MM, pages 988–997, 2016.

[Wang et al., 2017] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Adversarial cross-modal retrieval. In MM, pages 154–162, 2017.

[Wu et al., 2020] Jie Wu, Tianshui Chen, Hefeng Wu, Zhi Yang, Guangchun Luo, and Liang Lin. Fine-grained image captioning with global-local discriminative objective. IEEE Trans. on Multimedia, 2020.

[Xu et al., 2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015.

[Xu et al., 2017a] Jun Xu, Ting Yao, Yongdong Zhang, and Tao Mei. Learning multimodal attention LSTM networks for video captioning. In MM, pages 537–545, 2017.

[Xu et al., 2017b] Xing Xu, Fumin Shen, Yang Yang, Heng Tao Shen, and Xuelong Li. Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans. on Image Processing, 26(5):2494–2507, 2017.

[Xu et al., 2020] Wanru Xu, Jian Yu, Zhenjiang Miao, Lili Wan, Yi Tian, and Qiang Ji. Deep reinforcement polishing network for video captioning. IEEE Trans. on Multimedia, 2020.

[Yan et al., 2019] Chenggang Yan, Yunbin Tu, Xingzheng Wang, Yongbing Zhang, Xinhong Hao, Yongdong Zhang, and Qionghai Dai. STAT: Spatial-temporal attention mechanism for video captioning. IEEE Trans. on Multimedia, 22(1):229–241, 2019.

[Yao et al., 2017] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting image captioning with attributes. In CVPR, pages 4894–4902, 2017.

[Yao et al., 2018] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In ECCV, pages 684–699, 2018.

[Yao et al., 2019] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Hierarchy parsing for image captioning. In ICCV, 2019.

[You et al., 2016] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, pages 4651–4659, 2016.

[Young et al., 2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. of the Association for Computational Linguistics, 2:67–78, 2014.

[Zhang et al., 2018a] Biao Zhang, Deyi Xiong, and Jinsong Su. Neural machine translation with deep attention. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2018.

[Zhang et al., 2018b] Zongjian Zhang, Qiang Wu, Yang Wang, and Fang Chen. High-quality image captioning with fine-grained and semantic-guided visual attention. IEEE Trans. on Multimedia, 21(7):1681–1693, 2018.

[Zhao et al., 2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.