
Research Article
Image-Text Joint Learning for Social Images with Spatial Relation Model

Jiangfan Feng, Xuejun Fu, Yao Zhou, Yuling Zhu, and Xiaobo Luo

Chongqing University of Posts and Telecommunications, College of Computer Science and Technology, Space Big Data Intelligent Technology Chongqing Engineering Research Center, Chongqing 400065, China

Correspondence should be addressed to Jiangfan Feng; fengjf@cqupt.edu.cn

Received 27 September 2019; Revised 30 December 2019; Accepted 11 February 2020; Published 28 March 2020

Academic Editor: Mahardhika Pratama

Copyright © 2020 Jiangfan Feng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The rapid developments in sensor technology and mobile devices bring a flourish of social images, and large-scale social images have attracted increasing attention from researchers. Existing approaches generally rely on recognizing object instances individually with geo-tags, visual patterns, etc. However, a social image represents a web of interconnected relations; these relations between entities carry semantic meaning and help a viewer differentiate between instances of an entity. This article takes the perspective of spatial relationships to explore the joint learning of social images. Precisely, the model consists of three parts: (a) a module for deep semantic understanding of images based on a residual network (ResNet); (b) a deep semantic analysis module of text that goes beyond traditional bag-of-words methods; (c) a joint reasoning module in which the text weights are obtained from image features through self-attention, together with a novel tree-based clustering algorithm. The experimental results on the Flickr30k and Microsoft COCO datasets demonstrate the effectiveness of the proposed method. Meanwhile, our method considers spatial relations while matching.

1. Introduction

With the rise of cheap sensors, mobile terminals, and social networks, research on social images is making good progress, including image retrieval, object classification, and scene understanding. Compared with images in traditional applications, it is hard to understand social pictures using low-level features. Meanwhile, most of the existing methods only capture the local patterns of images by utilizing low-level features (e.g., color and texture). Intuitively, knowing the spatial relation among local elements may help predict what objects and scenes are presented in the visual content. It has recently been widely adopted in the vision community that contextual information, i.e., the relation between objects, improves the accuracy of object recognition [1]. Therefore, the geometric relation of objects in social images is usually exploited to conduct annotation, which depends on the similarity measurement of visual objects.

Significant efforts have been taken to integrate visual and textual analyses [2–4]. For example, Wang et al. [5] present an algorithm to learn the relations between scenes, objects, and texts with the help of image-level labels. However, such a training process requires a large amount of paired image and text data. Motivated by the success of the encoder-decoder network, studies have been proposed to apply it to generate text descriptions of given images [6, 7]. Nevertheless, such impressive performance relies on the assumption that the training data and the test data come from the same underlying data distribution. Some approaches [8, 9] exploit the spatial relations of objects indicated by prepositions for image understanding. They suffer from the limitations that spatial relations have to be learned with the bounding boxes of objects and cannot be driven by the task goal.

Although there exist several successful image-text learning approaches or vision-based approaches for analyzing social images, the following problems are still not addressed:

(1) Visual content and text are always learned separately, making the traditional methods hard to train end-to-end.

(2) Learning tasks are converted into classification problems, empowered by large-scale annotated data with end-to-end training using neural networks, which is not capable of describing concepts unseen in the training pairs.

(3) The spatial relations defined by prepositions have to be learned with the bounding boxes of objects, which are extremely challenging to obtain. Moreover, the spatial relationships in the textual descriptions are very scarce in reality.

Motivated by these observations, we aim at developing a method to learn the spatial relations across separate visual objects and texts for social image understanding. Therefore, this paper proposes a cross-modal framework, which builds a joint model of texts and images to extract features and combines the advantages of the self-attention mechanism and deep learning models, generating interactive effects. In particular, we investigate (1) how to label social images with high-level features based on their positional information in the image and (2) how to combine the text and visual content. The framework is established by taking spatial relationships as a basic unit of image-text joint learning. We use neural architectures to measure the semantic similarity between visual data, e.g., images or regions, and text data, e.g., sentences or phrases. Learning this similarity requires connecting low-level pixel values with high-level language descriptions and then building a joint latent space of common dimensionality, in which matching image and text features have high cosine similarity, to explore the semantics hidden inside a social image. The main contributions of this paper are summarized as follows:

(1) We propose a framework that jointly trains two dual tasks, the spatial semantics of images and text-to-image synthesis, which improves the supervision and the generalization performance of social image captioning.

(2) We extend the conventional model by adding a top-down attention mechanism. Together with a novel tree-based clustering method, it demonstrates the effectiveness of learning the alignments between visual concepts of images and the semantics of texts.

2. Related Works

The related works generally fall into two categories: image tagging and relational inference methods.

2.1. Image Tagging. Image tagging has been widely studied for the past decade. Early image tagging models are built mainly from the view of statistics and probability. In practice, many image annotation works [10, 11] assign top-k class labels to each image; since the quantities of class labels in different photos vary significantly, the top-k annotations degrade the performance of image annotation. Besides, many works attempt to infer correlations or joint probability distributions between images and semantic concepts (or keywords). For example, Farhadi et al. [12] use detection methods to understand scene elements. Similarly, Li et al. [13] start with detections and piece together a final description using phrases containing detected objects and relationships.

Further, powerful language models based on language parsing have been used as well [14]. Recently, deep learning has achieved great success in the fields of image, text, and speech, for example, the m-RNN model [15], in which a multimodal component is introduced to explicitly connect the language model and the vision model by a one-layer representation.

Images can be labelled at the pixel level, which has applications in intelligent video monitoring, self-driving cars, etc. More recently, the variable label number problem has been identified [16, 17]. These solutions treat the image annotation problem as an image-to-text translation problem and solve it using an encoder-decoder model. The multiscale approaches [18] propose a novel multiscale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Instead of CNN features, some works use more semantic information obtained from the image as the input to the decoder [19, 20]. In all, most methods still focus on recognizing objects separately; the spatial relationships between objects and scenes are always neglected.

2.2. Relational Inference. The earliest form of reasoning dates back to symbolic approaches, in which the ties between symbols are established in logical and mathematical language and interpreted in terms of deduction, arithmetic, and algebra. As symbolic approaches suffer from the symbol grounding problem, they are not robust to small task and input variations [21, 22]. In many other methods, such as deep learning with a traditional inference network, the inference part may be a multilayer perceptron (MLP), Long Short-Term Memory (LSTM), or LSTM with attention, which often runs into the problem of weak data [23, 24]. Santoro et al. [21] proposed a relation network to realize the reasoning part. This structure clearly expresses two ideas: the final answer has to do with pairs of objects, and the problem itself will influence how to examine the objects. As an active research field, some recent works have also applied neural networks to structured graph data or tried to standardize network output through relations and knowledge bases. We believe that visual data reasoning should be both local and global; abandoning the two-dimensional image structure when tackling the task is not valid. Beyond the object detection task that detects relationships for an image, the scene graph generation task [25, 26] endows the model with an entire structured representation capturing both objects and their semantic relationships, where the nodes are object instances in images and the edges depict their pairwise correlations. These methods usually require additional annotations of relations, whereas our method demands only image-level annotations.

3. Cross-Modal Reasoning Framework

3.1. Overview of Cross-Modal Tasks and Pipeline. In this paper, we focus on two image-text tasks: spatial relation modeling and image-text matching. The former refers both to image-to-image and image-to-text relations, and the definitions of the two scenarios are straightforward: given an input image (resp. sentence), the goal is to find the relationships between entities with semantic meaning. The second task refers to finding the best matching sentences for the input images.

The architecture for the above tasks consists of a four-step pipeline, summarized here. The first step is to design functional feature extractors: we use a refined TF-IDF method [27] to calculate word frequencies and then combine them with the embedding vectors, which helps us to map them nonlinearly into another vector space, enhancing the semantic association between words and lowering the dimension. The second step is to generate full image features and geographical features; since accuracy gets saturated and then degrades rapidly when the network is deepened continuously [28], we adopt a residual network. The third step is to add a self-attention mechanism, using the image feature to obtain the weights of the words; we then combine the image and text features to deduce the spatial relations of the picture. The final step performs instance recognition, which provides a deep semantic understanding of social images.

At a conceptual level, there are two branches to achieve this goal. One is to train the network to map images and texts into a joint embedding space. The second approach is to frame the correspondence between pictures and documents by cosine similarity to obtain the probability that the two items match. Accordingly, we define a cross-modal reasoning framework to exploit image-text joint learning for social image retrieval with a spatial relation model, which includes two variants of two-branch networks that follow these two strategies (Figure 1): the embedding network and the similarity network.

Two networks are needed in this framework, one for cross-modal topic detection and the other for semantic matching, and they are trained one by one. The embedding network refers to cross-domain spatial relation modeling, as illustrated in Figure 1(a). The spatial relation includes both image-to-text and image-to-image relations, and the goal is to model spatial relationships in CNN-based detection. The similarity network is illustrated in Figure 1(b): we first pretrain on image data and text-image data, respectively, and then fine-tune on the target-domain training data via reinforcement learning. In detail, we use ResNet-50 as the CNN encoder of the framework to learn the social image annotations. By adding the cross-domain spatial relation model, the structure can attend to the crucial parts of the image when decoding different words of the caption to generate interactive effects. Joint models with image-text similarity are given by cosine similarity [29].

3.2. Cross-Domain Spatial Relation Modeling. The spatial relation between visual content and texts is presented at two levels: first, by the matching probability between image and text, and second, by the spatial relationships between the image objects, represented by the interaction of the objects in different tasks.

3.2.1. Spatial Relation between Images and Texts. Motivated by attention-based deep methods [30, 31], we first review a basic attention module called self-attention. As illustrated in Figure 2, the input consists of queries and keys of dimension $d_k$ and values of dimension $d_v$. The dot product is performed between the query and all keys to obtain their similarity. A softmax function is applied to obtain the weights on the values. Given keys and values, the output is the weighted average over the input values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \quad \mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \ast \mathrm{Value}_i \qquad (1)$$

In our case, we apply modifications to the output values. Specifically, we define attention as follows: the source consists of a set of two-dimensional vectors <key, value>; given an element of a target named query, we calculate the similarity or relevance between the query and each key to get a weight coefficient for the value corresponding to every key. The weighted sum is then performed by

$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i \qquad (2)$$

where $x$ is the vector of an input image or word, $L_x$ is the length of $x$, Query is the image vector, $\mathrm{Key}_i$ is the word vector, and the dot product is performed to calculate similarity.

We apply the perceptron to calculate the weight of the word vector. In self-attention, each word attends to all the other words. The aim is to learn the internal word dependence and to capture the internal structure of the text. The characteristic of self-attention lies in ignoring the distance between the sentence and the image and directly computing the inner structure of a sentence. Further, parallel computation is also relatively simple to realize.
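To make the attention computation in equations (1) and (2) concrete, here is a minimal NumPy sketch of scaled dot-product attention; the dimensions, variable names, and toy data are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity between the query and every key
    weights = softmax(scores, axis=-1)   # attention weights over the values
    return weights @ V                   # weighted average of the values

# Toy example: one image query attending over five word key/value vectors.
rng = np.random.default_rng(0)
query = rng.normal(size=(1, 64))         # image vector
keys = rng.normal(size=(5, 64))          # word vectors
values = rng.normal(size=(5, 64))
context = scaled_dot_product_attention(query, keys, values)
print(context.shape)                     # (1, 64)
```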

Inspired by the idea of learning to rank [32, 33], we measure the similarity between two samples using the cosine distance:

$$D(f_{x_i}, f_{x_j}) = \frac{f_{x_i}}{\|f_{x_i}\|_2} \cdot \frac{f_{x_j}}{\|f_{x_j}\|_2} \qquad (3)$$

where $\|\cdot\|_2$ denotes the L2-norm and $D(f_{x_i}, f_{x_j}) \in [-1, 1]$. To effectively take the two modalities into account, the ranking loss can be written as

$$L_{\mathrm{rank}} = \left[\max\left(0, \alpha - D(f_{I_a}, f_{T_a})\right)\right]_{I\ \mathrm{anchor}} + \left[\max\left(0, \alpha - D(f_{I_a}, f_{T_a})\right)\right]_{T\ \mathrm{anchor}} \qquad (4)$$

where $I$ denotes the visual input, $T$ denotes the text input, and $\alpha$ is a margin. Then, we use the idea of triplet loss [34], which is one of the most widely used similarity functions. The triplet loss function consists of (a) a sample called Anchor (written as $x_a$) that is chosen randomly from the training dataset, (b) a sample of the same class as the Anchor called Positive (written as $x_p$), and (c) an example whose class is different from the Anchor called Negative (written as $x_n$). The purpose of the triplet loss is to minimize the distance between $x_a$ and $x_p$ and maximize the distance between $x_a$ and $x_n$. We call this kind of ranking loss the triplet ranking loss, and its formulation is

$$L_{\mathrm{rank}}(x_a, x_p, x_n) = \max\left[0, \alpha + D(x_a, x_p) - D(x_a, x_n)\right] \qquad (5)$$
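The cosine similarity of equation (3) and the triplet ranking loss of equation (5) can be sketched as follows; here $D$ is interpreted as a cosine distance (1 − cosine similarity) so that the hinge pulls the matching pair together, and the margin value is only an example.

```python
import numpy as np

def cosine_sim(a, b):
    # Equation (3): dot product of L2-normalized vectors, in [-1, 1].
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    # Equation (5): push the positive closer to the anchor than the
    # negative by at least the margin (D taken as 1 - cosine similarity).
    d_pos = 1.0 - cosine_sim(anchor, positive)
    d_neg = 1.0 - cosine_sim(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)

# Toy triplet: an image anchor, its matching caption, and a random caption.
rng = np.random.default_rng(1)
img, cap_pos, cap_neg = rng.normal(size=(3, 128))
print(triplet_ranking_loss(img, cap_pos, cap_neg))
```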

Figure 2: Example to illustrate the process of building an attention model (keys are scored against the query, passed through a softmax, and the values are summed into the attention value).

Figure 1: The cross-modal reasoning framework. (a) Embedding network (cross-domain spatial relationship modeling). (b) Similarity network (cross-modal matching).


3.2.2. Spatial Relation in Image Objects. We now describe the object relation computation. The basic idea of the spatial relationships between image objects has its roots in the crucial inference that a visual or hidden object in the image space tends to concentrate for relevant purposes but distributes randomly for irrelevant ones. Let an object consist of its coordinate feature $f_C$ and appearance feature $f_A$. In this work, $f_C$ is a 4-dimensional object bounding box with relative image coordinates, and $f_A$ is related to the task.

To be specific, given an input set of $N$ visual or hidden objects $\{(f_C^{n}, f_A^{n})\}_{n=1}^{N}$, the relation feature $f_R(n)$ of the whole set with respect to the $n$th object is computed as

$$f_R(n) = \sum_{m} \omega^{mn} \cdot \left(W_V \cdot f_A^{m}\right) \qquad (6)$$

The output is a weighted sum of appearance features from the other objects, linearly transformed by $W_V$, which corresponds to the values $V$ in equation (1). The relation weight $\omega^{mn}$ indicates the impact of the other objects. $W_V$ and $\omega^{mn}$ can be calculated by the object-relations model [35]. There are two steps. First, the coordinate features of the two objects are embedded into a high-dimensional representation by ResNet; inspired by the widely used bounding box regression method DPM [36], the elements are transformed using $\log(\cdot)$ to handle distant objects and close-by objects. Second, the embedded feature is turned into a scalar weight and trimmed. The trimming operation restricts relations to objects with individual spatial relationships, which is related to the task and knowledge.

The cross-domain spatial relation model gives images and texts the same dimension at the output, which makes it an essential building block of the framework.
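A rough NumPy sketch of the relation feature in equation (6) is given below. The softmax-normalized weights combining appearance similarity with log-transformed geometry are a simplification of the object-relation module [35]; all dimensions, the weighting scheme, and the self-exclusion choice are assumptions for illustration only.

```python
import numpy as np

def relation_features(f_A, f_C, W_V, W_G):
    """f_R(n) = sum_m w^{mn} * (W_V f_A^m), equation (6).

    f_A: (N, d_a) appearance features, f_C: (N, 4) relative box coordinates.
    W_V: (d_a, d_a) value transform, W_G: (4,) geometric weights (simplified).
    """
    N = f_A.shape[0]
    values = f_A @ W_V.T                              # W_V applied to appearance features
    out = np.zeros_like(values)
    for n in range(N):
        # Log-transformed relative geometry, as in bounding-box regression.
        geo = np.log(np.abs(f_C - f_C[n]) + 1e-6) @ W_G
        app = f_A @ f_A[n]                            # appearance similarity (dot product)
        logits = app + geo
        logits[n] = -np.inf                           # simplification: exclude self-relation
        w = np.exp(logits - logits.max())
        w /= w.sum()                                  # relation weights w^{mn}
        out[n] = w @ values                           # weighted sum over the other objects
    return out

rng = np.random.default_rng(2)
N, d_a = 4, 16
f_A = rng.normal(size=(N, d_a))
f_C = rng.uniform(size=(N, 4))                        # relative [x, y, w, h] boxes
print(relation_features(f_A, f_C, rng.normal(size=(d_a, d_a)), rng.normal(size=4)).shape)
```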

3.3. Similarity Network

3.3.1. Image Representation. As the gradient vanishing problem prevents a deep network from being fully learned, we adopt ResNet-50 to avoid loss of accuracy by learning residual functions that reformulate the layers with respect to their inputs, instead of learning unreferenced functions. The building block is defined as

$$y = f(x, \{w_i\}) + x \qquad (7)$$

where $x$ is the input vector and $y$ is the output vector of the layers considered. The function $f$ represents the residual mapping to be learned.

As a building block has two line shortcuts, we consider another option. Due to the shortcut connection and element-wise addition, it operates as $F + x$. In equation (7), the numbers of channels of $x$ and $y$ are equal. In the opposite situation, we adopt the calculation

$$y = f(x, \{w_i\}) + w_s x \qquad (8)$$

where $w_s$ is a convolution operation that adjusts the channel dimension of $x$.

We learn the image representation vectors from the lower convolutional layers. In this manner, the decoder can attend to specific parts of an image by selecting a subset of the feature vectors. In this study, we employed 2048 feature maps of the 50-layer residual network.
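The building blocks of equations (7) and (8) can be sketched in PyTorch as follows; this is a generic residual block rather than the exact ResNet-50 bottleneck used in the paper, and the channel sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """y = f(x, {w_i}) + x (eq. 7), or y = f(x, {w_i}) + w_s x (eq. 8) when channels change."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # w_s: 1x1 convolution that adjusts the channel dimension of x (eq. 8).
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # residual mapping f
        return F.relu(f + self.shortcut(x))                        # element-wise addition

x = torch.randn(2, 64, 56, 56)
print(ResidualBlock(64, 128, stride=2)(x).shape)  # torch.Size([2, 128, 28, 28])
```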

3.3.2. Text Representation. A prevalent model architecture for Natural Language Processing (NLP) is one-hot coding. Unfortunately, when the output space grows, features cannot properly span the full feature space; consequently, one-hot encoding might be insufficient for fine-grained tasks, since projecting the outputs into a higher-dimensional space dramatically increases the parameter space of the computed models. Also, for datasets with a large number of words, the ratio of samples per word is typically reduced.

A straightforward way to overcome the limitations mentioned above is to relax the problem into a real-valued linear programming problem and then threshold the resulting solution. We combine the word embedding vector with the frequency given by TF-IDF [27], which depicts the occurrence frequency of a word across all texts, not just the number of occurrences. It allows each word to establish a global context and transforms the high-dimensional sparse vector into a low-dimensional dense vector. Specially, we use self-attention, as it enables each word to learn its relation to other words.
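As a rough illustration of combining word embeddings with TF-IDF weights, the sketch below computes plain TF-IDF by hand and uses it to weight randomly initialized embeddings into a dense sentence vector; the paper's refined TF-IDF variant [27], learned embedding table, and dimensions are not specified here, so everything below is an assumption.

```python
import numpy as np
from collections import Counter

corpus = ["a man surfs in the water",
          "a rock climber climbs between two large rocks",
          "two boys in a field kicking a soccer ball"]
docs = [s.split() for s in corpus]
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

# Inverse document frequency over the whole corpus.
df = Counter(w for d in docs for w in set(d))
idf = {w: np.log(len(docs) / df[w]) + 1.0 for w in vocab}

emb_dim = 32
embeddings = np.random.default_rng(3).normal(size=(len(vocab), emb_dim))  # stand-in for learned embeddings

def sentence_vector(sentence):
    """TF-IDF-weighted average of word embeddings -> low-dimensional dense vector."""
    words = sentence.split()
    tf = Counter(words)
    weights = np.array([tf[w] / len(words) * idf[w] for w in words])
    vecs = np.stack([embeddings[vocab[w]] for w in words])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

print(sentence_vector("a man surfs in the water").shape)  # (32,)
```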

3.3.3. Cross-Modal Matching

Cosine Similarity. The cosine of the angle between two vectors effectively avoids differences of degree in the same cognition of individuals and pays more attention to the differences between dimensions rather than the differences in numerical values. We use the extracted image feature vector to allocate the weights of the word vectors, rather than obtaining the weight of each word from the text alone. Then, we use the resulting weights and the corresponding word vectors to perform dot multiplication with the image vector to get the similarity between image and text. This allows the semantics of the image and the text to interact.
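A minimal sketch of this matching step: the image feature allocates softmax weights to the word vectors, and the weighted text vector is then compared with the image vector by cosine similarity. The normalization choices are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def image_text_similarity(image_vec, word_vecs):
    """Image-guided word weighting followed by cosine matching."""
    weights = softmax(word_vecs @ image_vec)   # image feature allocates weights to words
    text_vec = weights @ word_vecs             # weighted text representation
    return float(image_vec @ text_vec /
                 (np.linalg.norm(image_vec) * np.linalg.norm(text_vec)))

rng = np.random.default_rng(4)
image_vec = rng.normal(size=128)
word_vecs = rng.normal(size=(8, 128))          # one row per word in the caption
print(image_text_similarity(image_vec, word_vecs))
```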

Tree-Based Clustering Vector Quantization Algorithm. Let $x \in \mathbb{R}^d$ be a $d$-dimensional instance and $y \in \{1, 2, \ldots, q\}$ be the class space with a set of $q$ possible classes. Note that two tasks are needed here: tree-based training and retrieval. After obtaining the features of the images and texts, the images and texts lie in the same vector space, and we can apply a scalable K-means++ clustering algorithm [37] to both image and text vectors.

We develop a tree-based algorithm for cross-modal learning, which presents a tree-based classification model for multiclass learning. A tree-based structure is constructed where the root node corresponds to the classes in the training dataset. Each node $v$ contains a set of k-means++ classifiers. The top node contains all the training data; at the top level, the whole dataset is partitioned into five data subsets A, B, C, D, E. The instances are recursively partitioned into smaller subsets while moving down the tree, and each internal node contains the training instances of its child nodes. In particular, each node in the tree contains two components: cluster vectors and predictive class vectors.


The cluster vector is a vector with real values that measures the clusters at a node, and we adopt the definition

$$p_v(n) = \frac{\sum_{x_i \in D_v} C_i^{n}}{|D_v|} \qquad (9)$$

where $p_v(n)$ is the $n$th component of $p_v$, $|D_v|$ is the total number of instances in node $v$, and $C_i = \{Y_{i1}, Y_{i2}, \ldots, Y_{iq}\} \in \{0, 1\}^{q}$. The predictive class vector is a vector with boolean values indicating the membership of predictive classes at a node. The value is 1 when $p_v(n)$ is larger than the threshold, which implies that the node is the proper class or subclass for the instances of node $v$. Note that $C_i$ and the threshold are obtained from the training process.

The algorithm uses three stopping criteria to determine when to stop growing the tree: (1) a node is a leaf node if it is identified as a predictive class; (2) a node is a leaf node if the data cannot be partitioned into more than three clusters using the classifiers; (3) the tree reaches the maximum depth.

Figure 3 shows the flowchart of the training and retrieval steps. We save the tree model and calculate the position of each real image or text in the leaf nodes; the path is the picture's fingerprint. Once all picture fingerprints are saved, the tree model is built. When matching, it is also necessary to extract image features first. Starting from the root node, each clustering model recursively predicts the category of the vector at a different level. After the image has fallen to the leaf nodes, the path of the leaf nodes is output as the fingerprint of the picture. Cosine distance is then used to find the text with the same fingerprint, and sorting it yields the annotation. The process is shown in Algorithms 1 and 2 and sketched in the code below.
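The following sketch approximates Algorithms 1 and 2: features are partitioned recursively with scikit-learn's k-means++ initialization, and the path of cluster indices from root to leaf serves as the fingerprint; retrieval would then compare fingerprints before ranking candidates by cosine similarity. Cluster counts, depth, and stopping thresholds are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_fingerprints(features, n_clusters=5, max_depth=3, min_size=10, prefix=()):
    """Recursively cluster features with k-means++; return {index: fingerprint path}."""
    fingerprints = {i: prefix for i in range(len(features))}
    if len(features) < min_size or len(prefix) >= max_depth:
        return fingerprints  # leaf node: stop growing the tree
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(features)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        sub = build_fingerprints(features[idx], n_clusters, max_depth, min_size, prefix + (c,))
        for local_i, fp in sub.items():
            fingerprints[idx[local_i]] = fp
    return fingerprints

rng = np.random.default_rng(5)
feats = rng.normal(size=(200, 128))          # image/text vectors in the joint space
fps = build_fingerprints(feats)
print(fps[0])                                # e.g. (2, 0, 4): the path is the fingerprint
```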

4. Experiments and Results

4.1. Datasets and Evaluation. We use Flickr30k and Microsoft COCO, which are widely used in caption-image retrieval and image caption generation tasks, to evaluate our proposed method. As a popular benchmark for annotation generation and retrieval tasks, Flickr30k contains 31,783 images, focusing mainly on people and animals, and 158,915 English captions (five per image). We randomly select 30,783 images for training and another 1,000 for the testing split.

Microsoft COCO is the largest existing dataset with both captions and region-level annotations; it consists of 82,783 training images and 40,504 validation images, and five sentences accompany each image. We use a total of 82,783 images for training and a testing split of 5,000 images. The testing split consists of images with more than three instances, selected from the validation dataset. For each testing image, we use the publicly available division, which is commonly used in caption-image retrieval and caption generation tasks.

4.2. Implementation Details. We use ResNet-50 to extract 2048 feature maps with a size of 1×1 in "conv5_3", which helps us to integrate the high-level features with the lower ones; the visualization results are provided in Figure 4. By adding a fully connected layer whose dimension is 128, we obtain the graph embedding, which we name v1. In the text representation, the input layer is followed by two fully connected layers with dimensions d1 and d2, respectively. The output of the last fully connected layer is the output of the text module embedding, and we name it v2. We add a self-attention mechanism to the two embedding networks and compute v1′ and v2′. Then, v1′ and v2′ are connected to a triplet ranking loss. Further, we use the Adam optimizer to train the system.

After offline debugging, we finally set the input layer length of the text module to 100,000. After counting the occurrences of all sentence segmentation words, we keep the 99,999 most frequent words and map the remaining words to a single new word. In total, the number of neurons in the input layer is 100,000.

As for the text domain, the number of neurons in the second layer is 512 and in the third is 128. Meanwhile, the length of the graph embedding is also 128. We set the margin α (see equation (5)), and the parameters of Adam's algorithm are as follows: lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e−08, decay = 0.0.
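Under the stated dimensions (text input 100,000 → 512 → 128; image embedding of length 128) and the listed Adam hyperparameters, the two embedding heads and the optimizer might be configured roughly as follows; the 2048-dimensional ResNet-50 feature and the layer arrangement are assumptions consistent with the text, not the authors' exact code.

```python
import torch
import torch.nn as nn

# Text branch: 100,000-dim input -> 512 -> 128.
text_head = nn.Sequential(
    nn.Linear(100_000, 512), nn.ReLU(),
    nn.Linear(512, 128),
)
# Image branch: 2048-dim ResNet-50 feature -> 128-dim graph embedding v1.
image_head = nn.Linear(2048, 128)

params = list(text_head.parameters()) + list(image_head.parameters())
# Adam with lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0.
optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)

v1 = image_head(torch.randn(4, 2048))       # image embeddings
v2 = text_head(torch.randn(4, 100_000))     # text embeddings
print(v1.shape, v2.shape)                   # torch.Size([4, 128]) torch.Size([4, 128])
```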

To speed up training, we conducted negative sampling. The sample sentence that matches the current image is positive. To thoroughly train the low-frequency sentences, we randomly select sentences outside the matching sample as negative samples.

4.3. Results of Experiments. Microsoft COCO consists of 123,287 images and 616,767 descriptions; five text descriptions accompany each image. We randomly select the public 82,783 images for training and keep 5,000 images as test data. We can see in Figure 5 that our loss function is slightly better than the others when the number of iterations exceeds 400. The softmax loss is good at optimizing the distance between classes, but it is weak when it comes to optimizing the range within categories. The triplet ranking loss addresses this problem to some extent.

As shown in Figure 6, we can observe that words such as "climbs," "rests at," and "in the," which show the spatial relationships between objects, get significantly higher scores than other words, while words like "the" and "a" always have lower matching scores. However, some prepositions containing positional relations, such as "by" in the upper-right corner, have higher scores than other prepositions.

The use of attention causes features related to spatial relation traits to be weighted. As shown in Figure 7, for example, in the image in the upper-right corner, we can infer from the coins in the cup next to the person lying down that he is begging.

4.4. Comparison with the State-of-the-Art

4.4.1. Qualitative Evaluation. To evaluate the effectiveness of the proposed method, we first compare its classification performance with state-of-the-art performance reported in the literature. We manually labelled the ground-truth mapping from the nouns to the object classes. By varying the threshold, the precision-recall curve can be sketched to measure accuracy. We use the commonly used "R@1," "R@5," and "R@10," i.e., recall rates at the top 1, 5, and 10 results.


Figure 3: Flowchart of training and searching steps for the tree-based algorithm. (a) Flowchart of training steps (ResNet features are clustered recursively until the stopping criteria are met, and fingerprints are saved). (b) Flowchart of searching steps (class prediction descends to a leaf node, the fingerprint is output, and similar fingerprints are ordered by the computed distance).

Input: M, a social image; D, the current training instances; N, cluster numbers; R, a set of features to represent an image; FP, fingerprints for the image; T, cluster tree.
Output: the fingerprints for M.
(1) Initialize N, R, T, FP and resize M
(2) Compute R by ResNet to represent M
(3) Repeat
(4)   Create a node v
(5)   Update R according to N clusters of M with k-means++
(6)   Let Dj be the set of data in D satisfying outcome j
(7)   Attach the node returned by T(Dj) to node v
(8) Until Stop(D, R) is true
(9) Return FP as a fingerprint for the image and attach leaf nodes to FP

Algorithm 1: Build fingerprints with a cluster tree.

Input: M, a social image; MD, image dataset for search; N, size of the image dataset; FP, fingerprint for the image.
Output: S, top-k similar images for image M.
(1) Initialize M, MD, FP, S and resize images
(2) M and MD are represented as R by ResNet
(3) Run the checking procedure to find the top-k similar images to M in MD
(4) For each image from MD, get a cosine similarity by FP of the image pair
(5) Return top-k similar images ordered by cosine similarity

Algorithm 2: Find top-k similar images by fingerprints.


Figure 4: The visualization of ResNet-50. (a) After 50 layers of convolution. (b) Fusion at a 1:1 ratio.

Figure 5: Recall curves when training on Microsoft COCO (recall versus iterations ×10 for the ranking loss, center loss, and softmax).

Figure 6: Similarity matching between image and text (per-word matching scores for four example captions).


As shown in Tables 1 and 2, we compare our proposed method with some of the latest and most advanced approaches on the Flickr30k and Microsoft COCO datasets. We find that our approach works better than the compared methods. Different from DSPE+FV*, which uses external text corpora to learn discriminative sentence features, our model learns them directly from scratch in an end-to-end manner. When comparing among our own methods, we can conclude as follows: (a) our attention scheme is effective, since the model with attention consistently outperforms those without attention on both datasets; (b) using ResNet as the underlying network helps to understand the deep semantics of images and to obtain the spatial relation features with the help of the text context relations.

On the Microsoft COCO dataset, annotation takes 235 seconds with a brute-force algorithm and 18 seconds with the tree-based clustering algorithm.

4.4.2. Quantitative Evaluation. We also report quantitative evaluation results with the frequently used BLEU metric [46] for the proposed datasets. The results for the Microsoft COCO and Flickr30K datasets are listed in Tables 3 and 4. The image encoders of all methods listed here are either VGG-Net or ResNet, which are prevalent in this field. We also report an ablation test in which the spatial relation model is discarded.

The results demonstrate that our method performs better than the variant that discards the spatial relation model.

Figure 7: Examples of attending to the correct object (white indicates the attended regions; the words below the figure indicate the labelling of spatial relation features). (a) Harvest. (b) Begging. (c) Search. (d) Salvage.

Table 1: Results of caption-image retrieval on the Flickr30K dataset.

Method                      R@1   R@5   R@10
RVP (T+I) [38]              11.9  27.7  47.7
CDCCA [39]                  16.8  39.3  53.0
MNLM [40]                   23.0  50.7  62.9
BRNN [41]                   22.2  48.2  61.4
DSPE+FV* [42]               40.3  68.9  79.9
m-RNN [15]                  33.9  65.1  76.3
m-CNN [43]                  34.6  64.7  76.8
Softmax-VGG                 21.8  50.7  61.7
Softmax-ResNet              22.3  52.9  63.2
Softmax-ResNet-Attention    35.6  65.4  75.6
Ranking-ResNet-Attention    41.3  69.1  81.3
*Using external text corpora, the ensemble, or multimodel methods.

Table 2: Results of caption-image retrieval on the Microsoft COCO dataset.

Method                      R@1   R@5   R@10
DSPE+FV*                    49.4  76.1  86.4
m-RNN                       41.7  72.7  83.5
Sg-LSTM [44]                42.8  73.9  84.7
MNLM                        43.9  75.7  85.8
DVSA [45]                   39.2  69.9  80.5
m-CNN                       43.1  73.9  84.1
Softmax-VGG                 47.5  74.9  84.3
Softmax-ResNet              49.1  76.1  85.2
Softmax-ResNet-Attention    50.3  77.5  87.5
Ranking-ResNet-Attention    52.7  79.3  88.7
*Using external text corpora, the ensemble, or multimodel methods.

Table 3: The results of different caption methods on Microsoft COCO.

Method                       Bleu-1  Bleu-2  Bleu-3  Bleu-4
SCA-CNN [19]                 0.714   0.543   0.409   0.309
Semantic ATT [30]            0.699   0.415   0.306   0.232
SCN-LSTM [20]                0.726   0.581   0.421   0.318
Softmax-ResNet-Attention     0.736   0.587   0.423   0.325
Ranking-ResNet-Attention     0.741   0.598   0.438   0.319
Softmax-ResNet-Attention*    0.703   0.524   0.413   0.311
Ranking-ResNet-Attention*    0.705   0.540   0.416   0.313
*Discarding the spatial relation model.

Table 4: The results of different caption methods on Flickr30K.

Method                       Bleu-1  Bleu-2  Bleu-3  Bleu-4
SCA-CNN [19]                 0.705   0.515   0.402   0.304
Semantic ATT [30]            0.694   0.427   0.301   0.224
SCN-LSTM [20]                0.718   0.569   0.413   0.308
Softmax-ResNet-Attention     0.715   0.573   0.429   0.325
Ranking-ResNet-Attention     0.729   0.581   0.416   0.319
Softmax-ResNet-Attention*    0.706   0.533   0.406   0.301
Ranking-ResNet-Attention*    0.712   0.542   0.418   0.309
*Discarding the spatial relation model.


Besides, our approach is slightly better than the compared state-of-the-art methods, which verifies the efficiency of topic-conditioning in image captioning. Note that our model uses ResNet-50 as the encoder with a simple attention model; thus, our approach is competitive with those models.
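For reference, BLEU-1 through BLEU-4 scores of the kind reported in Tables 3 and 4 can be computed with NLTK as in the sketch below; the captions are made up, and the authors' exact evaluation script is not specified in the paper.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption and its set of reference captions (toy example).
hypotheses = ["a man surfs in the water".split()]
references = [["a man is surfing in the water".split(),
               "a surfer rides a wave".split()]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # BLEU-n uses uniform n-gram weights
    score = corpus_bleu(references, hypotheses, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```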

5. Discussion and Conclusion

This paper proposes an integrated model to recognize instances and objects jointly by leveraging the associated textual descriptions, and presents a learning algorithm to estimate the model efficiently. The learning process requires only separate images and texts without high-level captions. We use the residual network to deepen the learning of image semantics and combine it with the text to obtain some hidden relation features contained in the picture. By constructing a joint inference module with self-attention, we fuse local and global elements. We also show that integrating images and text for deep semantic understanding allows labelling the spatial relation features. Furthermore, the use of a tree clustering algorithm accelerates the matching process. Experiments verify that the proposed method achieves competitive results on two generic annotation datasets, Flickr30k and Microsoft COCO.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The project was sponsored by the National Natural Science Foundation of China (41571401 and 41871226) and the Natural Science Foundation of Chongqing, China (cstc2019jcyj-msxmX0131).

References

[1] J. Zhu, L. Fang, and P. Ghamisi, "Deformable convolutional neural networks for hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 8, pp. 1254–1258, 2018.

[2] W. Lu, J. Li, T. Li, W. Guo, H. Zhang, and J. Guo, "Web multimedia object classification using cross-domain correlation knowledge," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1920–1929, 2013.

[3] M. Katsurai, T. Ogawa, and M. Haseyama, "A cross-modal approach for extracting semantic relationships between concepts using tagged images," IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1059–1074, 2014.

[4] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei, "Referring relationships," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6867–6876, Salt Lake City, UT, USA, June 2018.

[5] B. Wang, D. Lin, H. Xiong, and Y. F. Zheng, "Joint inference of objects and scenes with efficient learning of text-object-scene relations," IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 507–520, 2016.

[6] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, "Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.

[7] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, "What value do explicit high level concepts have in vision to language problems?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 203–212, Las Vegas, NV, USA, June 2016.

[8] M. J. Choi, A. Torralba, and A. S. Willsky, "A tree-based context model for object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 240–252, 2012.

[9] R. Mottaghi, X. Chen, X. Liu et al., "The role of context for object detection and semantic segmentation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014.

[10] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the Computer Vision and Pattern Recognition, pp. 2960–2968, Las Vegas, NV, USA, June 2016.

[11] J. Johnson, L. Ballan, and L. Fei-Fei, "Love thy neighbors: image annotation by exploiting image metadata," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4624–4632, Tampa, FL, USA, December 2015.

[12] A. Farhadi, M. Hejrati, M. A. Sadeghi et al., "Every picture tells a story: generating sentences from images," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Heidelberg, 2010.

[13] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proceedings of the Conference on Computational Natural Language Learning, Portland, OR, USA, June 2011.

[14] M. Mitchell, X. Han, J. Dodge et al., "Generating image descriptions from computer vision detections," in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 2012.

[15] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in Proceedings of the ICLR, pp. 1–17, San Diego, CA, USA, May 2015.

[16] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the CVPR, pp. 2960–2968, Las Vegas, NV, USA, June 2016.

[17] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun, "Semantic regularisation for recurrent image annotation," 2016, https://arxiv.org/abs/1611.05490.

[18] Y. Niu, Z. Lu, J.-R. Wen, T. Xiang, and S.-F. Chang, "Multi-modal multi-scale deep learning for large-scale image annotation," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1720–1731, 2019.

[19] L. Chen, H. Zhang, J. Xiao et al., "Spatial and channel-wise attention in convolutional networks for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6298–6306, Honolulu, HI, USA, July 2017.

[20] Z. Gan, "Semantic compositional networks for visual captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1141–1150, Honolulu, HI, USA, July 2017.


[21] A. Santoro, D. Raposo, D. G. Barrett et al., "A simple neural network module for relational reasoning," pp. 4967–4976.

[22] S. Harnad, "The symbol grounding problem," Physica D: Nonlinear Phenomena, vol. 42, no. 1–3, pp. 335–346, 1990.

[23] M. Garnelo, K. Arulkumaran, and M. Shanahan, "Towards deep symbolic reinforcement learning," 2016, http://arxiv.org/abs/1609.05518.

[24] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, "Building machines that learn and think like people," Behavioral and Brain Sciences, vol. 40, 2017.

[25] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, "Scene graph generation by iterative message passing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3097–3106, Honolulu, HI, USA, July 2017.

[26] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, "Scene graph generation from objects, phrases and region captions," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1270–1279, Venice, Italy, October 2017.

[27] I. Yahav, O. Shehory, and D. Schwartz, "Comments mining with TF-IDF: the inherent bias and its removal," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 3, pp. 437–450, 2019.

[28] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, July 2016.

[29] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: a unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, Boston, MA, USA, 2015.

[30] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659, Las Vegas, NV, USA, June 2016.

[31] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, http://arxiv.org/abs/1409.0473.

[32] A. Karpathy, A. Joulin, and L. F. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 1889–1897, Montreal, Canada, 2014.

[33] H. Nam, J. Ha, and J. Kim, "Dual attention networks for multimodal reasoning and matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2156–2164, Honolulu, HI, USA, July 2017.

[34] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.

[35] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, Salt Lake City, UT, USA, June 2018.

[36] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, 2010.

[37] B. Bahmani, "Scalable K-Means++," in Proceedings of the VLDB Endowment, 5(7), pp. 622–633, Istanbul, Turkey, August 2012.

[38] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," 2014, http://arxiv.org/abs/1411.2539.

[39] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, "Category-based deep CCA for fine-grained venue discovery from multimodal data," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, 2019.

[40] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," Computer Vision-ECCV 2016, pp. 694–711, 2016.

[41] M. Sinecen, "Comparison of genomic best linear unbiased prediction and bayesian regularization neural networks for genomic selection," IEEE Access, vol. 7, pp. 79199–79210, 2019.

[42] L. Wang, Y. Li, and S. Lazebnik, "Learning deep structure-preserving image-text embeddings," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, June 2016.

[43] Y. Y. Zhang, D. S. Zhou et al., "Single-image crowd counting via multi-column convolutional neural network," in Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 589–597, Las Vegas, NV, USA, June 2016.

[44] Y. Xian and Y. Tian, "Self-guiding multimodal LSTM: when we do not have a perfect training dataset for image captioning," IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5241–5252, 2019.

[45] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664–676, 2017.

[46] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the Conference of the Association for Computational Linguistics, Philadelphia, PA, USA, 2002.


to-end training using neural networks which is notcapable of describing concepts unseen in the trainingpairs

(3) -e spatial relations defined by prepositions have tobe learned with the bounding boxes of objects whichare so immoderately challenging to obtain More-over the spatial relationships from the textual de-scriptions are very scarce in reality

Motivated by these observations we aim at developing amethod to learn the spatial relations across separate visualobjects and texts for social image understanding -ereforethis paper proposes a cross-modal framework which buildsa joint model of texts and images to extract features andcombine the advantages of self-attention mechanism anddeep learning models generating interactive effects Inparticular we investigate (1) how to label social images withhigh-level features based on their political image positionand (2) how to combine the text and visual content -eframework is established by taking spatial relationships as abasic unit of image-text joint learning We use neural ar-chitectures to measure the semantic similarity between vi-sual data eg images or regions and text data egsentences or phrases Learning this similarity requiresconnecting low-level pixel values and high-level languagedescriptions and then a joint latent space of standard di-mensionality in which matching image and text features hashigh cosine similarity to explore the semantics hidden in-side a social image

(1) We propose a framework that jointly trains two dualtasks the spatial semantic of image and text-to-image synthesis which improves the supervision andthe generalization performance of social imagecaption

(2) We extend the conventional model by adding thetop-down attention mechanism With a novel tree-based clustering method it can demonstrate theeffectiveness in learning the alignments of visualconcepts of images and the semantics of texts

2 Related Works

-e related works generally fall into two categories imagetagging and relational inference methods

21 Image Tagging Image tagging has been widely studiedfor the past decade Early image tagging models are builtmainly from the view of statistics and probability Inpractice many image annotation works [10 11] assign top-kclass labels to each image the quantities of class labels indifferent photos vary significantly and the top-k annotationsdegrade the performance of image annotation Besidesmany works attempt to infer correlations or joint probabilitydistributions between images and semantic concepts (orkeywords) For example Farhadi et al [12] use detectionmethods to understand scene elements Similarly Li et al[13] start with exposures and piece a final description usingphrases containing detected objects and relationships

Further powerful language models based on languageparsing have been used as well [14] Recently deep learninghas achieved great success in the field of image text andspeech for example the m-RNN model [15] in which amultimodal component is introduced to explicitly connectthe language model and the vision model by a one-layerrepresentation

Images can label at the pixel level which has applicationsin intelligent video monitor and self-driving cars etc Morerecently the variable label number problem has beenidentified [16 17] -ese solutions treat the image anno-tation problem as an image-to-text translation problem andsolve it using an encoder-decoder model -e multiscaleapproaches [18] propose a novel multiscale deep model forextracting rich and discriminative features capable of rep-resenting a wide range of visual concepts Instead of CNNfeatures some works use more semantic information ob-tained from the image as the input to the decoder [19 20] Inall most methods still focus on recognizing objects sepa-rately -e spatial relationships between objects and scenesare always neglected

22 Relational Inference -e earliest reasoning form candate to a symbolic way and the ties between symbols areestablished in logical and mathematical language which isinterpreted in terms of deduction arithmetic and algebraAs symbolic approaches suffer from the symbol groundingproblem they are not robust to small tasks and inputvariations [21 22] Many other methods such as deeplearning in a traditional inference network the inferencepart may be multilayer Perceptron (MLP) Long Short-TermMemory (LSTM) or LSTM with attention which often runsinto the problem of weak data [23 24] Santoro proposed arelation network to achieve the reasoning part -is struc-ture clearly expresses two ideas the final answer has to dowith pairs of objects and the problem itself will influencehow to examine the objects As an active research field somerecent works have also applied neural networks to structuredgraph data or tried to standardize network output throughrelations and knowledge bases We believe that visual datareasoning should be both local and global abandoning thetwo-dimensional image structure to involve the task is notonly useful but also invalid Beyond the object detection taskthat detects relationships for an image the scene graphgeneration task [25 26] endows the model with an entirestructured representation capturing both objects and theirsemantic relationships where the nodes are object instancesin images and the edges depict their pairwise correlations-e proposed methods usually require additional annota-tions of relations while they demand only image-levelannotations

3 Cross-Modal Reasoning Framework

31 Overview of Cross-Modal Tasks and Pipeline In thispaper we focus on two image-text tasks spatial relationmodeling and image-text matching -e former refers bothto image-to-image and image-to-text and definitions of the

2 Complexity

two scenarios are straightforward given an input image(resp sentence) the goal is to find the relationships betweenentities with semantic meaning -e second task refers tofind the best matching sentences to the input images

-e architecture for the above tasks should consist offour-step pipeline as summarized here -e First step is todesign functional feature extractors we use a refined TF-IDFmethod [27] to calculate the word frequency and thencombine it with the embedding vector which helps us tomap it nonlinearly into another vector space to enhancesemantic association in words and lower its dimension -esecond step is to generate full image features and geo-graphical features We can deepen the network continu-ously getting saturated then degrades rapidly [28] -e thirdstep is to add a self-attention mechanism using the imagefeature to get the weight of words -en we combine theimage and text features together to deduce the spatial re-lation of the picture -e final step performs instance rec-ognition which is a deep semantic understanding of socialimages

At a conceptual level there are two branches to achievethe goal One is to train the network to map images and textsinto joint embedding space-e second approach is to framepictures and documents correspondence by cosine simi-larity to obtain the probability that the two items matchAccordingly we define a cross-modal reasoning frameworkto exploit image-text joint learning of social image retrievalwith spatial relation model which includes two variants oftwo-branch networks that follow these two strategies(Figure 1) the embedding network and the similaritynetwork

Two networks are needed in this framework, one for cross-modal topic detection and the other for semantic matching, and they are trained one by one. The embedding network performs cross-domain spatial relation modeling, as illustrated in Figure 1(a). The spatial relation includes both image-to-text and image-to-image, and the goal is to model spatial relationships in CNN-based detection. The similarity network is illustrated in Figure 1(b): we first pretrain on image data and text-image data, respectively, and then fine-tune on the target-domain training data via reinforcement learning. In detail, we use ResNet-50 as the CNN encoder of the framework to learn the social image annotations. By adding the cross-domain spatial relation model, the structure can attend to the crucial parts of the image when decoding different words of the caption to generate interactive effects. Joint models with image-text similarity are given by cosine similarity [29].

3.2. Cross-Domain Spatial Relation Modeling. The spatial relation between visual content and texts is presented at two levels: first, by the matching probability of the purpose, and second, by the spatial relationships between the image objects, represented by the interaction of the objects in different tasks.

3.2.1. Spatial Relation between Images and Texts. Motivated by attention-based deep methods [30, 31], we first review a basic attention module called self-attention. As illustrated in Figure 2, the input consists of queries and keys of dimension d_k and values of dimension d_v. The dot product is performed between the query and all keys to obtain their similarity, and a softmax function is applied to obtain the weights on the values. Given keys and values, the output is the weighted average over the input values:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V, \qquad
\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}\left(\mathrm{Query}, \mathrm{Key}_i\right) \ast \mathrm{Value}_i
\tag{1}
\]

In our case, we apply modifications to the output values. Specifically, we define attention as follows: the source consists of a set of two-dimensional vectors <key, value>; given an element of a target named query, we calculate the similarity or relevance between the query and each key to obtain a weight coefficient for the value corresponding to every key. The weighted sum is performed by

\[
\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}\left(\mathrm{Query}, \mathrm{Key}_i\right) \cdot \mathrm{Value}_i
\tag{2}
\]

where x is the vector of an input image or word, L_x is the length of x, Query is the image vector, Key_i is the word vector, and the dot product is used to calculate the similarity.

We apply the perceptron to calculate the weight of the word vector. In self-attention, each word can compute attention with all other terms. The aim is to learn the internal word dependence and to capture the internal structure of the text. The characteristic of self-attention is that it ignores the distance between the sentence and the image and directly computes the inner structure of the sentence. Furthermore, parallel computation is relatively simple to realize.
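To make the attention computation of equations (1) and (2) concrete, the following is a minimal NumPy sketch in which the image vector serves as the query over the word vectors; the variable names (image_vec, word_vecs) are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract the max for numerical stability
    return e / e.sum()

def image_guided_attention(image_vec, word_vecs):
    """Weight word vectors by their dot-product similarity to the image query.

    image_vec: (d,) image feature used as the query
    word_vecs: (L, d) word embeddings acting as both keys and values
    """
    scores = word_vecs @ image_vec / np.sqrt(image_vec.shape[0])  # scaled dot products
    weights = softmax(scores)                                     # attention weights over words
    return weights, weights @ word_vecs                           # weighted sum of the values

# toy usage with random vectors
rng = np.random.default_rng(0)
weights, attended = image_guided_attention(rng.normal(size=128), rng.normal(size=(9, 128)))
```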

Inspired by the idea of learning to rank [32, 33], we measure the similarity between two samples using the cosine distance:

\[
D\left(f_{x_i}, f_{x_j}\right) = \frac{f_{x_i}}{\left\|f_{x_i}\right\|_2} \cdot \frac{f_{x_j}}{\left\|f_{x_j}\right\|_2}
\tag{3}
\]

where \(\left\|\cdot\right\|_2\) denotes the L2-norm and \(D(f_{x_i}, f_{x_j}) \in [-1, 1]\). To take the two modalities into account effectively, the ranking loss can be written as

\[
L_{\mathrm{rank}} = \left[\max\left(0, \alpha - D\left(f_{I_a}, f_{T_a}\right)\right)\right]_{I\,\mathrm{anchor}} + \left[\max\left(0, \alpha - D\left(f_{I_a}, f_{T_a}\right)\right)\right]_{T\,\mathrm{anchor}}
\tag{4}
\]

where I denotes the visual input, T denotes the text input, and α is a margin. We then adopt the idea of the triplet loss [34], one of the most widely used similarity functions. The triplet loss involves (a) a sample called the Anchor (written x_a), chosen randomly from the training dataset; (b) a sample of the same class as the Anchor, called the Positive (written x_p); and (c) a sample whose class differs from the Anchor, called the Negative (written x_n). The purpose of the triplet loss is to minimize the distance between x_a and x_p and to maximize the distance between x_a and x_n. We call this kind of ranking loss the triplet ranking loss, and its formulation is

\[
L_{\mathrm{rank}}\left(x_a, x_p, x_n\right) = \max\left(0, \alpha + D\left(x_a, x_p\right) - D\left(x_a, x_n\right)\right)
\tag{5}
\]
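For concreteness, a minimal NumPy sketch of the cosine similarity of equation (3) and a triplet ranking loss in the spirit of equation (5) is given below; the margin value and the similarity convention (the anchor-positive score should exceed the anchor-negative score by the margin) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def cosine_similarity(fa, fb):
    """Equation (3): dot product of L2-normalized features, in [-1, 1]."""
    return float(np.dot(fa / np.linalg.norm(fa), fb / np.linalg.norm(fb)))

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet ranking loss in the similarity convention.

    The margin value 0.2 is an illustrative choice, not the paper's setting.
    """
    s_pos = cosine_similarity(anchor, positive)
    s_neg = cosine_similarity(anchor, negative)
    return max(0.0, margin + s_neg - s_pos)

# toy usage: image feature as anchor, matching caption as positive, random caption as negative
rng = np.random.default_rng(1)
img, cap_pos, cap_neg = rng.normal(size=(3, 128))
print(triplet_ranking_loss(img, cap_pos, cap_neg))
```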

Figure 2: Example to illustrate the process of building an attention model (each key is scored against the query by F(Q, K), the scores are normalized by a softmax, and the resulting weights are applied to the values and summed to produce the attention value).

Figure 1: The cross-modal reasoning framework. (a) Embedding network (cross-domain spatial relationship modeling): the ResNet-50 image vector and the Word2Vec/TF-IDF word vector are combined via self-attention and the spatial relation model and trained with a ranking loss (example caption: "A rock climber climbs in between two very large rocks ..."). (b) Similarity network (cross-modal matching): ResNet-50 image vectors undergo K-means++ recursive (tree-based) clustering, fingerprints are saved, and matching is performed against text such as "surfing, surfs, surfboard".


3.2.2. Spatial Relation in Image Objects. We now describe the object relation computation. The basic idea of the spatial relationships between image objects has its roots in a crucial inference: a visual or hidden object in the image space tends to concentrate for relevant purposes but distributes randomly for irrelevant ones. Let an object consist of its coordinate feature f_C and appearance feature f_A. In this work, f_C is a 4-dimensional object bounding box with relative image coordinates, and f_A is related to the task.

To be specific, given an input set of N visual or hidden objects \(\{(f_C^n, f_A^n)\}_{n=1}^{N}\), the relation feature f_R(n) of the whole set with respect to the nth object is computed as

\[
f_R(n) = \sum_{m} \omega^{mn} \cdot \left(W_V \cdot f_A^{m}\right)
\tag{6}
\]

The output is a weighted sum of appearance features from the other objects, linearly transformed by W_V, which corresponds to the values V in equation (1). The relation weight ω^{mn} indicates the impact of the other objects. W_V and ω^{mn} can be calculated by the object-relation model [35] in two steps. First, the coordinate features of the two objects are embedded into a high-dimensional representation by ResNet; inspired by the widely used bounding-box regression method DPM [36], the elements are transformed using log(·) to handle both distant and close-by objects. Second, the embedded feature is turned into a scalar weight and trimmed. The trimming operation restricts relations to objects with individual spatial relationships, which is related to the task and knowledge.
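As a rough illustration of equation (6), the sketch below computes the relation feature of one object as a weighted sum of linearly transformed appearance features; the softmax normalization of the relation weights is a simplification of the full object-relation module of [35], and all variable names are illustrative.

```python
import numpy as np

def relation_feature(appearance, weights_v, rel_logits, n):
    """f_R(n) = sum_m omega^{mn} * (W_V . f_A^m), as in equation (6).

    appearance: (N, d)  appearance features f_A of all objects
    weights_v : (d, d)  linear transform W_V applied to the values
    rel_logits: (N, N)  unnormalized relation scores; rel_logits[m, n] scores object m w.r.t. n
    """
    omega = np.exp(rel_logits[:, n])
    omega = omega / omega.sum()            # normalized relation weights omega^{mn}
    values = appearance @ weights_v.T      # W_V . f_A^m for every object m
    return omega @ values                  # weighted sum over the other objects

# toy usage with 5 objects and 64-dimensional appearance features
rng = np.random.default_rng(2)
fA, WV, logits = rng.normal(size=(5, 64)), rng.normal(size=(64, 64)), rng.normal(size=(5, 5))
fR0 = relation_feature(fA, WV, logits, n=0)
```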

The cross-domain spatial relation model has the same output dimension for images and texts, which makes it an essential building block in the framework.

3.3. Similarity Network

3.3.1. Image Representation. As the vanishing gradient problem prevents a deep network from being fully learned, we adopt ResNet-50 to avoid loss of accuracy by learning residual functions that reformulate the layers with respect to their inputs, instead of learning unreferenced functions. The building block is defined as

\[
y = f\left(x, \{w_i\}\right) + x
\tag{7}
\]

where x is the input vector and y is the output vector of the layers considered. The function f represents the residual mapping to be learned.

As a building block has two shortcut lines, we consider another option. Due to the shortcut connection and element-wise addition, the operation F + x is performed. In equation (7), the numbers of channels of x and y are equal; otherwise, we adopt

\[
y = f\left(x, \{w_i\}\right) + w_s x
\tag{8}
\]

where w_s is a convolution operation that adjusts the channel dimension of x.
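For concreteness, a minimal PyTorch-style sketch of the residual block in equations (7) and (8) is given below; this is a generic two-convolution block with a 1×1 shortcut convolution playing the role of w_s when the channel counts change, not the authors' exact ResNet-50 configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = f(x, {w_i}) + x (eq. 7), or y = f(x, {w_i}) + w_s x (eq. 8) when channels change."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(  # the residual mapping f(x, {w_i})
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # w_s: a 1x1 convolution on the shortcut when the channel dimension (or stride) changes
        self.shortcut = (nn.Identity() if in_ch == out_ch and stride == 1
                         else nn.Conv2d(in_ch, out_ch, 1, stride, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))  # element-wise addition F + x

# toy usage
y = ResidualBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56))
```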

We learn the image representation vectors from the lower convolutional layer. In this manner, the decoder can attend to specific parts of an image by selecting a subset of the feature vectors. In this study, we employ the 2048 feature maps of the 50-layer residual network.

3.3.2. Text Representation. A prevalent representation for Natural Language Processing (NLP) is one-hot coding. Unfortunately, when the output space grows, the features cannot properly span the full feature space; consequently, one-hot encoding may be insufficient for fine-grained tasks, since projecting the outputs into a higher-dimensional space dramatically increases the parameter space of the computed models. Also, for datasets with a large number of words, the ratio of samples per word is typically reduced.

A straightforward way to address the limitations mentioned above is to relax the problem into a real-valued linear programming problem and then threshold the resulting solution. We combine the word embedding vector with a frequency weight from TF-IDF [27], which reflects the occurrence frequency of a word across all texts rather than just its raw count. This allows each word to establish a global context and transforms the high-dimensional sparse vector into a low-dimensional dense vector. In addition, we use self-attention, as it enables each word to learn its relation to other words.
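A rough sketch of combining word embeddings with TF-IDF weights is given below; scikit-learn's TfidfVectorizer stands in for the refined TF-IDF of [27], and the embedding table is a placeholder for a trained Word2Vec model, so all names here are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_embeddings(corpus, embedding, dim=128):
    """Return one dense vector per sentence: the TF-IDF-weighted average of word embeddings.

    corpus   : list of sentences (strings)
    embedding: dict mapping a word to a (dim,) vector; a trained Word2Vec table in practice
    """
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(corpus)              # (num_docs, vocab) sparse TF-IDF matrix
    vocab = vec.get_feature_names_out()
    out = np.zeros((len(corpus), dim))
    for d in range(len(corpus)):
        row = tfidf[d].tocoo()                     # nonzero TF-IDF weights of sentence d
        for j, w in zip(row.col, row.data):
            out[d] += w * embedding.get(vocab[j], np.zeros(dim))
        if row.data.sum() > 0:
            out[d] /= row.data.sum()               # normalize by the total weight
    return out

# toy usage with random embeddings
rng = np.random.default_rng(3)
emb = {w: rng.normal(size=128) for w in ["man", "surfs", "in", "the", "water"]}
vecs = tfidf_weighted_embeddings(["a man surfs in the water"], emb)
```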

3.3.3. Cross-Modal Matching

Cosine Similarity. The cosine of the angle between vectors effectively avoids differences in rating scale among individuals with the same cognition and focuses on the differences between dimensions rather than differences in numerical values. We use the extracted image feature vector to allocate the weights of the word vectors, rather than obtaining the word weights from the text alone. We then take the dot product of the resulting weights and the corresponding word vectors with the image vector to obtain the similarity between image and text. This allows the semantics of the image and the text to interact.

Tree-Based Clustering Vector Quantization Algorithm. Let x ∈ R^d be a d-dimensional instance and y ∈ {1, 2, ..., q} be the class space with q possible classes. Note that two tasks are needed here: tree-based training and retrieval. After obtaining the features of the images and texts, the image and text lie in the same vector space, and we can apply a scalable K-means++ clustering algorithm [37] to both image and text vectors.

We develop a tree-based algorithm for cross-modal learning, which presents a tree-based classification model for multiclass learning. A tree-based structure is constructed in which the root node corresponds to the classes in the training dataset. Each node v contains a set of k-means++ classifiers. The top node contains all the training data; at the top level, the whole dataset is partitioned into five data subsets A, B, C, D, and E. The instances are recursively partitioned into smaller subsets while moving down the tree, and each internal node contains the training instances of its child nodes. In particular, each node in the tree contains two components: a cluster vector and a predictive class vector.


The cluster vector is a vector of real values measuring the clusters at a node, and we adopt the definition

\[
p_v(n) = \frac{\sum_{x_i \in D_v} C_{in}}{\left|D_v\right|}
\tag{9}
\]

where p_v(n) is the nth component of p_v, |D_v| is the total number of instances in node v, and \(C_i = \{Y_{i1}, Y_{i2}, \ldots, Y_{iq}\} \in \{0, 1\}^q\). The predictive class vector is a vector of boolean values indicating the membership of predictive classes at a node; its value is 1 when p_v(n) is larger than the threshold, which implies that the node is the proper class or subclass for the instances of node v. Note that C_i and the threshold are obtained from the training process.

The algorithm uses three stopping criteria to determine when to stop growing the tree: (1) a node is a leaf node if it is identified as a predictive class; (2) a node is a leaf node if its data cannot be partitioned into more than three clusters by the classifiers; (3) the tree reaches the maximum depth.

Figure 3 shows the flowcharts of the training and retrieval steps. We save the tree model and compute the position of each real image or text among the leaf nodes; the path is the picture's fingerprint, and once all picture fingerprints are saved, the tree model is built. When matching, it is likewise necessary to extract the image features first. Starting from the root node, each clustering model recursively predicts the category of the vector at a different level. After the image has fallen to a leaf node, the path of the leaf node is output as the fingerprint of the picture. Cosine distance is then used to find the text with the same fingerprint, and sorting it yields the annotation. The process is shown in Algorithms 1 and 2.

4. Experiments and Results

4.1. Datasets and Evaluation. We use Flickr30k and Microsoft COCO, which are widely used in caption-image retrieval and image caption generation tasks, to evaluate the proposed method. As a popular benchmark for annotation generation and retrieval tasks, Flickr30k contains 31,783 images, focusing mainly on people and animals, and 158,915 English captions (five per image). We randomly select 30,783 images for training and another 1,000 for the testing split.

Microsoft COCO is the largest existing dataset with both captions and region-level annotations; it consists of 82,783 training images and 40,504 validation images, and five sentences accompany each image. We use a total of 82,783 images for training, with a testing split of 5,000 images. The testing split consists of images with more than three instances, selected from the validation dataset. For each testing image, we use the publicly available division that is commonly used in caption-image retrieval and caption generation tasks.

4.2. Implementation Details. We use ResNet-50 to extract 2048 feature maps of size 1×1 from "conv5_3", which helps us integrate the high-level features with the lower ones; the visualization results are provided in Figure 4. By adding a fully connected layer of dimension 128, we obtain the output of the graph embedding, named v1. In the text representation, the input layer is followed by two fully connected layers with dimensions d1 and d2, respectively. The output of the last fully connected layer is the output of the text module embedding, named v2. We add a self-attention mechanism to the two embedding networks and compute v1′ and v2′. Then v1′ and v2′ are connected to a triplet ranking loss. Further, we use the Adam optimizer to train the system.

After offline debugging, we finally set the length of the input layer of the text module to 100,000. After counting the word occurrences over all sentence segmentation words, we keep the 99,999 most frequent words and treat the remaining words as a single new word; in total, the number of neurons in the input layer is 100,000.

For the text domain, the second layer has 512 neurons and the third has 128; meanwhile, the length of the graph embedding is also 128. We set the margin α (in equations (4) and (5)), and the parameters of the Adam algorithm are as follows: lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e−08, decay = 0.0.
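For reference, these settings correspond to a standard Adam configuration; a Keras-style sketch is shown below (the framework choice is an assumption on our part, not stated in the paper).

```python
from tensorflow import keras

# Adam with the settings listed above; they coincide with the commonly used defaults
optimizer = keras.optimizers.Adam(
    learning_rate=0.001,  # lr
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-08,
)
# decay = 0.0 means no additional learning-rate decay schedule is applied
```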

To speed up training, we conducted negative sampling. The sample sentence that matches the current image is the positive sample. To thoroughly train low-frequency sentences, we randomly select sentences outside the sample as negative samples.

4.3. Results of Experiments. Microsoft COCO consists of 123,287 images and 616,767 descriptions, with five text descriptions per image. We randomly select the public 82,783 images for training and keep 5,000 images as test data. As shown in Figure 5, our loss function is slightly better than the others once the number of iterations exceeds 400. The softmax loss is good at optimizing the distance between classes but weak at optimizing the range within a category; the triplet ranking loss addresses this problem to some extent.

As shown in Figure 6, words such as "climbs", "rests at", and "in the", which express spatial relationships between objects, obtain significantly higher scores than other words, while words like "the" and "a" always have lower matching scores. However, some prepositions containing positional relations, such as "by" in the right-upper corner, have higher scores than other prepositions.

The use of attention causes features related to spatial relation traits to be weighted. As shown in Figure 7, for the image in the right-upper corner, we can infer from the coins in the cup next to the person lying down that he is begging.

4.4. Comparison with the State-of-the-Art

4.4.1. Qualitative Evaluation. To evaluate the effectiveness of the proposed method, we first compare its classification performance with the state-of-the-art performance reported in the literature. We manually labelled the ground-truth mapping from nouns to object classes. By varying the threshold, the precision-recall curve can be sketched to measure accuracy. We use the commonly adopted "R1", "R5", and "R10", i.e., recall rates at the top 1, 5, and 10 results.
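Recall at the top K results can be computed as in the following sketch: a query image counts as a hit if any of its ground-truth captions appears among the K highest-scoring captions. The data layout (a score matrix and per-image ground-truth sets) is an illustrative assumption.

```python
import numpy as np

def recall_at_k(similarity, gt, k):
    """similarity: (num_images, num_captions) score matrix
    gt: list of sets; gt[i] holds the indices of captions that describe image i
    Returns the fraction of images with at least one correct caption in the top k."""
    hits = 0
    for i, scores in enumerate(similarity):
        top_k = np.argsort(-scores)[:k]              # indices of the k highest-scoring captions
        hits += bool(gt[i] & set(top_k.tolist()))    # hit if any ground-truth caption is retrieved
    return hits / len(similarity)

# toy usage: 2 images, 4 captions
sim = np.array([[0.9, 0.1, 0.3, 0.2], [0.2, 0.8, 0.1, 0.7]])
print(recall_at_k(sim, [{0}, {1}], k=1))   # 1.0
```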


Figure 3: Flowchart of training and searching steps for the tree-based algorithm. (a) Training: image features from the raw input image are extracted by ResNet, clustered recursively until the stopping criteria are met, and the fingerprints are saved. (b) Searching: image features from the raw input image are extracted by ResNet, class prediction proceeds down to a leaf node, the fingerprint is output, similar fingerprints are searched, and results are ordered by the computed distance.

ALGORITHM 1: Build fingerprints with a cluster tree.
Input: M, a social image; D, the current training instances; N, cluster numbers; R, a set of features to represent an image; FP, fingerprints for the image; T, cluster tree.
Output: the fingerprints for M.
(1) Initialize N, R, T, FP and resize M
(2) Compute R by ResNet to represent M
(3) Repeat
(4)   Create a node v
(5)   Update R according to N clusters of M with k-means++
(6)   Let Dj be the set of data in D satisfying outcome j
(7)   Attach the node returned by T(Dj) to node v
(8) Until Stop(D, R) is true
(9) Return FP as a fingerprint for the image and attach leaf nodes to FP

ALGORITHM 2: Find top-k similar images by fingerprints.
Input: M, a social image; MD, image dataset for search; N, number of images in the dataset; FP, fingerprint for the image.
Output: S, top-k similar images for image M.
(1) Initialize M, MD, FP, S and resize images
(2) M and MD are represented as R by ResNet
(3) Run the checking procedure to find the top-k similar images to M in MD
(4) For each image from MD, get a cosine similarity by the FP of the image pair
(5) Return the top-k similar images ordered by cosine similarity
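A rough Python sketch of the recursive k-means++ fingerprinting of Algorithm 1 is given below; scikit-learn's KMeans (with k-means++ initialization) stands in for the scalable K-means++ of [37], and the stopping rules are simplified to a maximum depth and a minimum node size, so the parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_fingerprints(features, n_clusters=5, max_depth=3, min_size=10, _prefix=()):
    """Recursively cluster feature vectors and return a path ("fingerprint") per sample.

    features : (num_samples, d) array of ResNet image features (or text vectors)
    Returns  : list of tuples; each tuple is the sequence of cluster ids from root to leaf.
    """
    n = len(features)
    if len(_prefix) >= max_depth or n < max(min_size, n_clusters):
        return [_prefix] * n                               # leaf node: stop splitting
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10).fit(features)
    fingerprints = [None] * n
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        if len(idx) == 0:
            continue
        sub = build_fingerprints(features[idx], n_clusters, max_depth, min_size, _prefix + (c,))
        for i, fp in zip(idx, sub):
            fingerprints[i] = fp                           # record the path of each sample
    return fingerprints

# toy usage with random 2048-dimensional "ResNet" features
rng = np.random.default_rng(4)
fps = build_fingerprints(rng.normal(size=(200, 2048)))
```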


Figure 4: The visualization of ResNet-50. (a) After 50 layers of convolution. (b) Fusion by a 1:1 ratio.

Figure 5: Recall curves when training on Microsoft COCO (ranking loss, center loss, and softmax; x-axis: iterations ×10; y-axis: recall).

Figure 6: Similarity matching between image and text. (a) A −0.20, Rock 0.71, Climber 0.29, Climbs 0.32, In −0.12, Between 0.03, Two −0.13, Very −0.07, Large 0.08, Rocks 0.73. (b) A −0.15, Group 0.06, Of −0.09, Mountain 0.67, Climbers 0.51, Rests 0.54, At −0.11, The −0.08, Summit 0.18. (c) Two 0.03, Boys 0.35, In −0.09, A −0.12, Field 0.57, Kicking 0.41, A −0.12, Soccer 0.22, Ball 0.19. (d) A −0.17, Man 0.46, Surfs 0.54, In −0.12, The −0.26, Water 0.21.


As shown in Tables 1 and 2, we compare our proposed method with some of the latest and most advanced methods on the Flickr30k and Microsoft COCO datasets. We can find that our approach works better than the compared methods. Different from DSPE+FV*, which uses external text corpora to learn discriminative sentence features, our model learns them directly from scratch in an end-to-end manner. When comparing among our own variants, we can conclude the following: (a) our attention scheme is effective, since the model with attention consistently outperforms those without attention on both datasets; (b) using ResNet as the underlying network helps to understand the deep semantics of images and to obtain the spatial relation features with the help of text context relations.

On the Microsoft COCO dataset, annotation matching takes 235 seconds with a brute-force algorithm and 18 seconds with the tree-based clustering algorithm.

4.4.2. Quantitative Evaluation. We also report quantitative evaluation results with the frequently used BLEU metric [46] on the same datasets. The results for the Microsoft COCO and Flickr30K datasets are listed in Tables 3 and 4. The image encoders of all methods listed here are either VGG-Net or ResNet, which are prevalent in this field. We also report an ablation test in which the spatial relation model is discarded.
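BLEU scores of this kind can be computed, for example, with NLTK's corpus_bleu, as sketched below; the tokenization and the toy captions are illustrative, not the paper's evaluation script.

```python
from nltk.translate.bleu_score import corpus_bleu

# one generated caption evaluated against its five reference captions (toy example)
references = [[
    "a man surfs in the water".split(),
    "a surfer rides a wave".split(),
    "a man riding a surfboard on a wave".split(),
    "a person surfing in the ocean".split(),
    "someone on a surfboard in the sea".split(),
]]
candidates = ["a man surfs in the ocean".split()]

bleu1 = corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0))            # BLEU-1
bleu4 = corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25))  # BLEU-4
```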

The results demonstrate that our method performs better than the variant that discards the spatial relation model.

Figure 7: Examples of attending to the correct object (white indicates the attended regions; words below the figure indicate the labelling of spatial relation features). (a) Harvest. (b) Begging. (c) Search. (d) Salvage.

Table 1: Results of caption-image retrieval on the Flickr30K dataset.

Method                      R1     R5     R10
RVP (T+I) [38]              11.9   27.7   47.7
CDCCA [39]                  16.8   39.3   53.0
MNLM [40]                   23.0   50.7   62.9
BRNN [41]                   22.2   48.2   61.4
DSPE+FV* [42]               40.3   68.9   79.9
m-RNN [15]                  33.9   65.1   76.3
m-CNN [43]                  34.6   64.7   76.8
Softmax-VGG                 21.8   50.7   61.7
Softmax-ResNet              22.3   52.9   63.2
Softmax-ResNet-Attention    35.6   65.4   75.6
Ranking-ResNet-Attention    41.3   69.1   81.3

*Using external text corpora. The ensemble or multimodel methods.

Table 2: Results of caption-image retrieval on the Microsoft COCO dataset.

Method                      R1     R5     R10
DSPE+FV*                    49.4   76.1   86.4
m-RNN                       41.7   72.7   83.5
Sg-LSTM [44]                42.8   73.9   84.7
MNLM                        43.9   75.7   85.8
DVSA [45]                   39.2   69.9   80.5
m-CNN                       43.1   73.9   84.1
Softmax-VGG                 47.5   74.9   84.3
Softmax-ResNet              49.1   76.1   85.2
Softmax-ResNet-Attention    50.3   77.5   87.5
Ranking-ResNet-Attention    52.7   79.3   88.7

*Using external text corpora. The ensemble or multimodel methods.

Table 3: The results of different caption methods on the Microsoft COCO.

Method                       Bleu-1   Bleu-2   Bleu-3   Bleu-4
SCA-CNN [19]                 0.714    0.543    0.409    0.309
Semantic ATT [30]            0.699    0.415    0.306    0.232
SCN-LSTM [20]                0.726    0.581    0.421    0.318
Softmax-ResNet-Attention     0.736    0.587    0.423    0.325
Ranking-ResNet-Attention     0.741    0.598    0.438    0.319
Softmax-ResNet-Attention*    0.703    0.524    0.413    0.311
Ranking-ResNet-Attention*    0.705    0.540    0.416    0.313

*Discarding spatial relation model.

Table 4: The results of different caption methods on the Flickr30K.

Method                       Bleu-1   Bleu-2   Bleu-3   Bleu-4
SCA-CNN [19]                 0.705    0.515    0.402    0.304
Semantic ATT [30]            0.694    0.427    0.301    0.224
SCN-LSTM [20]                0.718    0.569    0.413    0.308
Softmax-ResNet-Attention     0.715    0.573    0.429    0.325
Ranking-ResNet-Attention     0.729    0.581    0.416    0.319
Softmax-ResNet-Attention*    0.706    0.533    0.406    0.301
Ranking-ResNet-Attention*    0.712    0.542    0.418    0.309

*Discarding spatial relation model.


Besides, our approach is slightly better than the compared state-of-the-art methods, which verifies the efficiency of topic-conditioning in image captioning. Note that our model uses ResNet-50 as the encoder with a simple attention model; thus, our approach is competitive with those models.

5. Discussion and Conclusion

This paper proposes an integrated model to recognize instances and objects jointly by leveraging the associated textual descriptions and presents a learning algorithm to estimate the model efficiently. The learning process requires only separate images and texts, without high-level captions. We use the residual network to deepen the learning of image semantics and combine it with the text to obtain hidden relation features contained in the picture. By constructing a joint inference module with self-attention, we fuse local and global elements. We also show that integrating images and text enables deep semantic understanding for labelling the spatial relation features. Furthermore, the use of a tree clustering algorithm accelerates the matching process. Experiments verify that the proposed method achieves competitive results on two generic annotation datasets, Flickr30k and Microsoft COCO.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The project was sponsored by the National Natural Science Foundation of China (41571401 and 41871226) and the Natural Science Foundation of Chongqing, China (cstc2019jcyj-msxmX0131).

References

[1] J. Zhu, L. Fang, and P. Ghamisi, "Deformable convolutional neural networks for hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 8, pp. 1254–1258, 2018.
[2] W. Lu, J. Li, T. Li, W. Guo, H. Zhang, and J. Guo, "Web multimedia object classification using cross-domain correlation knowledge," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1920–1929, 2013.
[3] M. Katsurai, T. Ogawa, and M. Haseyama, "A cross-modal approach for extracting semantic relationships between concepts using tagged images," IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1059–1074, 2014.
[4] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei, "Referring relationships," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6867–6876, Salt Lake City, UT, USA, June 2018.
[5] B. Wang, D. Lin, H. Xiong, and Y. F. Zheng, "Joint inference of objects and scenes with efficient learning of text-object-scene relations," IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 507–520, 2016.
[6] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, "Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.
[7] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, "What value do explicit high level concepts have in vision to language problems?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 203–212, Las Vegas, NV, USA, June 2016.
[8] M. J. Choi, A. Torralba, and A. S. Willsky, "A tree-based context model for object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 240–252, 2012.
[9] R. Mottaghi, X. Chen, X. Liu et al., "The role of context for object detection and semantic segmentation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014.
[10] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the Computer Vision and Pattern Recognition, pp. 2960–2968, Las Vegas, NV, USA, June 2016.
[11] J. Johnson, L. Ballan, and L. Fei-Fei, "Love thy neighbors: image annotation by exploiting image metadata," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4624–4632, Tampa, FL, USA, December 2015.
[12] A. Farhadi, M. Hejrati, M. A. Sadeghi et al., "Every picture tells a story: generating sentences from images," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Heidelberg, 2010.
[13] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proceedings of the Conference on Computational Natural Language Learning, Portland, OR, USA, June 2011.
[14] M. Mitchell, X. Han, J. Dodge et al., "Generating image descriptions from computer vision detections," in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 2012.
[15] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in Proceedings of the ICLR, pp. 1–17, San Diego, CA, USA, May 2015.
[16] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the CVPR, pp. 2960–2968, Las Vegas, NV, USA, June 2016.
[17] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun, "Semantic regularisation for recurrent image annotation," 2016, https://arxiv.org/abs/1611.05490.
[18] Y. Niu, Z. Lu, J.-R. Wen, T. Xiang, and S.-F. Chang, "Multi-modal multi-scale deep learning for large-scale image annotation," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1720–1731, 2019.
[19] L. Chen, H. Zhang, J. Xiao et al., "Spatial and channel-wise attention in convolutional networks for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6298–6306, Honolulu, HI, USA, July 2017.
[20] Z. Gan, "Semantic compositional networks for visual captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1141–1150, Honolulu, HI, USA, July 2017.


[21] A. Santoro, D. Raposo, D. G. Barrett et al., "A simple neural network module for relational reasoning," pp. 4967–4976.
[22] S. Harnad, "The symbol grounding problem," Physica D: Nonlinear Phenomena, vol. 42, no. 1–3, pp. 335–346, 1990.
[23] M. Garnelo, K. Arulkumaran, and M. Shanahan, "Towards deep symbolic reinforcement learning," 2016, http://arxiv.org/abs/1609.05518.
[24] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, "Building machines that learn and think like people," Behavioral and Brain Sciences, vol. 40, 2017.
[25] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, "Scene graph generation by iterative message passing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3097–3106, Honolulu, HI, USA, July 2017.
[26] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, "Scene graph generation from objects, phrases and region captions," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1270–1279, Venice, Italy, October 2017.
[27] I. Yahav, O. Shehory, and D. Schwartz, "Comments mining with TF-IDF: the inherent bias and its removal," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 3, pp. 437–450, 2019.
[28] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, July 2016.
[29] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: a unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, Boston, MA, USA, 2015.
[30] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659, Las Vegas, NV, USA, June 2016.
[31] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, http://arxiv.org/abs/1409.0473.
[32] A. Karpathy, A. Joulin, and L. F. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 1889–1897, Montreal, Canada, 2014.
[33] H. Nam, J. Ha, and J. Kim, "Dual attention networks for multimodal reasoning and matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2156–2164, Honolulu, HI, USA, July 2017.
[34] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[35] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, Salt Lake City, UT, USA, June 2018.
[36] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, 2010.
[37] B. Bahmani, "Scalable K-means++," in Proceedings of the VLDB Endowment, 5(7), pp. 622–633, Istanbul, Turkey, August 2012.
[38] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," 2014, http://arxiv.org/abs/1411.2539.
[39] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, "Category-based deep CCA for fine-grained venue discovery from multimodal data," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, 2019.
[40] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," Computer Vision-ECCV 2016, pp. 694–711, 2016.
[41] M. Sinecen, "Comparison of genomic best linear unbiased prediction and bayesian regularization neural networks for genomic selection," IEEE Access, vol. 7, pp. 79199–79210, 2019.
[42] L. Wang, Y. Li, and S. Lazebnik, "Learning deep structure-preserving image-text embeddings," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, June 2016.
[43] Y. Y. Zhang, D. S. Zhou et al., "Single-image crowd counting via multi-column convolutional neural network," in Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 589–597, Las Vegas, NV, USA, June 2016.
[44] Y. Xian and Y. Tian, "Self-guiding multimodal LSTM - when we do not have a perfect training dataset for image captioning," IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5241–5252, 2019.
[45] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664–676, 2017.
[46] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the Conference of the Association for Computational Linguistics, Philadelphia, PA, USA, 2002.

Complexity 11

Page 3: Image-TextJointLearningforSocialImageswithSpatial ...Sep 27, 2019  · two scenarios are straightforward: given an input image (resp.sentence),thegoalistofindtherelationshipsbetween

two scenarios are straightforward given an input image(resp sentence) the goal is to find the relationships betweenentities with semantic meaning -e second task refers tofind the best matching sentences to the input images

-e architecture for the above tasks should consist offour-step pipeline as summarized here -e First step is todesign functional feature extractors we use a refined TF-IDFmethod [27] to calculate the word frequency and thencombine it with the embedding vector which helps us tomap it nonlinearly into another vector space to enhancesemantic association in words and lower its dimension -esecond step is to generate full image features and geo-graphical features We can deepen the network continu-ously getting saturated then degrades rapidly [28] -e thirdstep is to add a self-attention mechanism using the imagefeature to get the weight of words -en we combine theimage and text features together to deduce the spatial re-lation of the picture -e final step performs instance rec-ognition which is a deep semantic understanding of socialimages

At a conceptual level there are two branches to achievethe goal One is to train the network to map images and textsinto joint embedding space-e second approach is to framepictures and documents correspondence by cosine simi-larity to obtain the probability that the two items matchAccordingly we define a cross-modal reasoning frameworkto exploit image-text joint learning of social image retrievalwith spatial relation model which includes two variants oftwo-branch networks that follow these two strategies(Figure 1) the embedding network and the similaritynetwork

Two networks are needed in this framework one forcross-modal topic detection and the other for semanticmatching which is trained one by one -e embeddingnetwork refers to cross-domain spatial relation modeling asillustrated in Figure 1(a) -e spatial relation includes bothimage-to-text and image-to-image and the goal is to modelspatial relationships in CNN based detection -e similaritynetwork is illustrated in Figure 1(b) we first pretrain imagedata and text-image data respectively and then fine-tuned onthe target domain training data via reinforcement learning Indetail we use ResNet-50 as the CNN encoder of the frame-work to learn the social image annotations By adding thecross-domain spatial relation model the structure can attendthe crucial parts of the image when decoding different wordsof the caption to generate interactive effects Joint models withimage-text similarity are given by cosine similarity [29]

32 Cross-Domain Spatial Relation Modeling -e spatialrelation between visual content and texts is presented as twolevels First it represented by the matching probability of thepurpose and second by the spatial relationships between theimage objects represented by the interaction of the objects indifferent tasks

321 Spatial Relation between Images and TextsMotivated by attention-based deep methods [30 31] we firstreview a basic attention module called self-attention As

illustrated in Figure 2 the input consists of queries and keysof dimensions dk and dv -e dot product is performedbetween the question and all keys to obtain their similarityA softmax function is applied to obtain the weights on thevalues Given keys and values the output value is theweighted average over input values

Attention(Q K V) softmaxQKT

dk

11139681113888 1113889V

Attention(Query Source) 1113944

Lx

i1Similarity QueryKeyi( 1113857lowastValuei

(1)

In our case we apply modifications to the output valuesSpecially we define attention as follows the source consistsof a set of two-dimensional vectorslt key valuegt given anelement of a target named query calculating the similarity orrelevance of the question and each to get a weight coefficientof every key corresponding to the value-e weighted sum isperformed by

Attention(Query Source) 1113944Lx

i1Similarity QueryKeyi( 1113857

middot Valuei

(2)

where x is the vector of an input image or word Lx is thelength of x Query is the image vector keyiis the word vectorand the dot product is performed to calculate similarity

We apply the perceptron to calculate the weight of theword vector In self-attention each word can compute allterms with attention -e aim is to learn the internal worddependence and to capture the internal structure of the text-e characteristics of self-attention lie in ignoring the dis-tance of sentence to an image directly calculating its innerstructure to study the internal structure of a sentenceFurther the realization of parallel computing can also berelatively simple

Inspired by the thought of learning to rank [32 33] wemeasure the similarity between two samples using cosinedistance

D fxi fxj

1113874 1113875 fxi

fxi

2

middotfxj

fxj

2

(3)

wheremiddot2means L2-norm and (fxi fxj

) isin [minus1 1] for ef-fectively taking two modalities into account and the rankingloss can be written as

Lrank max 0 a minus D fIa fTa( 1113857( 11138571113858 1113859Ianchor

+ max 0 a minus D fIa fTa( 1113857( 11138571113858 1113859Tanchor(4)

where I denotes the visual input T denotes the text inputand α is a margin -en we use the idea of triplet loss [34]which is one of the very widely used similarity functionsTriplet loss function consists of (a) a sample called Anchor(write to xa) that is chosen randomly from the training dataset

Complexity 3

(b) a same-class sample as Anchor called Positive (write to xp)and (c) an example that class is different from Anchor calledNegative (write toxn)-e purpose of triplet loss is tominimizethe distance between xa and xp and maximize the distance

between xa and xn We call this kind of ranking loss the tripletranking loss and the formulation representation is

Lrank xa xp xn1113872 1113873 max 0 α + D xa xp1113872 1113873 minus D xa xn( 11138571113872 11138731113960 1113961

(5)

Key 1

F(QK)

S1

Key 2

F(QK)

S2

Key 3

F(QK)

S3

Key i

F(QK)

Si

Somax()

A1 A2 A3 Ai

Value 1 Value 2 Value 3 Value i

+ + + +Attention value

Query1

2

3lowast lowast lowast lowast

Figure 2 Example to illustrate the process of building an attention model

A rock climber climbs in between two very large rocks hellip

ResNet-50

Embedding network(Cross-domain spatial relationship modeling)

Word2Vec TF-IDF

Word vectorImage vector

Image and wordvector

Self-attention

Spatial relationmodel

Image and word vectorwith spatial relation model

Ranking loss

(a)

Similarity network(Cross-modal matching)

Word vectorImage vector

Kmeans++recursive clustering

Savefinger_prints

ResNet-50

Imagevector

Surfing surfs surfboard

Matching

Tree-based clustering

(b)

Figure 1 -e cross-modal reasoning framework (a) Embedding network (b) Similarity network

4 Complexity

322 Spatial Relation in Image Objects We now describeobject relation computation -e basic idea of the spatialrelationships between image objects have their roots in thecrucial inference a visual or hidden object in the image spacetends to concentrate for relevant purposes but distributerandomly for irrelevant ones Let an object consists of itscoordinate feature fC and appearance feature fA In thiswork fC is a 4-dimensional object bounding box withrelative image coordinates and fA is related to the task

To be specific given input set of N visual or hiddenobjects (fn

C fnA)1113864 1113865

N

n1 the relation feature fR(n) of the wholesets concerning the nth object is computed as

fR(n) 1113944m

ωmnmiddot WV middot f

mA( 1113857 (6)

-e output is a weighted sum of appearance featuresfrom other purposes linearly transformed by WV which iscorresponding to values V in equation (1) -e relationweight ωmn indicates the impact of other objects WV andωmn can be calculated by the object-relations model [35]-ere are two steps First the coordinate features of the twoobjects are embedded in a high-dimensional representationby ResNet Inspired by the widely used bounding box re-gression method DPM [36] the elements are transformedusing log(middot) to calculate distant objects and close-by objectsSecond the embedded feature is turned into a scalar weightand trimmed -e trimming operation restricts relationsonly between objects of individual spatial relationshipswhich is related to the task and knowledge

-e cross-domain spatial relation model has the samedimension of images and texts at the output which can be anessential building block in the framework

33 Similarity Network

331 Image Representation As the gradient vanishingproblem prevents the deep network from being fully learnedwe adopt the ResNet-50 to avoid loss of accuracy by learningresidual functions to reformulate the layers concerning theinputs instead of learning unreferenced functions -ebuilding block is defined as

y f x wi1113864 1113865( 1113857 + x (7)

where x is the input vector and y is the output vector of thelayers considered -e function f is on behalf of the residualmapping to learn

As a building block has two line shortcuts we consideranother option Due to the shortcut connection and ele-ment-wise addition it operates F + x In equation (7) thenumbers of channels of x and y are equal If it is the oppositesituation we adopt the calculation

y f x wi1113864 1113865( 1113857 + wsx (8)

where ws is the operation of convolution to adjust thechannel dimension of x

We learn the image representation vectors from thelower convolutional layer In this manner the decoder canattend to specific parts of an image by selecting a subset of

the feature vectors In this study we employed 2048 featuremaps of the 50-layer residual network

332 Text Representation Prevalent model architecture forNatural Language Processing (NLP) is one-hot codingUnfortunately when the output space grows featurescannot properly span the full feature space consequentlyone-hot encoding might result insufficient for fine-grainedtasks since the projection of the outputs into a higher-di-mensional space dramatically increase the parameter spaceof computed models Also for datasets with a large numberof words the ratio of samples per word is typically reduced

A straightforward way to solve the limitations men-tioned above is to relax the problem into a real-valued linearprogramming problem and then threshold the resultingsolution We combine the vector of word embedding andfrequency by TF-IDF [27] which depicts the occurrencefrequency of a word in all texts not just the number ofoccurrences It allows each word to establish a global contextand transforms the high-dimensional sparse vector into alow-dimensional dense vector Specially we use self-atten-tion as it will enable each word to learn its relation to otherwords

333 Cross-Modal Matching

Cosine Similarity -e angle between cosines can effectivelyavoid the differences of degrees in the same cognition ofindividuals and pay more attention to the differences be-tween dimensions rather than the differences in numericalvalues We use the extracted image feature vector to allocatethe weight of the word vectors rather than getting the weightof the word from the text -en we use the resulting weightand the corresponding word vector to do dot multiplicationwith image vector to get the similarity between image andtext -is allows the semantics of the image and text tointeract

Tree-based Clustering Vector Quantization Algorithm Letx Rd be the d-dimensional instance and y 1 2 q1113864 1113865 bethe class space with a set of q possible classes Note that twotasks are needed in this term tree-based training and re-trieval After obtaining the features of the images and textsthe image and text are in the same vector space and we canuse a scalable K-means++ clustering algorithm [37] to bothimage and text vector

We develop a tree-based algorithm for cross-modallearning which presents a tree-based classificationmodel formulticlass learning A tree-based structure is constructedwhere the root node corresponds to the classes in thetraining dataset Each node v contains a set of k-means++classifiers -e top node contains all the training data At thetop level the whole dataset is partitioned into five datasubsets A B C D E -e instances are recursively par-titioned into smaller subsets while moving down the treeEach internal node contains the training instances of itschild nodes Especially each node in the tree contains twocomponents cluster vectors and predictive class vectors

Complexity 5

-e cluster vector is a vector with real values to measureclusters at a node and we adopt the definition as

pv(n) 1113936xiisinDV

Cin

Dv

11138681113868111386811138681113868111386811138681113868

(9)

wherepv(n) is thenth component ofpv |Dv| is the total numberof instances in node v Ci Yi1 Yi2 Yiq1113966 1113967 isin 0 1 q-e predictive class vector is a vector with boolean values in-dicating themembership of predictive classes at a node-evalueis 1 when pv(n) is larger than the threshold It implies that thenode is the proper class or subclass for instances of node v Notethat Ci and the threshold are obtained from the training process

-e algorithm uses three stopping criteria to determinewhen to stop growing the tree (1) A node is a leaf node ifidentified as a predictive class (2) A node is a leaf node if thedata cannot be partitioned to more than three clusters usingthe classifiers (3) -e tree gets the max depth

Figure 3 shows the flowchart of the training and retrievalsteps We save the tree model we calculate the position ofthe real image or text in the leaf nodes the path is thepictured fingerprint we save all picture fingerprints andthen the tree model is built When matching it is alsonecessary to extract image features first Starting from theroot node each clustering model recursively predicts thecategory of the vector at a different level After the image hasfallen to the leaf nodes output the path of the leaf nodes asthe fingerprint of the picture Use cosine distance to find thesame text with the fingerprint and sort it to get the an-notation -e process is shown in the Algorithms 1 and 2

4 Experiments and Results

41 Datasets and Evaluation We use the Flickr30k andMicrosoft COCO to evaluate our proposed method which iswidely used in caption-image retrieval and image captiongeneration tasks As a popular benchmark for annotationgeneration and retrieval tasks Flickr30k contains 31783images focusing mainly on people and animals and 158915English captions (five per image) We select 30783 imagesrandomly for training and another 1000 for testing splits

Microsoft COCO is the largest existing dataset with bothcaptions and region-level annotations are which consists of82783 training images and 40504 validation images and fivesentences accompany each image We use a total of 82783images for training and testing splits with 5000 images -etesting splits are images with more than three instanceswhich are selected from the validation dataset For eachtesting image we use the publicly available division which iscommonly used in the caption-image retrieval and captiongeneration tasks

42 Implementation Details We use the ResNet-50 to ex-ploit 2048 feature maps with a size of 1times1 in ldquoconv5_3rdquowhich helps us to integrate the high-level features with thelower ones and the visualization results are provided inFigure 4 By adding a full-connection layer of which di-mension is 128 we name the output of graph embedding v1In the text representation the input layer is followed by two

full-connection layers with dimensions d1 and d2 respec-tively -e output of the last full-connection layer is theoutput of text module embedding and we name it v2 Weadd a self-attention mechanism to two embedding networksand calculate to get v1prime v2prime -en v1prime and v2prime are connected to atriplet ranking loss Further we use Adam optimizer to trainthe system

After offline debugging we finally set the parameter ofthe input layer length of the text module to be 100000 Afterthe statistics of the word occurrence number of all sentencesegmentation words we take the word occurrence numberof the first 99999 and the rest words as a new word In totalthe number of neurons of the input layer is 100000

As for the text domain the second neuron number is512 the third is 128 Meanwhile the length of the graphembedding is also 128 We set the margin (in equation (6))and the parameters of Adamrsquos algorithm are as followslr 0001 beta_1 09 beta_2 0999 epsilon 1eminus08decay 00

For speeding up the training we conducted negativesampling -e sample sentence that matches the currentimage is positive To thoroughly train the low-frequencysentences we randomly select the part of the sentenceoutside the sample as negative samples

43 Result of Experiments Microsoft COCO consists of123287 images and 616767 descriptions Five text descrip-tions accompany each image We randomly select thetraining with public 82783 images and remain 5000 imagesas test data We can see that our loss function is slightlybetter than others in Figure 5 when the number of iterationswas over 400 -e Softmax loss optimizes the distance be-tween classes being great but it is weak when it comes tooptimizing the range within categories -e triplet rankingloss addresses this problem to some extent

As is shown in Figure 6 we can observe that the wordssuch as ldquoclimbsrdquo ldquorest atrdquo ldquoin therdquo which show the spatialrelationships between objects get significantly higher scoresthan other words while things like ldquotherdquo and ldquoardquo always havelower matching scores However some prepositions con-taining positional relations such as ldquobyrdquo in the right-uppercorner have higher scores than other prepositions

-e use of attention causes features related to spatialrelation traits to be weighted As is shown in Figure 7 such asthe image in the right-upper corner we can infer from thecoins in the cup next to the person lying down that he isbegging

44 Comparison with the State-of-the-Art

441 Qualitative Evaluation To evaluate the effectivenessof the proposed method we first compare its classificationperformance with state-of-the-art performance reported inthe literature We manually labelled the ground truthmapping from the nouns to the object classes By varying thethreshold the precision-recall curve can be sketched tomeasure accuracy We commonly used ldquoR1rdquo ldquoR5rdquo andldquoR10rdquo ie recall rates at the top 1 5 and 10 results

6 Complexity

Raw images input

Image features frominput image by ResNet

Feature clustering

Stopping criteria

Recurrent clustering

Saving fingerprints

N

Y

(a)

Image features frominput image by ResNet

Class prediction

Leaf node

Fingerprints asoutput

N

Y

Raw image input

Search for thesimilar fingerprint

Order by thecomputed distance

(b)

Figure 3 Flowchart of training and searching steps for tree-based algorithm (a) Flowchart of training steps for tree-based algorithm (b)Flowchart of searching steps for tree-based algorithm

Input M an social imageD the current training instancesN cluster numbersR a set of features to represent an imageFP fingerprints for the imageT cluster tree

Output the fingerprints for M(1) Initialize N R T FP and resize M(2) compute R by ResNet to represent M(3) Repeat(4) create a node v(5) update R according to N clusters to M with k-means++(6) let Dj be the set of data in D satisfying outcome j(7) attach the node returned by T(Dj) to node v(8) until Stop(D R) is true(9) return FP as a finger for the image and attach leaf nodes to FP

ALGORITHM 1 Build fingerprints with a cluster tree

Input M an social imageMD image dataset for searchN number of the image datasetFP fingerprint for the image

Output S top-k similar images for image M(1) Initialize M MD FP S and resize images(2) M and MD is represented as R by ResNet(3) Checking procedure to find the top-k similar images to M in MD(4) For each image from MD get a cosine similarity by FP of the image pair(5) return top-k similar images order by cosine similarity

ALGORITHM 2 Find top-k similar images by fingerprints

Complexity 7

(a)

0

20

40

60

80

0 20 40 60 80 100 120

(b)

Figure 4 -e visualization of ResNet-50 (a) After 50 layers of convolution (b) Fusion by 1 1 ratio

20

25

30

35

40

50

45

0 10 20 30 40 50Iter lowast 10

Reca

ll

Ranking loss

Center lossSomax

Figure 5 Recall curves when training on Microsoft COCO

A ndash020Rock 071Climber 029Climbs 032In ndash012Between 003Two ndash013Very ndash007Large 008Rocks 073

(a)

A ndash015Group 006Of ndash009Mountain 067Climbers 051Rests 054At ndash011e ndash008Summit 018

(b)

Two 003Boys 035In ndash009A ndash012Field 057Kicking 041A ndash012Soccer 022Ball 019

(c)

A ndash017Man 046Surfs 054In ndash012e ndash026Water 021

(d)

Figure 6 Similarity matching between image and text

8 Complexity

As shown in Tables 1 and 2 we compare our proposedmethod with some of the latest and most advanced processeson the Flickr30k andMicrosoft COCO datasets We can findthat our approach works better than the compared methodsDifferent fromDSPE+FVlowast that uses external text corpora tolearn discriminative sentence features our model learnsthem directly from scratch in an end-to-end manner Whencomparing among our methods we can conclude as follows(a) our attention scheme is sufficient since the model withattention consistently outperforms those without notice on

both datasets (b) using ResNet as the underlying network tounderstand the deep semantics of images to get the spatialrelation features with the help of text context relations

It takes 235 seconds to use a brute force algorithm and 18seconds to mark with a tree-based clustering algorithm onMicrosoft COCO datasets

442 Quantitative Evaluation We also report quantitativeevaluation results with the frequently used BLEUmetric [46]for the proposed datasets -e results for Microsoft COCOand Flickr30K datasets are listed in Tables 3 and 4-e imageencoders of all methods listed here are either VGG-Net orResNet which are prevalent in this field We also report theablation test in terms of discarding the spatial relationmodel

-e results demonstrate that our method has a betterperformance than discarding the spatial relation model

(a) (b)

(c) (d)

Figure 7 Examples of attending to the correct object (white indicates the attended regions words below the figure indicate the labelling ofspatial relation features) (a) Havest (b) Begging (c) Search (d) Salvage

Table 1 Results of caption-image retrieval on the Flickr30Kdataset

R1 R5 R10RVP (T+ I) [38] 119 277 477CDCCA [39] 168 393 530MNLM [40] 230 507 629BRNN [41] 222 482 614DSPE+ FVlowast [42] 403 689 799m-RNN [15] 339 651 763m-CNN [43] 346 647 768Softmax-VGG 218 507 617Softmax-ResNet 223 529 632Softmax-ResNet-Attention 356 654 756Ranking-ResNet-Attention 413 691 813lowastUsing external text corpora -e ensemble or multimodel methods

Table 2 Results of caption-image retrieval on theMicrosoft COCOdataset

R1 R5 R10DSPE+ FVlowast 494 761 864m-RNN 417 727 835Sg-LSTM [44] 428 739 847MNLM 439 757 858DVSA [45] 392 699 805m-CNN 431 739 841Softmax-VGG 475 749 843Softmax-ResNet 491 761 852Softmax-ResNet-Attention 503 775 875Ranking-ResNet-Attention 527 793 887lowastUsing external text corpora -e ensemble or multimodel methods

Table 3 -e results of different caption methods on the MicrosoftCOCO

Bleu-1 Bleu-2 Bleu-3 Bleu-4SCA-CNN [19] 0714 0543 0409 0309Semantic ATT [30] 0699 0415 0306 0232SCN-LSTM [20] 0726 0581 0421 0318Softmax-ResNet-Attention 0736 0587 0423 0325Ranking-ResNet-Attention 0741 0598 0438 0319Softmax-ResNet-Attentionlowast 0703 0524 0413 0311Ranking-ResNet-Attentionlowast 0705 0540 0416 0313lowastDiscarding spatial relation model

Table 4: The results of different caption methods on the Flickr30K dataset.

Method                        Bleu-1   Bleu-2   Bleu-3   Bleu-4
SCA-CNN [19]                  0.705    0.515    0.402    0.304
Semantic ATT [30]             0.694    0.427    0.301    0.224
SCN-LSTM [20]                 0.718    0.569    0.413    0.308
Softmax-ResNet-Attention      0.715    0.573    0.429    0.325
Ranking-ResNet-Attention      0.729    0.581    0.416    0.319
Softmax-ResNet-Attention*     0.706    0.533    0.406    0.301
Ranking-ResNet-Attention*     0.712    0.542    0.418    0.309
*Discarding the spatial relation model.

Besides, our approach is slightly better than the state-of-the-art methods listed above, which verifies the efficiency of the topic condition in image captioning. Note that our model uses ResNet-50 as the encoder together with a simple attention model; thus, our approach is competitive with those models.

5. Discussion and Conclusion

This paper proposes an integrated model to recognize instances and objects jointly by leveraging the associated textual descriptions and presents a learning algorithm to estimate the model efficiently. The learning process requires only separate images and texts, without high-level captions. We use the residual network to deepen the learning of image semantics and combine it with the text to obtain hidden relation features contained in the picture. By constructing a joint inference module with self-attention, we fuse local and global elements. We also show that integrating images and text for deep semantic understanding makes it possible to label the spatial relation features. Furthermore, the use of a tree clustering algorithm accelerates the matching process. Experiments verify that the proposed method achieves competitive results on two generic annotation datasets, Flickr30k and Microsoft COCO.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The project was sponsored by the National Natural Science Foundation of China (41571401 and 41871226) and the Natural Science Foundation of Chongqing, China (cstc2019jcyj-msxmX0131).

References

[1] J. Zhu, L. Fang, and P. Ghamisi, "Deformable convolutional neural networks for hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 8, pp. 1254–1258, 2018.

[2] W. Lu, J. Li, T. Li, W. Guo, H. Zhang, and J. Guo, "Web multimedia object classification using cross-domain correlation knowledge," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1920–1929, 2013.

[3] M. Katsurai, T. Ogawa, and M. Haseyama, "A cross-modal approach for extracting semantic relationships between concepts using tagged images," IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1059–1074, 2014.

[4] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei, "Referring relationships," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6867–6876, Salt Lake City, UT, USA, June 2018.

[5] B. Wang, D. Lin, H. Xiong, and Y. F. Zheng, "Joint inference of objects and scenes with efficient learning of text-object-scene relations," IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 507–520, 2016.

[6] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, "Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.

[7] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, "What value do explicit high level concepts have in vision to language problems?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 203–212, Las Vegas, NV, USA, June 2016.

[8] M. J. Choi, A. Torralba, and A. S. Willsky, "A tree-based context model for object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 240–252, 2012.

[9] R. Mottaghi, X. Chen, X. Liu et al., "The role of context for object detection and semantic segmentation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014.

[10] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the Computer Vision and Pattern Recognition, pp. 2960–2968, Las Vegas, NV, USA, June 2016.

[11] J. Johnson, L. Ballan, and L. Fei-Fei, "Love thy neighbors: image annotation by exploiting image metadata," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4624–4632, Tampa, FL, USA, December 2015.

[12] A. Farhadi, M. Hejrati, M. A. Sadeghi et al., "Every picture tells a story: generating sentences from images," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Heidelberg, 2010.

[13] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proceedings of the Conference on Computational Natural Language Learning, Portland, OR, USA, June 2011.

[14] M. Mitchell, X. Han, J. Dodge et al., "Generating image descriptions from computer vision detections," in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 2012.

[15] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in Proceedings of the ICLR, pp. 1–17, San Diego, CA, USA, May 2015.

[16] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the CVPR, pp. 2960–2968, Las Vegas, NV, USA, June 2016.

[17] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun, "Semantic regularisation for recurrent image annotation," 2016, https://arxiv.org/abs/1611.05490.

[18] Y. Niu, Z. Lu, J.-R. Wen, T. Xiang, and S.-F. Chang, "Multi-modal multi-scale deep learning for large-scale image annotation," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1720–1731, 2019.

[19] L. Chen, H. Zhang, J. Xiao et al., "Spatial and channel-wise attention in convolutional networks for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6298–6306, Honolulu, Hawaii, July 2017.

[20] Z. Gan, "Semantic compositional networks for visual captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1141–1150, Honolulu, HI, USA, July 2017.

[21] A. Santoro, D. Raposo, D. G. Barrett et al., A Simple Neural Network Module for Relational Reasoning, pp. 4967–4976.

[22] S. Harnad, "The symbol grounding problem," Physica D: Nonlinear Phenomena, vol. 42, no. 1–3, pp. 335–346, 1990.

[23] M. Garnelo, K. Arulkumaran, and M. Shanahan, "Towards deep symbolic reinforcement learning," 2016, http://arxiv.org/abs/1609.05518.

[24] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, "Building machines that learn and think like people," Behavioral and Brain Sciences, vol. 40, 2017.

[25] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, "Scene graph generation by iterative message passing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3097–3106, Honolulu, Hawaii, July 2017.

[26] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, "Scene graph generation from objects, phrases and region captions," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1270–1279, Venice, Italy, October 2017.

[27] I. Yahav, O. Shehory, and D. Schwartz, "Comments mining with TF-IDF: the inherent bias and its removal," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 3, pp. 437–450, 2019.

[28] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, July 2016.

[29] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: a unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, Boston, MA, USA, 2015.

[30] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659, Las Vegas, NV, USA, June 2016.

[31] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, http://arxiv.org/abs/1409.0473.

[32] A. Karpathy, A. Joulin, and L. F. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 1889–1897, Montreal, Canada, 2014.

[33] H. Nam, J. Ha, and J. Kim, "Dual attention networks for multimodal reasoning and matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2156–2164, Honolulu, HI, USA, July 2017.

[34] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.

[35] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, Salt Lake City, UT, USA, June 2018.

[36] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, 2010.

[37] B. Bahmani, "Scalable K-Means++," in Proceedings of the VLDB Endowment, 5(7), pp. 622–633, Istanbul, Turkey, August 2012.

[38] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," 2014, http://arxiv.org/abs/1411.2539.

[39] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, "Category-based deep CCA for fine-grained venue discovery from multimodal data," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, 2019.

[40] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," Computer Vision - ECCV 2016, pp. 694–711, 2016.

[41] M. Sinecen, "Comparison of genomic best linear unbiased prediction and bayesian regularization neural networks for genomic selection," IEEE Access, vol. 7, pp. 79199–79210, 2019.

[42] L. Wang, Y. Li, and S. Lazebnik, "Learning deep structure-preserving image-text embeddings," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, June 2016.

[43] Y. Y. Zhang, D. S. Zhou et al., "Single-image crowd counting via multi-column convolutional neural network," in Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 589–597, Las Vegas, NV, USA, June 2016.

[44] Y. Xian and Y. Tian, "Self-guiding multimodal LSTM - when we do not have a perfect training dataset for image captioning," IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5241–5252, 2019.

[45] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664–676, 2017.

[46] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the Conference of the Association for Computational Linguistics, Philadelphia, PA, USA, 2002.


[3] M Katsurai T Ogawa and M Haseyama ldquoA cross-modalapproach for extracting semantic relationships betweenconcepts using tagged imagesrdquo IEEE Transactions on Mul-timedia vol 16 no 4 pp 1059ndash1074 2014

[4] R Krishna I Chami M Bernstein and L Fei-Fei ldquoReferringrelationshipsrdquo in Proceedings of the IEEECVF Conference onComputer Vision and Pattern Recognition pp 6867ndash6876 SaltLake City UT USA June 2018

[5] B Wang D Lin H Xiong and Y F Zheng ldquoJoint inferenceof objects and scenes with efficient learning of text-object-

scene relationsrdquo IEEE Transactions on Multimedia vol 18no 3 pp 507ndash520 2016

[6] K Fu J Jin R Cui F Sha and C Zhang ldquoAligning where tosee and what to tell image captioning with region-basedattention and scene-specific contextsrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 39 no 12pp 2321ndash2334 2017

[7] Q Wu C Shen L Liu A Dick and A van den HengelldquoWhat value do explicit high level concepts have in vision tolanguage problemsrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR) pp 203ndash212 Las Vegas NV USA June 2016

[8] M J Choi A Torralba and A S Willsky ldquoA tree-basedcontext model for object recognitionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 34 no 2pp 240ndash252 2012

[9] R Mottaghi X Chen X Liu et al ldquo-e role of context forobject detection and semantic segmentation in the wildrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Columbus OH USA June 2014

[10] H Hu G-T Zhou Z Deng Z Liao and G Mori ldquoLearningstructured inference neural networks with label relationsrdquo inProceedings of the Computer Vision and Pattern Recognitionpp 2960ndash2968 Las Vegas NV USA June 2016

[11] J Johnson L Ballan and L Fei-Fei ldquoLove thy neighborsimage annotation by exploiting image metadatardquo in Pro-ceedings of the IEEE International Conference on ComputerVision pp 4624ndash4632 Tampa FL USA December 2015

[12] A Farhadi M Hejrati M A Sadeghi et al ldquoEvery picture tellsa story generating sentences from imagesrdquo in Proceedings ofthe European conference on computer vision Springer BerlinHeidelberg 2010

[13] S Li G Kulkarni T L Berg A C Berg and Y ChoildquoComposing simple image descriptions using web-scalen-gramsrdquo in Proceedings of the Conference on ComputationalNatural Language Learning Portland OR USA June 2011

[14] M Mitchell X Han J Dodge et al ldquoGenerating image de-scriptions from computer vision detectionsrdquo in Proceedings ofthe 13th Conference of the European Chapter of the Associationfor Computational Linguistics Avignon France 2012

[15] J Mao W Xu Y Yang J Wang Z Huang and A YuilleldquoDeep captioning with multimodal recurrent neural networks(m-RNN)rdquo in Proceedings of the ICLR pp 1ndash17 San DiegoCA USA May 2015

[16] H Hu G-T Zhou Z Deng Z Liao and G Mori ldquoLearningstructured inference neural networks with label relationsrdquo inProceedings of the CVPR pp 2960ndash2968 Las Vegas NV USAJune 2016

[17] F Liu T Xiang T M Hospedales W Yang and C SunldquoSemantic regularisation for recurrent image annotationrdquo2016 httpsarxivorgabs161105490

[18] Y Niu Z Lu J-R Wen T Xiang and S-F Chang ldquoMulti-modal multi-scale deep learning for large-scale image an-notationrdquo IEEE Transactions on Image Processing vol 28no 4 pp 1720ndash1731 2019

[19] L Chen H Zhang J Xiao et al ldquoSpatial and channel-wiseattention in convolutional networks for image captioningrdquo inProceedings of IEEE Conference Computer Vision and PatternRecognition pp 6298ndash6306 Honolulu Hawaii July 2017

[20] Z Gan ldquoSemantic compositional networks for visual cap-tioningrdquo in Proceedings of IEEE Conference Computer Visionand Pattern Recognition pp 1141ndash1150 Honolulu HI USAJuly 2017

10 Complexity

[21] A Santoro D Raposo D G Barrett et al A Simple NeuralNetwork Module for Relational Reasoning pp 4967-4976

[22] S Harnad ldquo-e symbol grounding problemrdquo Physica DNonlinear Phenomena vol 42 no 1ndash3 pp 335ndash346 1990

[23] M Garnelo K Arulkumaran and M Shanahan ldquoTowardsdeep symbolic reinforcement learningrdquo 2016 httparxivorgabs160905518

[24] B M Lake T D Ullman J B Tenenbaum andS J Gershman ldquoBuilding machines that learn and think likepeoplerdquo Behavioral and Brain Sciences vol 40 2017

[25] D Xu Y Zhu C B Choy and L Fei-Fei ldquoScene graphgeneration by iterative message passingrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognitionpp 3097ndash3106 Honolulu Hawaii July 2017

[26] Y Li W Ouyang B Zhou K Wang and X Wang ldquoScenegraph generation from objects phrases and region captionsrdquoin Proceedings of the IEEE International Conference onComputer Vision pp 1270ndash1279 Venice Italy October 2017

[27] I Yahav O Shehory and D Schwartz ldquoComments miningwith TF-IDF the inherent bias and its removalrdquo IEEETransactions on Knowledge and Data Engineering vol 31no 3 pp 437ndash450 2019

[28] K He X Zhang S Ren et al ldquoDeep residual learning forimage recognitionrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition pp 770ndash778 LasVegas NA USA July 2016

[29] F Schroff D Kalenichenko and J Philbin ldquoFaceNet a unifiedembedding for face recognition and clusteringrdquo in Proceed-ings of the IEEE Conference on Computer Vision and PatternRecognition pp 815ndash823 Boston MA USA 2015

[30] Q You H Jin Z Wang C Fang and J Luo ldquoImage cap-tioning with semantic attentionrdquo in Proceedings of IEEE theConference Computer Vision and Pattern Recognitionpp 4651ndash4659 Las Vegas NV USA June 2016

[31] D Bahdanau K Cho and Y Bengio ldquoNeural Machinetranslation by jointly learning to align and translaterdquo 2014httparxivorgabs14090473

[32] A Karpathy A Joulin and L F Fei-Fei ldquoDeep fragmentembeddings for bidirectional image sentence mappingrdquo inProceedings of the International Conference on Neural Infor-mation Processing Systems pp 1889ndash1897 Montreal Canada2014

[33] H Nam J Ha and J Kim ldquoDual attention networks formultimodal reasoning and matchingrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognitionpp 2156ndash2164 Honolulu HI USA July 2017

[34] P Young A Lai M Hodosh and J Hockenmaier ldquoFromimage descriptions to visual denotations new similaritymetrics for semantic inference over event descriptionsrdquoTransactions of the Association for Computational Linguisticsvol 2 pp 67ndash78 2014

[35] H Hu J Gu Z Zhang J Dai and Y Wei ldquoRelation networksfor object detectionrdquo in Proceedings of the IEEECVF Con-ference on Computer Vision and Pattern Recognitionpp 3588ndash3597 Salt Lake City UT USA June 2018

[36] P Felzenszwalb R Girshick D McAllester and D RamananldquoObject detection with discriminatively trained part basedmodelsrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 32 no 9 2010

[37] B Bahmani ldquoScalable K-Means++rdquo in Proceedings of the VldbEndowment 5(7) pp 622ndash633 Istanbul Turkey August 2012

[38] R Kiros R Salakhutdinov and R S Zemel ldquoUnifying visual-semantic embeddings with multimodal neural languagemodelsrdquo 2014 httparxivorgabs14112539

[39] Y Yu S Tang K Aizawa and A Aizawa ldquoCategory-baseddeep CCA for fine-grained venue discovery from multimodaldatardquo IEEE Transactions on Neural Networks and LearningSystems vol 30 no 4 pp 1250ndash1258 2019

[40] J Johnson A Alahi and L Fei-Fei ldquoPerceptual losses for real-time style transfer and super-resolutionrdquo Computer Vision-ECCV 2016 pp 694ndash711 2016

[41] M Sinecen ldquoComparison of genomic best linear unbiasedprediction and bayesian regularization neural networks for ge-nomic selectionrdquo IEEE Access vol 7 pp 79199ndash79210 2019

[42] L Wang Y Li and S Lazebnik ldquoLearning deep structure-preserving image-text embeddingsrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionIEEE Las Vegas NA USA June 2016

[43] Y Y ZhangD S Zhou et al ldquoSingle-image crowd countingvia multi-column convolutional neural networkrdquo in Pro-ceedings of the Conference on Computer Vision and PatternRecognition pp 589ndash597 Las Vegas NV USA June 2016

[44] Y Xian and Y Tian ldquoSelf-guiding multimodal LSTM-whenwe do not have a perfect training dataset for image cap-tioningrdquo IEEE Transactions on Image Processing vol 28no 11 pp 5241ndash5252 2019

[45] A Karpathy and L Fei-Fei ldquoDeep visual-semantic alignmentsfor generating image descriptionsrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 39 no 4pp 664ndash676 2017

[46] K Papineni S Roukos T Ward and W-J Zhu ldquoBLEUAmethod for automatic evaluation of machine translationrdquo inProceedings of the Conference Association for ComputationalLinguistics Philadelphia PA USA 2002

Complexity 11

Page 6: Image-TextJointLearningforSocialImageswithSpatial ...Sep 27, 2019  · two scenarios are straightforward: given an input image (resp.sentence),thegoalistofindtherelationshipsbetween

The cluster vector is a vector of real values that measures the clusters at a node, and we adopt the definition

p_v(n) = \frac{\sum_{x_i \in D_v} C_{i,n}}{\left| D_v \right|},  (9)

where p_v(n) is the nth component of p_v, |D_v| is the total number of instances in node v, and C_i = \{Y_{i1}, Y_{i2}, \ldots, Y_{iq}\} \in \{0, 1\}^q. The predictive class vector is a vector with Boolean values indicating the membership of predictive classes at a node. The value is 1 when p_v(n) is larger than the threshold, implying that the node is the proper class or subclass for the instances of node v. Note that C_i and the threshold are obtained from the training process.

The algorithm uses three stopping criteria to determine when to stop growing the tree: (1) a node is a leaf node if it is identified as a predictive class; (2) a node is a leaf node if the data cannot be partitioned into more than three clusters by the classifiers; (3) the tree reaches the maximum depth.
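As a minimal sketch of equation (9) and stopping criterion (1), assuming binary label vectors C_i for the instances in a node and a hypothetical threshold value:

```python
import numpy as np

def cluster_vector(node_labels: np.ndarray) -> np.ndarray:
    """Equation (9): node_labels is a |D_v| x q binary matrix whose rows are
    the label vectors C_i of the instances that fall into node v."""
    return node_labels.sum(axis=0) / len(node_labels)

def is_predictive_leaf(p_v: np.ndarray, threshold: float) -> bool:
    """Stopping criterion (1): the node is a leaf if any component of p_v
    exceeds the (trained) threshold, i.e., it is identified as a predictive class."""
    return bool((p_v > threshold).any())

# Toy usage: 4 instances, 3 candidate classes.
labels = np.array([[1, 0, 0],
                   [1, 0, 1],
                   [1, 0, 0],
                   [1, 1, 0]])
p_v = cluster_vector(labels)          # -> [1.0, 0.25, 0.25]
print(is_predictive_leaf(p_v, 0.8))   # -> True (class 0 dominates the node)
```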

Figure 3 shows the flowchart of the training and retrieval steps. During training, we save the tree model, compute the position of each image or text in the leaf nodes, and take the path to the leaf as the picture's fingerprint; once all picture fingerprints are saved, the tree model is built. When matching, image features are extracted first. Starting from the root node, each clustering model recursively predicts the category of the feature vector at its level. After the image falls into a leaf node, the path to that leaf is output as the fingerprint of the picture. Cosine distance is then used to find the text with the matching fingerprint, and the results are sorted to obtain the annotation. The process is shown in Algorithms 1 and 2.

4. Experiments and Results

4.1. Datasets and Evaluation. We use Flickr30k and Microsoft COCO, which are widely used in caption-image retrieval and image caption generation tasks, to evaluate our proposed method. As a popular benchmark for annotation generation and retrieval tasks, Flickr30k contains 31,783 images, focusing mainly on people and animals, and 158,915 English captions (five per image). We randomly select 30,783 images for training and another 1,000 for the testing split.

Microsoft COCO is the largest existing dataset with both captions and region-level annotations; it consists of 82,783 training images and 40,504 validation images, and five sentences accompany each image. We use all 82,783 images for training and 5,000 images for the testing split. The testing split consists of images with more than three instances, selected from the validation dataset. For each testing image, we use the publicly available division commonly adopted in caption-image retrieval and caption generation tasks.

4.2. Implementation Details. We use ResNet-50 to extract 2048 feature maps of size 1x1 from "conv5_3", which helps us integrate the high-level features with the lower ones; the visualization results are provided in Figure 4. By adding a fully connected layer of dimension 128, we obtain the image embedding, which we name v1. In the text representation, the input layer is followed by two fully connected layers with dimensions d1 and d2, respectively. The output of the last fully connected layer is the text module embedding, which we name v2. We add a self-attention mechanism to the two embedding networks and compute v1' and v2'. Then v1' and v2' are connected to a triplet ranking loss. Further, we use the Adam optimizer to train the system.
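The following PyTorch-style sketch illustrates this wiring under the stated layer sizes; the self-attention step is reduced to a simple gating layer and the loss to a single-direction hinge, which are simplifying assumptions rather than the authors' exact formulation (equation (6) is not reproduced here, so the margin value is a placeholder).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Two-branch embedding: ResNet-50 image features -> 128-d, bag-of-words
    text -> d1 -> d2 = 128, each followed by a (simplified) self-attention gate."""
    def __init__(self, img_dim=2048, vocab_size=100000, d1=512, d2=128):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, d2)     # image branch, output v1
        self.txt_fc1 = nn.Linear(vocab_size, d1)
        self.txt_fc2 = nn.Linear(d1, d2)         # text branch, output v2
        self.gate = nn.Linear(d2, d2)            # stand-in for the self-attention step

    def forward(self, img_feat, txt_bow):
        v1 = self.img_fc(img_feat)
        v2 = self.txt_fc2(F.relu(self.txt_fc1(txt_bow)))
        v1p = v1 * torch.sigmoid(self.gate(v1))  # v1'
        v2p = v2 * torch.sigmoid(self.gate(v2))  # v2'
        return F.normalize(v1p, dim=-1), F.normalize(v2p, dim=-1)

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style ranking loss on cosine similarity; the margin value and the
    single-direction form are placeholders for the paper's equation (6)."""
    pos = (anchor * positive).sum(dim=-1)
    neg = (anchor * negative).sum(dim=-1)
    return F.relu(margin - pos + neg).mean()
```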

After offline debugging, we set the input layer length of the text module to 100,000. After counting the occurrences of all segmented words in the sentences, we keep the 99,999 most frequent words and map the remaining words to a single new word. In total, the number of neurons in the input layer is 100,000.
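A hedged sketch of this vocabulary construction (the helper names and the <rare> token are illustrative, not the authors' code):

```python
from collections import Counter
import numpy as np

def build_vocab(sentences, max_size=100000):
    """Keep the 99,999 most frequent tokens; all remaining words share one
    extra index, so the input layer has exactly max_size neurons."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size - 1))}
    vocab["<rare>"] = len(vocab)          # the single "new word" for the tail
    return vocab

def to_bow(sentence, vocab, size=100000):
    """Bag-of-words vector fed to the text branch's input layer."""
    vec = np.zeros(size, dtype=np.float32)
    for tok in sentence.lower().split():
        vec[vocab.get(tok, vocab["<rare>"])] += 1.0
    return vec
```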

For the text domain, the second layer has 512 neurons and the third has 128. Meanwhile, the length of the image embedding is also 128. We set the margin (in equation (6)), and the parameters of Adam's algorithm are as follows: lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-08, decay = 0.0.
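These hyperparameters map directly onto a standard Adam optimizer; a brief sketch, assuming the JointEmbedding module outlined earlier:

```python
from torch.optim import Adam

model = JointEmbedding()   # the two-branch sketch defined above
# decay = 0.0 corresponds to using no learning-rate schedule here.
optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08)
```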

To speed up training, we conducted negative sampling. The sentence that matches the current image is the positive sample. To thoroughly train the low-frequency sentences, we randomly select sentences outside the sample as negative samples.
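A hypothetical sketch of such triplet construction; uniform sampling is used here as a placeholder for the paper's low-frequency-aware selection, which is not fully specified:

```python
import random

def sample_triplets(pairs, num_negatives=1, seed=0):
    """pairs: list of (image_id, matching_sentence). The matching sentence is
    the positive; negatives are drawn from the sentences of other pairs."""
    rng = random.Random(seed)
    sentences = [s for _, s in pairs]
    triplets = []
    for image_id, positive in pairs:
        others = [s for s in sentences if s != positive]
        for negative in rng.sample(others, min(num_negatives, len(others))):
            triplets.append((image_id, positive, negative))
    return triplets
```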

4.3. Result of Experiments. Microsoft COCO consists of 123,287 images and 616,767 descriptions; five text descriptions accompany each image. We randomly select the public 82,783 images for training and keep 5,000 images as test data. We can see in Figure 5 that our loss function is slightly better than the others once the number of iterations exceeds 400. The Softmax loss is good at optimizing the distance between classes but weak at optimizing the range within a category; the triplet ranking loss addresses this problem to some extent.

As shown in Figure 6, we can observe that words such as "climbs", "rests at", and "in the", which express spatial relationships between objects, get significantly higher scores than other words, while words like "the" and "a" always have lower matching scores. However, some prepositions carrying positional relations, such as "by" in the upper-right corner, have higher scores than other prepositions.

The use of attention weights the features related to spatial relation traits. As shown in Figure 7, for example in the image in the upper-right corner, we can infer from the coins in the cup next to the person lying down that he is begging.

4.4. Comparison with the State-of-the-Art

4.4.1. Qualitative Evaluation. To evaluate the effectiveness of the proposed method, we first compare its classification performance with the state-of-the-art performance reported in the literature. We manually labelled the ground-truth mapping from the nouns to the object classes. By varying the threshold, the precision-recall curve can be sketched to measure accuracy. We use the common "R@1", "R@5", and "R@10" metrics, i.e., recall rates at the top 1, 5, and 10 results.
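As a reference for how these recall rates are computed, the following sketch assumes the ground-truth image for caption i is image i (diagonal alignment of the score matrix):

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 5, 10)):
    """similarity[i, j]: score between caption i and image j; the true image
    for caption i is assumed to be image j = i."""
    ranks = []
    for i, row in enumerate(similarity):
        order = np.argsort(-row)                       # images sorted by score
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the true image
    ranks = np.asarray(ranks)
    return {f"R@{k}": float((ranks < k).mean()) for k in ks}

# Example with random scores for 100 caption-image pairs:
print(recall_at_k(np.random.rand(100, 100)))
```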


Figure 3: Flowchart of the training and searching steps for the tree-based algorithm. (a) Training: raw images are input; image features are extracted from the input image by ResNet; the features are clustered; clustering is repeated recursively until the stopping criteria are met; the fingerprints are saved. (b) Searching: a raw image is input; image features are extracted by ResNet; class prediction is repeated until a leaf node is reached and the fingerprint is output; similar fingerprints are searched and ordered by the computed distance.

Input: M, a social image; D, the current training instances; N, the number of clusters; R, a set of features representing an image; FP, the fingerprints for the image; T, the cluster tree.
Output: the fingerprints for M.
(1) Initialize N, R, T, FP and resize M
(2) Compute R by ResNet to represent M
(3) Repeat
(4)   Create a node v
(5)   Update R by assigning M to N clusters with k-means++
(6)   Let Dj be the set of data in D satisfying outcome j
(7)   Attach the node returned by T(Dj) to node v
(8) Until Stop(D, R) is true
(9) Return FP as the fingerprint for the image and attach the leaf nodes to FP

ALGORITHM 1: Build fingerprints with a cluster tree.
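A compact sketch of this idea, assuming scikit-learn's k-means++ initialization, a fixed maximum depth, and simplified stopping tests (the feature extractor is replaced by stand-in vectors):

```python
import numpy as np
from sklearn.cluster import KMeans

class ClusterTree:
    """Recursively cluster image features with k-means++; the root-to-leaf
    path of cluster indices serves as an image's fingerprint."""
    def __init__(self, n_clusters=4, max_depth=3, min_size=4):
        self.n_clusters, self.max_depth, self.min_size = n_clusters, max_depth, min_size
        self.km, self.children = None, {}

    def fit(self, feats, depth=0):
        if depth >= self.max_depth or len(feats) < max(self.min_size, self.n_clusters):
            return self                                   # leaf node
        self.km = KMeans(n_clusters=self.n_clusters, init="k-means++", n_init=10).fit(feats)
        for j in range(self.n_clusters):
            subset = feats[self.km.labels_ == j]          # D_j: data with outcome j
            if len(subset) > 0:
                child = ClusterTree(self.n_clusters, self.max_depth, self.min_size)
                self.children[j] = child.fit(subset, depth + 1)
        return self

    def fingerprint(self, feat):
        node, path = self, []
        while node is not None and node.km is not None:
            j = int(node.km.predict(feat.reshape(1, -1))[0])
            path.append(j)
            node = node.children.get(j)
        return tuple(path)

# Toy usage with stand-in "ResNet" features for 64 images:
feats = np.random.rand(64, 2048).astype("float32")
tree = ClusterTree().fit(feats)
print(tree.fingerprint(feats[0]))
```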

Input: M, a social image; MD, the image dataset for search; N, the number of images in the dataset; FP, the fingerprint for the image.
Output: S, the top-k similar images for image M.
(1) Initialize M, MD, FP, S and resize the images
(2) Represent M and MD as R by ResNet
(3) Run the checking procedure to find the top-k images most similar to M in MD
(4) For each image in MD, compute the cosine similarity from the fingerprints of the image pair
(5) Return the top-k similar images ordered by cosine similarity

ALGORITHM 2: Find top-k similar images by fingerprints.
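A minimal sketch of the search step, assuming fingerprints are fixed-length numeric vectors (e.g., cluster paths padded to a common depth):

```python
import numpy as np

def topk_similar(query_fp, db_fps, db_ids, k=5):
    """Rank database images by the cosine similarity of their fingerprints
    to the query fingerprint and return the k best matches."""
    q = np.asarray(query_fp, dtype=float)
    db = np.asarray(db_fps, dtype=float)
    sims = db @ q / (np.linalg.norm(db, axis=1) * np.linalg.norm(q) + 1e-12)
    top = np.argsort(-sims)[:k]
    return [(db_ids[i], float(sims[i])) for i in top]

# Toy usage: three database fingerprints of depth 3.
print(topk_similar([1, 0, 2], [[1, 0, 2], [3, 1, 0], [1, 1, 2]], ["a", "b", "c"], k=2))
```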


Figure 4: The visualization of ResNet-50. (a) After 50 layers of convolution. (b) Fusion at a 1:1 ratio.

Figure 5: Recall curves when training on Microsoft COCO (x-axis: iterations x10; y-axis: recall; curves: ranking loss, center loss, and Softmax).

Figure 6: Similarity matching between image and text.
(a) A -0.20, Rock 0.71, Climber 0.29, Climbs 0.32, In -0.12, Between 0.03, Two -0.13, Very -0.07, Large 0.08, Rocks 0.73.
(b) A -0.15, Group 0.06, Of -0.09, Mountain 0.67, Climbers 0.51, Rests 0.54, At -0.11, The -0.08, Summit 0.18.
(c) Two 0.03, Boys 0.35, In -0.09, A -0.12, Field 0.57, Kicking 0.41, A -0.12, Soccer 0.22, Ball 0.19.
(d) A -0.17, Man 0.46, Surfs 0.54, In -0.12, The -0.26, Water 0.21.


As shown in Tables 1 and 2, we compare our proposed method with some of the latest and most advanced approaches on the Flickr30k and Microsoft COCO datasets. We can see that our approach works better than the compared methods. Different from DSPE+FV*, which uses external text corpora to learn discriminative sentence features, our model learns them directly from scratch in an end-to-end manner. When comparing among our own methods, we can conclude as follows: (a) our attention scheme is effective, since the model with attention consistently outperforms those without attention on both datasets; (b) using ResNet as the underlying network helps capture the deep semantics of images and obtain the spatial relation features with the help of text context relations.

It takes 235 seconds with a brute-force algorithm but only 18 seconds with the tree-based clustering algorithm to annotate the Microsoft COCO dataset.

4.4.2. Quantitative Evaluation. We also report quantitative evaluation results with the frequently used BLEU metric [46] on the two datasets. The results for the Microsoft COCO and Flickr30K datasets are listed in Tables 3 and 4. The image encoders of all methods listed here are either VGG-Net or ResNet, which are prevalent in this field. We also report an ablation test in which the spatial relation model is discarded.

The results demonstrate that our method performs better than the variant that discards the spatial relation model.
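For reference, corpus-level BLEU-1 to BLEU-4 can be computed as in the following sketch, assuming NLTK's corpus_bleu (the paper does not state which BLEU implementation was used):

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu_1_to_4(references, hypotheses):
    """references: one list of tokenized reference captions per image (e.g., the
    five ground-truth captions); hypotheses: one tokenized generated caption
    per image. Returns corpus-level BLEU-1 to BLEU-4."""
    weights = [(1.0, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1 / 3, 1 / 3, 1 / 3, 0), (0.25, 0.25, 0.25, 0.25)]
    return [corpus_bleu(references, hypotheses, weights=w) for w in weights]

# Toy usage with a single image:
refs = [[["a", "man", "surfs", "in", "the", "water"]]]
hyps = [["a", "man", "surfs", "in", "water"]]
print(bleu_1_to_4(refs, hyps))
```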

Figure 7: Examples of attending to the correct object (white indicates the attended regions; the words below the figure indicate the labelling of spatial relation features). (a) Harvest. (b) Begging. (c) Search. (d) Salvage.

Table 1: Results of caption-image retrieval on the Flickr30K dataset.

Method                       R@1    R@5    R@10
RVP (T+I) [38]               11.9   27.7   47.7
CDCCA [39]                   16.8   39.3   53.0
MNLM [40]                    23.0   50.7   62.9
BRNN [41]                    22.2   48.2   61.4
DSPE+FV* [42]                40.3   68.9   79.9
m-RNN [15]                   33.9   65.1   76.3
m-CNN [43]                   34.6   64.7   76.8
Softmax-VGG                  21.8   50.7   61.7
Softmax-ResNet               22.3   52.9   63.2
Softmax-ResNet-Attention     35.6   65.4   75.6
Ranking-ResNet-Attention     41.3   69.1   81.3
* Using external text corpora. The ensemble or multimodel methods.

Table 2: Results of caption-image retrieval on the Microsoft COCO dataset.

Method                       R@1    R@5    R@10
DSPE+FV*                     49.4   76.1   86.4
m-RNN                        41.7   72.7   83.5
Sg-LSTM [44]                 42.8   73.9   84.7
MNLM                         43.9   75.7   85.8
DVSA [45]                    39.2   69.9   80.5
m-CNN                        43.1   73.9   84.1
Softmax-VGG                  47.5   74.9   84.3
Softmax-ResNet               49.1   76.1   85.2
Softmax-ResNet-Attention     50.3   77.5   87.5
Ranking-ResNet-Attention     52.7   79.3   88.7
* Using external text corpora. The ensemble or multimodel methods.

Table 3: The results of different caption methods on the Microsoft COCO.

Method                        Bleu-1   Bleu-2   Bleu-3   Bleu-4
SCA-CNN [19]                  0.714    0.543    0.409    0.309
Semantic ATT [30]             0.699    0.415    0.306    0.232
SCN-LSTM [20]                 0.726    0.581    0.421    0.318
Softmax-ResNet-Attention      0.736    0.587    0.423    0.325
Ranking-ResNet-Attention      0.741    0.598    0.438    0.319
Softmax-ResNet-Attention*     0.703    0.524    0.413    0.311
Ranking-ResNet-Attention*     0.705    0.540    0.416    0.313
* Discarding the spatial relation model.

Table 4: The results of different caption methods on the Flickr30K.

Method                        Bleu-1   Bleu-2   Bleu-3   Bleu-4
SCA-CNN [19]                  0.705    0.515    0.402    0.304
Semantic ATT [30]             0.694    0.427    0.301    0.224
SCN-LSTM [20]                 0.718    0.569    0.413    0.308
Softmax-ResNet-Attention      0.715    0.573    0.429    0.325
Ranking-ResNet-Attention      0.729    0.581    0.416    0.319
Softmax-ResNet-Attention*     0.706    0.533    0.406    0.301
Ranking-ResNet-Attention*     0.712    0.542    0.418    0.309
* Discarding the spatial relation model.


Besides, our approach is slightly better than the state-of-the-art methods, which verifies the effectiveness of the topic condition in image captioning. Note that our model uses ResNet-50 as the encoder with a simple attention model; thus, our approach is competitive with those models.

5. Discussion and Conclusion

This paper proposes an integrated model to recognize instances and objects jointly by leveraging the associated textual descriptions and presents a learning algorithm to estimate the model efficiently. The learning process requires only separate images and texts, without high-level captions. We use the residual network to deepen the learning of image semantics and combine it with the text to obtain hidden relation features contained in the picture. By constructing a joint inference module with self-attention, we fuse local and global elements. We also show that integrating images and text for deep semantic understanding can label the spatial relation features. Furthermore, the use of a tree clustering algorithm accelerates the matching process. Experiments verify that the proposed method achieves competitive results on two generic annotation datasets, Flickr30k and Microsoft COCO.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This project was sponsored by the National Natural Science Foundation of China (41571401 and 41871226) and the Natural Science Foundation of Chongqing, China (cstc2019jcyj-msxmX0131).

References

[1] J. Zhu, L. Fang, and P. Ghamisi, "Deformable convolutional neural networks for hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 8, pp. 1254–1258, 2018.
[2] W. Lu, J. Li, T. Li, W. Guo, H. Zhang, and J. Guo, "Web multimedia object classification using cross-domain correlation knowledge," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1920–1929, 2013.
[3] M. Katsurai, T. Ogawa, and M. Haseyama, "A cross-modal approach for extracting semantic relationships between concepts using tagged images," IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1059–1074, 2014.
[4] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei, "Referring relationships," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6867–6876, Salt Lake City, UT, USA, June 2018.
[5] B. Wang, D. Lin, H. Xiong, and Y. F. Zheng, "Joint inference of objects and scenes with efficient learning of text-object-scene relations," IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 507–520, 2016.
[6] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, "Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.
[7] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, "What value do explicit high level concepts have in vision to language problems?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 203–212, Las Vegas, NV, USA, June 2016.
[8] M. J. Choi, A. Torralba, and A. S. Willsky, "A tree-based context model for object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 240–252, 2012.
[9] R. Mottaghi, X. Chen, X. Liu et al., "The role of context for object detection and semantic segmentation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014.
[10] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the Computer Vision and Pattern Recognition, pp. 2960–2968, Las Vegas, NV, USA, June 2016.
[11] J. Johnson, L. Ballan, and L. Fei-Fei, "Love thy neighbors: image annotation by exploiting image metadata," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4624–4632, Tampa, FL, USA, December 2015.
[12] A. Farhadi, M. Hejrati, M. A. Sadeghi et al., "Every picture tells a story: generating sentences from images," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Heidelberg, 2010.
[13] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proceedings of the Conference on Computational Natural Language Learning, Portland, OR, USA, June 2011.
[14] M. Mitchell, X. Han, J. Dodge et al., "Generating image descriptions from computer vision detections," in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 2012.
[15] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in Proceedings of the ICLR, pp. 1–17, San Diego, CA, USA, May 2015.
[16] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the CVPR, pp. 2960–2968, Las Vegas, NV, USA, June 2016.
[17] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun, "Semantic regularisation for recurrent image annotation," 2016, https://arxiv.org/abs/1611.05490.
[18] Y. Niu, Z. Lu, J.-R. Wen, T. Xiang, and S.-F. Chang, "Multi-modal multi-scale deep learning for large-scale image annotation," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1720–1731, 2019.
[19] L. Chen, H. Zhang, J. Xiao et al., "Spatial and channel-wise attention in convolutional networks for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6298–6306, Honolulu, HI, USA, July 2017.
[20] Z. Gan, "Semantic compositional networks for visual captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1141–1150, Honolulu, HI, USA, July 2017.
[21] A. Santoro, D. Raposo, D. G. Barrett et al., "A simple neural network module for relational reasoning," pp. 4967–4976.
[22] S. Harnad, "The symbol grounding problem," Physica D: Nonlinear Phenomena, vol. 42, no. 1–3, pp. 335–346, 1990.
[23] M. Garnelo, K. Arulkumaran, and M. Shanahan, "Towards deep symbolic reinforcement learning," 2016, http://arxiv.org/abs/1609.05518.
[24] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, "Building machines that learn and think like people," Behavioral and Brain Sciences, vol. 40, 2017.
[25] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, "Scene graph generation by iterative message passing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3097–3106, Honolulu, HI, USA, July 2017.
[26] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, "Scene graph generation from objects, phrases and region captions," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1270–1279, Venice, Italy, October 2017.
[27] I. Yahav, O. Shehory, and D. Schwartz, "Comments mining with TF-IDF: the inherent bias and its removal," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 3, pp. 437–450, 2019.
[28] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, July 2016.
[29] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: a unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, Boston, MA, USA, 2015.
[30] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659, Las Vegas, NV, USA, June 2016.
[31] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, http://arxiv.org/abs/1409.0473.
[32] A. Karpathy, A. Joulin, and L. F. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 1889–1897, Montreal, Canada, 2014.
[33] H. Nam, J. Ha, and J. Kim, "Dual attention networks for multimodal reasoning and matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2156–2164, Honolulu, HI, USA, July 2017.
[34] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[35] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, Salt Lake City, UT, USA, June 2018.
[36] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, 2010.
[37] B. Bahmani, "Scalable K-Means++," in Proceedings of the VLDB Endowment, vol. 5, no. 7, pp. 622–633, Istanbul, Turkey, August 2012.
[38] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," 2014, http://arxiv.org/abs/1411.2539.
[39] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, "Category-based deep CCA for fine-grained venue discovery from multimodal data," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, 2019.
[40] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," Computer Vision – ECCV 2016, pp. 694–711, 2016.
[41] M. Sinecen, "Comparison of genomic best linear unbiased prediction and bayesian regularization neural networks for genomic selection," IEEE Access, vol. 7, pp. 79199–79210, 2019.
[42] L. Wang, Y. Li, and S. Lazebnik, "Learning deep structure-preserving image-text embeddings," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, June 2016.
[43] Y. Y. Zhang, D. S. Zhou et al., "Single-image crowd counting via multi-column convolutional neural network," in Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 589–597, Las Vegas, NV, USA, June 2016.
[44] Y. Xian and Y. Tian, "Self-guiding multimodal LSTM: when we do not have a perfect training dataset for image captioning," IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5241–5252, 2019.
[45] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664–676, 2017.
[46] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the Conference of the Association for Computational Linguistics, Philadelphia, PA, USA, 2002.


Page 7: Image-TextJointLearningforSocialImageswithSpatial ...Sep 27, 2019  · two scenarios are straightforward: given an input image (resp.sentence),thegoalistofindtherelationshipsbetween

Raw images input

Image features frominput image by ResNet

Feature clustering

Stopping criteria

Recurrent clustering

Saving fingerprints

N

Y

(a)

Image features frominput image by ResNet

Class prediction

Leaf node

Fingerprints asoutput

N

Y

Raw image input

Search for thesimilar fingerprint

Order by thecomputed distance

(b)

Figure 3 Flowchart of training and searching steps for tree-based algorithm (a) Flowchart of training steps for tree-based algorithm (b)Flowchart of searching steps for tree-based algorithm

Input M an social imageD the current training instancesN cluster numbersR a set of features to represent an imageFP fingerprints for the imageT cluster tree

Output the fingerprints for M(1) Initialize N R T FP and resize M(2) compute R by ResNet to represent M(3) Repeat(4) create a node v(5) update R according to N clusters to M with k-means++(6) let Dj be the set of data in D satisfying outcome j(7) attach the node returned by T(Dj) to node v(8) until Stop(D R) is true(9) return FP as a finger for the image and attach leaf nodes to FP

ALGORITHM 1 Build fingerprints with a cluster tree

Input M an social imageMD image dataset for searchN number of the image datasetFP fingerprint for the image

Output S top-k similar images for image M(1) Initialize M MD FP S and resize images(2) M and MD is represented as R by ResNet(3) Checking procedure to find the top-k similar images to M in MD(4) For each image from MD get a cosine similarity by FP of the image pair(5) return top-k similar images order by cosine similarity

ALGORITHM 2 Find top-k similar images by fingerprints

Complexity 7

(a)

0

20

40

60

80

0 20 40 60 80 100 120

(b)

Figure 4 -e visualization of ResNet-50 (a) After 50 layers of convolution (b) Fusion by 1 1 ratio

20

25

30

35

40

50

45

0 10 20 30 40 50Iter lowast 10

Reca

ll

Ranking loss

Center lossSomax

Figure 5 Recall curves when training on Microsoft COCO

A ndash020Rock 071Climber 029Climbs 032In ndash012Between 003Two ndash013Very ndash007Large 008Rocks 073

(a)

A ndash015Group 006Of ndash009Mountain 067Climbers 051Rests 054At ndash011e ndash008Summit 018

(b)

Two 003Boys 035In ndash009A ndash012Field 057Kicking 041A ndash012Soccer 022Ball 019

(c)

A ndash017Man 046Surfs 054In ndash012e ndash026Water 021

(d)

Figure 6 Similarity matching between image and text

8 Complexity

As shown in Tables 1 and 2 we compare our proposedmethod with some of the latest and most advanced processeson the Flickr30k andMicrosoft COCO datasets We can findthat our approach works better than the compared methodsDifferent fromDSPE+FVlowast that uses external text corpora tolearn discriminative sentence features our model learnsthem directly from scratch in an end-to-end manner Whencomparing among our methods we can conclude as follows(a) our attention scheme is sufficient since the model withattention consistently outperforms those without notice on

both datasets (b) using ResNet as the underlying network tounderstand the deep semantics of images to get the spatialrelation features with the help of text context relations

It takes 235 seconds to use a brute force algorithm and 18seconds to mark with a tree-based clustering algorithm onMicrosoft COCO datasets

442 Quantitative Evaluation We also report quantitativeevaluation results with the frequently used BLEUmetric [46]for the proposed datasets -e results for Microsoft COCOand Flickr30K datasets are listed in Tables 3 and 4-e imageencoders of all methods listed here are either VGG-Net orResNet which are prevalent in this field We also report theablation test in terms of discarding the spatial relationmodel

-e results demonstrate that our method has a betterperformance than discarding the spatial relation model

(a) (b)

(c) (d)

Figure 7 Examples of attending to the correct object (white indicates the attended regions words below the figure indicate the labelling ofspatial relation features) (a) Havest (b) Begging (c) Search (d) Salvage

Table 1 Results of caption-image retrieval on the Flickr30Kdataset

R1 R5 R10RVP (T+ I) [38] 119 277 477CDCCA [39] 168 393 530MNLM [40] 230 507 629BRNN [41] 222 482 614DSPE+ FVlowast [42] 403 689 799m-RNN [15] 339 651 763m-CNN [43] 346 647 768Softmax-VGG 218 507 617Softmax-ResNet 223 529 632Softmax-ResNet-Attention 356 654 756Ranking-ResNet-Attention 413 691 813lowastUsing external text corpora -e ensemble or multimodel methods

Table 2 Results of caption-image retrieval on theMicrosoft COCOdataset

R1 R5 R10DSPE+ FVlowast 494 761 864m-RNN 417 727 835Sg-LSTM [44] 428 739 847MNLM 439 757 858DVSA [45] 392 699 805m-CNN 431 739 841Softmax-VGG 475 749 843Softmax-ResNet 491 761 852Softmax-ResNet-Attention 503 775 875Ranking-ResNet-Attention 527 793 887lowastUsing external text corpora -e ensemble or multimodel methods

Table 3 -e results of different caption methods on the MicrosoftCOCO

Bleu-1 Bleu-2 Bleu-3 Bleu-4SCA-CNN [19] 0714 0543 0409 0309Semantic ATT [30] 0699 0415 0306 0232SCN-LSTM [20] 0726 0581 0421 0318Softmax-ResNet-Attention 0736 0587 0423 0325Ranking-ResNet-Attention 0741 0598 0438 0319Softmax-ResNet-Attentionlowast 0703 0524 0413 0311Ranking-ResNet-Attentionlowast 0705 0540 0416 0313lowastDiscarding spatial relation model

Table 4 -e results of different caption methods on the Flickr30K

Bleu-1 Bleu-2 Bleu-3 Bleu-4SCA-CNN [19] 0705 0515 0402 0304Semantic ATT [30] 0694 0427 0301 0224SCN-LSTM [20] 0718 0569 0413 0308Softmax-ResNet-Attention 0715 0573 0429 0325Ranking-ResNet-Attention 0729 0581 0416 0319Softmax-ResNet-Attentionlowast 0706 0533 0406 0301Ranking-ResNet-Attentionlowast 0712 0542 0418 0309lowastDiscarding spatial relation model

Complexity 9

Besides our approach is slightly better than the proposedstate-of-the-art methods which verifies the efficiency oftopic-condition in image-captioning Note that our modeluses ResNet-50 as the encoder which is a simple attentionmodel-us our approach is competitive with those models

5 Discussion and Conclusion

-is paper proposes an integrated model to recognize in-stances and objects jointly by leveraging the associatedtextual descriptions and presents a learning algorithm toestimate the model efficiently -e learning process requiresonly separate images and texts without high-level captionsWe use the residual network to deepen the learning of imagesemantic and combine with the text to obtain some hiddenrelation features contained in the picture By constructing ajoint inference module with self-attention we make a fusionof local and global elements We also show that integratingimages and text for deep semantic understanding to label thespatial relation features Furthermore the use of a treeclustering algorithm accelerates the matching process Ex-periments verify that the proposed method achieves com-petitive results on two generic annotation datasets Flickr30kand Microsoft COCO

Data Availability

-e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

-e authors declare no conflicts of interest

Acknowledgments

-e project was sponsored by the National Nature ScienceFoundation of China (41571401 and 41871226) and NaturalScience Foundation of Chongqing China (cstc2019jcyj-msxmX0131)

References

[1] J Zhu L Fang and P Ghamisi ldquoDeformable convolutionalneural networks for hyperspectral image classificationrdquo IEEEGeoscience and Remote Sensing Letters vol 15 no 8pp 1254ndash1258 2018

[2] W Lu J Li T Li W Guo H Zhang and J Guo ldquoWebmultimedia object classification using cross-domain corre-lation knowledgerdquo IEEE Transactions on Multimedia vol 15no 8 pp 1920ndash1929 2013

[3] M Katsurai T Ogawa and M Haseyama ldquoA cross-modalapproach for extracting semantic relationships betweenconcepts using tagged imagesrdquo IEEE Transactions on Mul-timedia vol 16 no 4 pp 1059ndash1074 2014

[4] R Krishna I Chami M Bernstein and L Fei-Fei ldquoReferringrelationshipsrdquo in Proceedings of the IEEECVF Conference onComputer Vision and Pattern Recognition pp 6867ndash6876 SaltLake City UT USA June 2018

[5] B Wang D Lin H Xiong and Y F Zheng ldquoJoint inferenceof objects and scenes with efficient learning of text-object-

scene relationsrdquo IEEE Transactions on Multimedia vol 18no 3 pp 507ndash520 2016

[6] K Fu J Jin R Cui F Sha and C Zhang ldquoAligning where tosee and what to tell image captioning with region-basedattention and scene-specific contextsrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 39 no 12pp 2321ndash2334 2017

[7] Q Wu C Shen L Liu A Dick and A van den HengelldquoWhat value do explicit high level concepts have in vision tolanguage problemsrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR) pp 203ndash212 Las Vegas NV USA June 2016

[8] M J Choi A Torralba and A S Willsky ldquoA tree-basedcontext model for object recognitionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 34 no 2pp 240ndash252 2012

[9] R Mottaghi X Chen X Liu et al ldquo-e role of context forobject detection and semantic segmentation in the wildrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Columbus OH USA June 2014

[10] H Hu G-T Zhou Z Deng Z Liao and G Mori ldquoLearningstructured inference neural networks with label relationsrdquo inProceedings of the Computer Vision and Pattern Recognitionpp 2960ndash2968 Las Vegas NV USA June 2016

[11] J Johnson L Ballan and L Fei-Fei ldquoLove thy neighborsimage annotation by exploiting image metadatardquo in Pro-ceedings of the IEEE International Conference on ComputerVision pp 4624ndash4632 Tampa FL USA December 2015

[12] A Farhadi M Hejrati M A Sadeghi et al ldquoEvery picture tellsa story generating sentences from imagesrdquo in Proceedings ofthe European conference on computer vision Springer BerlinHeidelberg 2010

[13] S Li G Kulkarni T L Berg A C Berg and Y ChoildquoComposing simple image descriptions using web-scalen-gramsrdquo in Proceedings of the Conference on ComputationalNatural Language Learning Portland OR USA June 2011

[14] M Mitchell X Han J Dodge et al ldquoGenerating image de-scriptions from computer vision detectionsrdquo in Proceedings ofthe 13th Conference of the European Chapter of the Associationfor Computational Linguistics Avignon France 2012

[15] J Mao W Xu Y Yang J Wang Z Huang and A YuilleldquoDeep captioning with multimodal recurrent neural networks(m-RNN)rdquo in Proceedings of the ICLR pp 1ndash17 San DiegoCA USA May 2015

[16] H Hu G-T Zhou Z Deng Z Liao and G Mori ldquoLearningstructured inference neural networks with label relationsrdquo inProceedings of the CVPR pp 2960ndash2968 Las Vegas NV USAJune 2016

[17] F Liu T Xiang T M Hospedales W Yang and C SunldquoSemantic regularisation for recurrent image annotationrdquo2016 httpsarxivorgabs161105490

[18] Y Niu Z Lu J-R Wen T Xiang and S-F Chang ldquoMulti-modal multi-scale deep learning for large-scale image an-notationrdquo IEEE Transactions on Image Processing vol 28no 4 pp 1720ndash1731 2019

[19] L Chen H Zhang J Xiao et al ldquoSpatial and channel-wiseattention in convolutional networks for image captioningrdquo inProceedings of IEEE Conference Computer Vision and PatternRecognition pp 6298ndash6306 Honolulu Hawaii July 2017

[20] Z Gan ldquoSemantic compositional networks for visual cap-tioningrdquo in Proceedings of IEEE Conference Computer Visionand Pattern Recognition pp 1141ndash1150 Honolulu HI USAJuly 2017

10 Complexity

[21] A Santoro D Raposo D G Barrett et al A Simple NeuralNetwork Module for Relational Reasoning pp 4967-4976

[22] S Harnad ldquo-e symbol grounding problemrdquo Physica DNonlinear Phenomena vol 42 no 1ndash3 pp 335ndash346 1990

[23] M Garnelo K Arulkumaran and M Shanahan ldquoTowardsdeep symbolic reinforcement learningrdquo 2016 httparxivorgabs160905518

[24] B M Lake T D Ullman J B Tenenbaum andS J Gershman ldquoBuilding machines that learn and think likepeoplerdquo Behavioral and Brain Sciences vol 40 2017

[25] D Xu Y Zhu C B Choy and L Fei-Fei ldquoScene graphgeneration by iterative message passingrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognitionpp 3097ndash3106 Honolulu Hawaii July 2017

[26] Y Li W Ouyang B Zhou K Wang and X Wang ldquoScenegraph generation from objects phrases and region captionsrdquoin Proceedings of the IEEE International Conference onComputer Vision pp 1270ndash1279 Venice Italy October 2017

[27] I Yahav O Shehory and D Schwartz ldquoComments miningwith TF-IDF the inherent bias and its removalrdquo IEEETransactions on Knowledge and Data Engineering vol 31no 3 pp 437ndash450 2019

[28] K He X Zhang S Ren et al ldquoDeep residual learning forimage recognitionrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition pp 770ndash778 LasVegas NA USA July 2016

[29] F Schroff D Kalenichenko and J Philbin ldquoFaceNet a unifiedembedding for face recognition and clusteringrdquo in Proceed-ings of the IEEE Conference on Computer Vision and PatternRecognition pp 815ndash823 Boston MA USA 2015

[30] Q You H Jin Z Wang C Fang and J Luo ldquoImage cap-tioning with semantic attentionrdquo in Proceedings of IEEE theConference Computer Vision and Pattern Recognitionpp 4651ndash4659 Las Vegas NV USA June 2016

[31] D Bahdanau K Cho and Y Bengio ldquoNeural Machinetranslation by jointly learning to align and translaterdquo 2014httparxivorgabs14090473

[32] A Karpathy A Joulin and L F Fei-Fei ldquoDeep fragmentembeddings for bidirectional image sentence mappingrdquo inProceedings of the International Conference on Neural Infor-mation Processing Systems pp 1889ndash1897 Montreal Canada2014

[33] H Nam J Ha and J Kim ldquoDual attention networks formultimodal reasoning and matchingrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognitionpp 2156ndash2164 Honolulu HI USA July 2017

[34] P Young A Lai M Hodosh and J Hockenmaier ldquoFromimage descriptions to visual denotations new similaritymetrics for semantic inference over event descriptionsrdquoTransactions of the Association for Computational Linguisticsvol 2 pp 67ndash78 2014

[35] H Hu J Gu Z Zhang J Dai and Y Wei ldquoRelation networksfor object detectionrdquo in Proceedings of the IEEECVF Con-ference on Computer Vision and Pattern Recognitionpp 3588ndash3597 Salt Lake City UT USA June 2018

[36] P Felzenszwalb R Girshick D McAllester and D RamananldquoObject detection with discriminatively trained part basedmodelsrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 32 no 9 2010

[37] B Bahmani ldquoScalable K-Means++rdquo in Proceedings of the VldbEndowment 5(7) pp 622ndash633 Istanbul Turkey August 2012

[38] R Kiros R Salakhutdinov and R S Zemel ldquoUnifying visual-semantic embeddings with multimodal neural languagemodelsrdquo 2014 httparxivorgabs14112539

[39] Y Yu S Tang K Aizawa and A Aizawa ldquoCategory-baseddeep CCA for fine-grained venue discovery from multimodaldatardquo IEEE Transactions on Neural Networks and LearningSystems vol 30 no 4 pp 1250ndash1258 2019

[40] J Johnson A Alahi and L Fei-Fei ldquoPerceptual losses for real-time style transfer and super-resolutionrdquo Computer Vision-ECCV 2016 pp 694ndash711 2016

[41] M Sinecen ldquoComparison of genomic best linear unbiasedprediction and bayesian regularization neural networks for ge-nomic selectionrdquo IEEE Access vol 7 pp 79199ndash79210 2019

[42] L Wang Y Li and S Lazebnik ldquoLearning deep structure-preserving image-text embeddingsrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionIEEE Las Vegas NA USA June 2016

[43] Y Y ZhangD S Zhou et al ldquoSingle-image crowd countingvia multi-column convolutional neural networkrdquo in Pro-ceedings of the Conference on Computer Vision and PatternRecognition pp 589ndash597 Las Vegas NV USA June 2016

[44] Y Xian and Y Tian ldquoSelf-guiding multimodal LSTM-whenwe do not have a perfect training dataset for image cap-tioningrdquo IEEE Transactions on Image Processing vol 28no 11 pp 5241ndash5252 2019

[45] A Karpathy and L Fei-Fei ldquoDeep visual-semantic alignmentsfor generating image descriptionsrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 39 no 4pp 664ndash676 2017

[46] K Papineni S Roukos T Ward and W-J Zhu ldquoBLEUAmethod for automatic evaluation of machine translationrdquo inProceedings of the Conference Association for ComputationalLinguistics Philadelphia PA USA 2002

Complexity 11

Page 8: Image-TextJointLearningforSocialImageswithSpatial ...Sep 27, 2019  · two scenarios are straightforward: given an input image (resp.sentence),thegoalistofindtherelationshipsbetween

(a)

0

20

40

60

80

0 20 40 60 80 100 120

(b)

Figure 4 -e visualization of ResNet-50 (a) After 50 layers of convolution (b) Fusion by 1 1 ratio

20

25

30

35

40

50

45

0 10 20 30 40 50Iter lowast 10

Reca

ll

Ranking loss

Center lossSomax

Figure 5 Recall curves when training on Microsoft COCO

A ndash020Rock 071Climber 029Climbs 032In ndash012Between 003Two ndash013Very ndash007Large 008Rocks 073

(a)

A ndash015Group 006Of ndash009Mountain 067Climbers 051Rests 054At ndash011e ndash008Summit 018

(b)

Two 003Boys 035In ndash009A ndash012Field 057Kicking 041A ndash012Soccer 022Ball 019

(c)

A ndash017Man 046Surfs 054In ndash012e ndash026Water 021

(d)

Figure 6 Similarity matching between image and text

8 Complexity

As shown in Tables 1 and 2 we compare our proposedmethod with some of the latest and most advanced processeson the Flickr30k andMicrosoft COCO datasets We can findthat our approach works better than the compared methodsDifferent fromDSPE+FVlowast that uses external text corpora tolearn discriminative sentence features our model learnsthem directly from scratch in an end-to-end manner Whencomparing among our methods we can conclude as follows(a) our attention scheme is sufficient since the model withattention consistently outperforms those without notice on

both datasets (b) using ResNet as the underlying network tounderstand the deep semantics of images to get the spatialrelation features with the help of text context relations

It takes 235 seconds to use a brute force algorithm and 18seconds to mark with a tree-based clustering algorithm onMicrosoft COCO datasets

442 Quantitative Evaluation We also report quantitativeevaluation results with the frequently used BLEUmetric [46]for the proposed datasets -e results for Microsoft COCOand Flickr30K datasets are listed in Tables 3 and 4-e imageencoders of all methods listed here are either VGG-Net orResNet which are prevalent in this field We also report theablation test in terms of discarding the spatial relationmodel

-e results demonstrate that our method has a betterperformance than discarding the spatial relation model

(a) (b)

(c) (d)

Figure 7 Examples of attending to the correct object (white indicates the attended regions words below the figure indicate the labelling ofspatial relation features) (a) Havest (b) Begging (c) Search (d) Salvage

Table 1 Results of caption-image retrieval on the Flickr30Kdataset

R1 R5 R10RVP (T+ I) [38] 119 277 477CDCCA [39] 168 393 530MNLM [40] 230 507 629BRNN [41] 222 482 614DSPE+ FVlowast [42] 403 689 799m-RNN [15] 339 651 763m-CNN [43] 346 647 768Softmax-VGG 218 507 617Softmax-ResNet 223 529 632Softmax-ResNet-Attention 356 654 756Ranking-ResNet-Attention 413 691 813lowastUsing external text corpora -e ensemble or multimodel methods

Table 2 Results of caption-image retrieval on theMicrosoft COCOdataset

R1 R5 R10DSPE+ FVlowast 494 761 864m-RNN 417 727 835Sg-LSTM [44] 428 739 847MNLM 439 757 858DVSA [45] 392 699 805m-CNN 431 739 841Softmax-VGG 475 749 843Softmax-ResNet 491 761 852Softmax-ResNet-Attention 503 775 875Ranking-ResNet-Attention 527 793 887lowastUsing external text corpora -e ensemble or multimodel methods

Table 3 -e results of different caption methods on the MicrosoftCOCO

Bleu-1 Bleu-2 Bleu-3 Bleu-4SCA-CNN [19] 0714 0543 0409 0309Semantic ATT [30] 0699 0415 0306 0232SCN-LSTM [20] 0726 0581 0421 0318Softmax-ResNet-Attention 0736 0587 0423 0325Ranking-ResNet-Attention 0741 0598 0438 0319Softmax-ResNet-Attentionlowast 0703 0524 0413 0311Ranking-ResNet-Attentionlowast 0705 0540 0416 0313lowastDiscarding spatial relation model

Table 4 -e results of different caption methods on the Flickr30K

Bleu-1 Bleu-2 Bleu-3 Bleu-4SCA-CNN [19] 0705 0515 0402 0304Semantic ATT [30] 0694 0427 0301 0224SCN-LSTM [20] 0718 0569 0413 0308Softmax-ResNet-Attention 0715 0573 0429 0325Ranking-ResNet-Attention 0729 0581 0416 0319Softmax-ResNet-Attentionlowast 0706 0533 0406 0301Ranking-ResNet-Attentionlowast 0712 0542 0418 0309lowastDiscarding spatial relation model

Complexity 9

Besides our approach is slightly better than the proposedstate-of-the-art methods which verifies the efficiency oftopic-condition in image-captioning Note that our modeluses ResNet-50 as the encoder which is a simple attentionmodel-us our approach is competitive with those models

5 Discussion and Conclusion

-is paper proposes an integrated model to recognize in-stances and objects jointly by leveraging the associatedtextual descriptions and presents a learning algorithm toestimate the model efficiently -e learning process requiresonly separate images and texts without high-level captionsWe use the residual network to deepen the learning of imagesemantic and combine with the text to obtain some hiddenrelation features contained in the picture By constructing ajoint inference module with self-attention we make a fusionof local and global elements We also show that integratingimages and text for deep semantic understanding to label thespatial relation features Furthermore the use of a treeclustering algorithm accelerates the matching process Ex-periments verify that the proposed method achieves com-petitive results on two generic annotation datasets Flickr30kand Microsoft COCO

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The project was sponsored by the National Natural Science Foundation of China (41571401 and 41871226) and the Natural Science Foundation of Chongqing, China (cstc2019jcyj-msxmX0131).

References

[1] J. Zhu, L. Fang, and P. Ghamisi, "Deformable convolutional neural networks for hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 8, pp. 1254–1258, 2018.
[2] W. Lu, J. Li, T. Li, W. Guo, H. Zhang, and J. Guo, "Web multimedia object classification using cross-domain correlation knowledge," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1920–1929, 2013.
[3] M. Katsurai, T. Ogawa, and M. Haseyama, "A cross-modal approach for extracting semantic relationships between concepts using tagged images," IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1059–1074, 2014.
[4] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei, "Referring relationships," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6867–6876, Salt Lake City, UT, USA, June 2018.
[5] B. Wang, D. Lin, H. Xiong, and Y. F. Zheng, "Joint inference of objects and scenes with efficient learning of text-object-scene relations," IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 507–520, 2016.
[6] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, "Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.
[7] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, "What value do explicit high level concepts have in vision to language problems?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 203–212, Las Vegas, NV, USA, June 2016.
[8] M. J. Choi, A. Torralba, and A. S. Willsky, "A tree-based context model for object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 240–252, 2012.
[9] R. Mottaghi, X. Chen, X. Liu et al., "The role of context for object detection and semantic segmentation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014.
[10] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the Computer Vision and Pattern Recognition, pp. 2960–2968, Las Vegas, NV, USA, June 2016.
[11] J. Johnson, L. Ballan, and L. Fei-Fei, "Love thy neighbors: image annotation by exploiting image metadata," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4624–4632, Tampa, FL, USA, December 2015.
[12] A. Farhadi, M. Hejrati, M. A. Sadeghi et al., "Every picture tells a story: generating sentences from images," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Heidelberg, 2010.
[13] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proceedings of the Conference on Computational Natural Language Learning, Portland, OR, USA, June 2011.
[14] M. Mitchell, X. Han, J. Dodge et al., "Generating image descriptions from computer vision detections," in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 2012.
[15] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in Proceedings of the ICLR, pp. 1–17, San Diego, CA, USA, May 2015.
[16] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, "Learning structured inference neural networks with label relations," in Proceedings of the CVPR, pp. 2960–2968, Las Vegas, NV, USA, June 2016.
[17] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun, "Semantic regularisation for recurrent image annotation," 2016, https://arxiv.org/abs/1611.05490.
[18] Y. Niu, Z. Lu, J.-R. Wen, T. Xiang, and S.-F. Chang, "Multi-modal multi-scale deep learning for large-scale image annotation," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1720–1731, 2019.
[19] L. Chen, H. Zhang, J. Xiao et al., "Spatial and channel-wise attention in convolutional networks for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6298–6306, Honolulu, Hawaii, July 2017.
[20] Z. Gan, "Semantic compositional networks for visual captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1141–1150, Honolulu, HI, USA, July 2017.


[21] A. Santoro, D. Raposo, D. G. Barrett et al., A Simple Neural Network Module for Relational Reasoning, pp. 4967–4976.
[22] S. Harnad, "The symbol grounding problem," Physica D: Nonlinear Phenomena, vol. 42, no. 1–3, pp. 335–346, 1990.
[23] M. Garnelo, K. Arulkumaran, and M. Shanahan, "Towards deep symbolic reinforcement learning," 2016, http://arxiv.org/abs/1609.05518.
[24] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, "Building machines that learn and think like people," Behavioral and Brain Sciences, vol. 40, 2017.
[25] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, "Scene graph generation by iterative message passing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3097–3106, Honolulu, Hawaii, July 2017.
[26] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, "Scene graph generation from objects, phrases and region captions," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1270–1279, Venice, Italy, October 2017.
[27] I. Yahav, O. Shehory, and D. Schwartz, "Comments mining with TF-IDF: the inherent bias and its removal," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 3, pp. 437–450, 2019.
[28] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, July 2016.
[29] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: a unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, Boston, MA, USA, 2015.
[30] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659, Las Vegas, NV, USA, June 2016.
[31] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, http://arxiv.org/abs/1409.0473.
[32] A. Karpathy, A. Joulin, and L. F. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 1889–1897, Montreal, Canada, 2014.
[33] H. Nam, J. Ha, and J. Kim, "Dual attention networks for multimodal reasoning and matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2156–2164, Honolulu, HI, USA, July 2017.
[34] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[35] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, Salt Lake City, UT, USA, June 2018.
[36] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, 2010.
[37] B. Bahmani, "Scalable K-Means++," in Proceedings of the VLDB Endowment, 5(7), pp. 622–633, Istanbul, Turkey, August 2012.
[38] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," 2014, http://arxiv.org/abs/1411.2539.
[39] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, "Category-based deep CCA for fine-grained venue discovery from multimodal data," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 1250–1258, 2019.
[40] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," Computer Vision - ECCV 2016, pp. 694–711, 2016.
[41] M. Sinecen, "Comparison of genomic best linear unbiased prediction and bayesian regularization neural networks for genomic selection," IEEE Access, vol. 7, pp. 79199–79210, 2019.
[42] L. Wang, Y. Li, and S. Lazebnik, "Learning deep structure-preserving image-text embeddings," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, June 2016.
[43] Y. Y. Zhang, D. S. Zhou et al., "Single-image crowd counting via multi-column convolutional neural network," in Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 589–597, Las Vegas, NV, USA, June 2016.
[44] Y. Xian and Y. Tian, "Self-guiding multimodal LSTM - when we do not have a perfect training dataset for image captioning," IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5241–5252, 2019.
[45] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664–676, 2017.
[46] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the Conference of the Association for Computational Linguistics, Philadelphia, PA, USA, 2002.


