
Proceedings of Recent Advances in Natural Language Processing, pages 862–868, Varna, Bulgaria, Sep 2–4, 2019.

https://doi.org/10.26615/978-954-452-056-4_100


From Image to Text in Sentiment Analysis via Regression and Deep Learning

Daniela Onita
Faculty of Mathematics and Computer Science
University of Bucharest
[email protected]

Liviu P. Dinu
Faculty of Mathematics and Computer Science
University of Bucharest
[email protected]

Adriana Birlutiu
Computer Science Department
1 Decembrie 1918 University of Alba Iulia
[email protected]

Abstract

Images and text represent types of content which are used together for conveying user emotions in online social networks. These contents are usually associated with a sentiment category. In this paper, we investigate an approach for mapping images to text for three types of sentiment categories: positive, neutral and negative. The mapping from images to text is performed using a Kernel Ridge Regression model. We considered two types of image features: i) RGB pixel-values features, and ii) features extracted with a deep learning approach. The experimental evaluation was performed on a Twitter data set containing both text and images and the sentiment associated with these. The experimental results show a difference in performance for different sentiment categories; in particular, the mapping that we propose performs better for the positive sentiment category in comparison with the neutral and negative ones. Furthermore, the experimental results show that the more complex deep learning features perform better than the RGB pixel-value features for all sentiment categories and for larger training sets.

1 Introduction

A quick look at an image is sufficient for a human to say a few words related to that image. However, this task, which is very easy for humans, remains very difficult for existing computer vision systems. The majority of previous work in computer vision has focused on labeling images with a fixed set of visual categories. However, even though closed vocabularies of visual concepts are a convenient modeling assumption, they are quite restrictive when compared to the vast amount of rich descriptions and impressions that a human can compose.

Some approaches that address the challenge of generating image descriptions have been proposed (Kulkarni et al., 2013; Karpathy and Fei-Fei, 2015). However, these models rely only on objective image descriptors, and do not take into account the subjectivity which appears when describing an image on social networks.

In this work, we want to take a step forward towards the goal of generating subjective descriptions of images that are close to the natural language used in social networks. Figure 1 gives a hint of the motivation of our work by showing several samples which were used in the experimental evaluation. Each sample consists of an image and the subjective text associated with it, and has a sentiment associated with it: negative, neutral or positive.

The goal of our work is to generate subjective descriptions of images. The main challenge towards this goal is in the design of a model that is rich enough to simultaneously reason about the contents of images and their representations in the natural language domain. Additionally, the model should be free of assumptions about specific templates or categories and instead rely on learning from the training data. The model will go beyond the simple description of an image and also give a subjective impression that the image could make upon a certain person. An example of this is shown in the image at the bottom-right of Figure 1, in which we do not have a caption or description of the animal in the image but the subjective impression that the image makes upon the viewer.

Our core insight is that we can map images to subjective natural text by leveraging the image-sentence data set in a supervised learning approach in which the image represents the input and the sentence represents the output.

[Figure 1 shows four image-text pairs; the images are not reproduced here. Panel texts: a. "This was so kak sad as a child"; b. "They fighting over who more useless"; c. "NEWS: Rams mailbag: How much has Case Keenum improved? #SPORTS #LATIMES"; d. "Well that's adorable".]

Figure 1: Motivation figure: our model treats language as a rich label space and generates subjective descriptions of images. The panels show examples of samples used in the experimental evaluation. Each sample consists of a pair made of an image and the subjective text associated with it, and each sample has a sentiment associated with it: samples a. and b. convey a negative sentiment; sample c. conveys a neutral sentiment; sample d. conveys a positive sentiment.

We employ a Kernel Ridge Regression model for the task of mapping images to text. We considered two types of image features: i) RGB pixel-values features, and ii) features extracted with a deep learning approach. We used a bag-of-words model to construct the text features. In addition, we consider several sentiment categories associated with each image-text sample, and analyze this mapping in the context of these sentiment categories.

We investigate data from Twitter. These data contain images and text associated with each image. The text is a subjective description or impression of the image, written by a user. Data from social networks, and especially Twitter, are usually associated with a sentiment, which can be positive, neutral or negative. We designed a system that automatically associates an image with a set of words from a dictionary, these words being not only descriptors of the content of the image, but also subjective impressions and opinions of the image.

One of the interesting findings of our work is that there is a difference in performance for different sentiment categories; in particular, the mapping performs better for the positive sentiment category in comparison with the neutral and negative categories. Furthermore, the experimental results show that the more complex deep learning features perform better than the RGB pixel-value features for all sentiment categories and for larger training sets.

The paper is organized as follows. Section 2 discusses related work. Section 3 describes a Kernel Ridge Regression model for image-to-text mapping. Section 4 presents the experimental evaluation performed on a real-world data set. Section 5 closes with conclusions and directions for future research.

2 Related Work

Image captioning. The research presented in this paper is in the direction of image captioning, but goes further to map images to text. The texts that we consider are not only descriptions of the images, which is the task of image captioning, but also contain subjective statements related to the images. Mapping images to text is an extension of the image captioning task, and this mapping allows us to build dictionaries of words and select from these dictionaries the words which are the most relevant to an image. The learning setting that we investigate in this paper is different from the image captioning setting, because our system automatically associates an image with a set of words from a dictionary, these words being not only descriptors of the content of the image, but also subjective opinions of the image. Image captioning has been actively studied in recent years; a recent survey is given in (Bai and An, 2018). Several approaches for image captioning make use of deep learning techniques (Bai and An, 2018; P. Singam, 2018).

Image description. Several approaches that address the challenge of generating image descriptions have been proposed (Kulkarni et al., 2013; Karpathy and Fei-Fei, 2015; Park et al., 2017; Ling and Fidler, 2017). However, these models rely only on objective image descriptors, and do not take into account the subjectivity which appears when describing an image on social networks.

Sentiment analysis. We investigate mapping images to text in the context of sentiment analysis. Most of the previous research in sentiment analysis is performed on text data. Recent works focus on sentiment analysis in images and videos (Yu et al., 2016; You et al., 2015; Wang et al., 2016). The research on visual sentiment analysis proceeds along two dimensions: i) based on hand-crafted features, and ii) based on features generated automatically. Deep learning techniques are capable of automatically learning robust features from a large number of images (Jindal and Singh, 2015). An interesting direction for sentiment analysis is related to word representations and capsule networks for NLP applications (Xing et al., 2019; Zhao et al., 2019).

3 Kernel Ridge Regression for Mapping Images to Text

Let $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ be the sets of inputs and outputs, respectively, where $n$ represents the number of observations. Let $F_X \in \mathbb{R}^{d_X \times n}$ and $F_Y \in \mathbb{R}^{d_Y \times n}$ denote the input and output feature matrices, where $d_X$ and $d_Y$ represent the dimensions of the input and output features, respectively. The inputs represent the images, and the input features can be either simple RGB pixel-values or something more complex, such as features extracted automatically using convolutional neural networks (O'Shea and Nash, 2015). The outputs represent the texts associated with the images, and the output features can be extracted using Word2Vec (Ma and Zhang, 2015).

A mapping between the inputs and the outputs can be formulated as a multi-linear regression problem (Cortes et al., 2005, 2007). Combined with Tikhonov regularization, this is also known as Kernel Ridge Regression (KRR). The KRR method is a regularized least squares method that is used for classification and regression tasks. It has the following objective function:

$$\arg\min_W \; \frac{1}{2}\|W F_X - F_Y\|_F^2 + \frac{\alpha}{2}\|W\|_F^2 \qquad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm, $\alpha$ is a regularization term, and the superscript $T$ used below denotes matrix transposition.

The solution of the optimization problem in Equation 1 involves the Moore-Penrose pseudo-inverse and has the following closed-form expression:

$$W = F_Y F_X^T (F_X F_X^T + \alpha I_{d_X})^{-1} \in \mathbb{R}^{d_Y \times d_X} \qquad (2)$$

which for low-dimensional feature spaces ($d_X, d_Y \le n$) can be calculated explicitly ($I_{d_X}$ in Equation 2 denotes the identity matrix of dimension $d_X$).
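As an illustration, the closed form of Equation 2 is a few lines of linear algebra. The following minimal sketch is ours rather than the authors' code; it assumes NumPy arrays FX and FY holding the feature matrices in the $d \times n$ layout defined above.

```python
import numpy as np

def krr_primal(FX, FY, alpha=1.0):
    """Closed-form solution of Eq. (2):
    W = F_Y F_X^T (F_X F_X^T + alpha I_{d_X})^{-1},
    with FX of shape (d_X, n) and FY of shape (d_Y, n)."""
    d_X = FX.shape[0]
    # Solve (F_X F_X^T + alpha I) Z = F_X F_Y^T, then W = Z^T;
    # solving a linear system is cheaper and more stable than inverting.
    Z = np.linalg.solve(FX @ FX.T + alpha * np.eye(d_X), FX @ FY.T)
    return Z.T  # shape (d_Y, d_X); a prediction is W @ f_x_new
```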

For high-dimensional data, an explicit computation of $W$ as presented in Equation 2 without prior dimensionality reduction is computationally expensive. Fortunately, Equation 2 can be rewritten as:

$$W = F_Y F_X^T (F_X F_X^T + \alpha I_{d_X})^{-1} = F_Y (F_X^T F_X + \alpha I_n)^{-1} F_X^T \qquad (3)$$

Making use of the kernel trick, the inputs $x_i$ are implicitly mapped to a high-dimensional Reproducing Kernel Hilbert space (Berlinet and Thomas-Agnan, 2011):

$$\Phi = [\phi(x_1), \ldots, \phi(x_n)]. \qquad (4)$$

When predicting a target $y_{new}$ from a new observation $x_{new}$, explicit access to $\Phi$ is never actually needed:

$$y_{new} = F_Y (\Phi^T \Phi + \alpha I_n)^{-1} \Phi^T \phi(x_{new}) = F_Y (K + \alpha I_n)^{-1} \kappa(x_{new}) \qquad (5)$$

With $K_{ij} = \phi(x_i)^T \phi(x_j)$ and $\kappa(x_{new})_i = \phi(x_i)^T \phi(x_{new})$, the prediction can be described entirely in terms of inner products in the higher-dimensional space. Not only does this approach work on the original data sets without the need for dimensionality reduction, but it also opens up ways to introduce non-linear mappings into the regression by considering different types of kernels, such as a Gaussian or a polynomial kernel.
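A minimal kernelized sketch of Equation 5, again ours rather than the authors' code, using a Gaussian kernel; here data points are assumed to be stored as rows, so X has shape (n, d_X) and Y has shape (n, d_Y).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1e-3):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, Y, alpha=1.0, gamma=1e-3):
    """Precompute (K + alpha I_n)^{-1} F_Y^T once, as needed by Eq. (5)."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), Y)  # shape (n, d_Y)

def krr_predict(X_train, coef, X_new, gamma=1e-3):
    """Eq. (5): y_new = F_Y (K + alpha I_n)^{-1} kappa(x_new), batched."""
    kappa = rbf_kernel(X_new, X_train, gamma)  # one row per new observation
    return kappa @ coef                        # shape (m, d_Y)
```

The same estimator is available off the shelf as sklearn.kernel_ridge.KernelRidge; the explicit version above simply makes the correspondence with Equations 4 and 5 visible.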

4 Experimental Evaluation

4.1 Dataset

We used a data set with images and text that was introduced in (Vadicamo et al., 2017). The data have been collected from Twitter posts over a period of 6 months and, using an LSTM-SVM architecture, the tweets have been divided into three sentiment categories: positive, neutral, and negative. For image labelling, the authors selected the data with the most confident textual sentiment predictions and used these predictions to automatically assign sentiment labels to the corresponding images. In our experimental evaluation we selected 10000 images and the corresponding 10000 tweets from each of the three sentiment categories. Figure 1 shows examples of image and text data used in the experimental evaluation.


[Figure 2 panels: each image is shown with its top-five predicted classes — 1 aircraft carrier, 2 fire boat, 3 drilling platform, 4 dock, 5 submarine; 1 racket, 2 scoreboard, 3 ballplayer, 4 flagpole, 5 stage; 1 Persian cat, 2 tabby, 3 Pekinese, 4 Egyptian cat, 5 tiger cat; 1 plate, 2 cheeseburger, 3 carbonara, 4 hotdog, 5 meat loaf.]

Figure 2: Visualizing heatmaps of class activation in an image.

4.2 Image and Text Features

4.2.1 Image Features

The research on feature extraction from images proceeds along two directions: i) traditional, hand-crafted features, and ii) automatically generated features. With the increasing number of images and videos on the web, traditional methods have a hard time handling the scalability and generalization problem. In contrast, techniques based on automatically generated features are capable of learning robust features from a large number of images (Jindal and Singh, 2015). We discuss below how these two directions for extracting features from images apply in our case; in particular, we use RGB pixel-values features for the first direction and deep learning based features for the second direction.

RGB pixel-values. In this approach for extracting features from images, we simply convert the images into arrays. Each image was sliced to get the RGB data. The 3-channel RGB image format was preferred over the 1-channel format since we wanted to use all the available information related to an image. Using this approach, each image was described by a 2352-dimensional (28 x 28 x 3) feature vector.
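A sketch of this feature extraction, under the assumption (suggested by the 28 x 28 x 3 = 2352 dimensionality) that each image is first resized to 28 x 28 pixels; the function name and the use of Pillow are ours, not the paper's.

```python
import numpy as np
from PIL import Image

def rgb_features(path, size=(28, 28)):
    """Load an image, force 3-channel RGB, resize, and flatten
    into a 2352-dimensional (28 * 28 * 3) feature vector."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32).reshape(-1)
```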

Deep Learning based features. Deep learning models use a cascade of layers to discover feature representations from data. Each layer of a convolutional network produces an activation for the given input. Earlier layers capture low-level features of the image such as blobs, edges, and colors. These primitive features are abstracted by the higher-level layers. Studies from the literature suggest that, when using pre-trained networks for feature extraction, the features should be extracted from the layer right before the classification layer (Rajaraman et al., 2018). For this reason, we extracted the features from the last layer before the final classification, so the entire convolutional base was used. The features were extracted using the pre-trained convolutional base of the VGG16 network (Simonyan and Zisserman, 2014). For computational reasons, the images were resampled to a 32 x 32 pixel resolution. The model was initialized with the ImageNet weights. To understand what part of an image was used to extract the features, the technique of visualizing heatmaps of class activation was employed. This technique illustrates how intensely the input image activates different channels, how important each channel is with regard to the class, and how intensely the input image activates the class. Figure 2 illustrates the heatmaps of class activation for some random images using VGG16 as a pre-trained convolutional base. The VGG16 model makes the final classification decision based on the highlighted parts of each image, and furthermore each image is associated with the five most representative captions.
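One plausible realization of this setup in Keras is sketched below; it is a hedged reconstruction, not the authors' code. The VGG16 convolutional base is loaded with ImageNet weights and without the classification layers, and its final activations are flattened into feature vectors.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Convolutional base only: include_top=False drops the classifier,
# so the output is the last feature map before classification.
base = VGG16(weights="imagenet", include_top=False, input_shape=(32, 32, 3))

def vgg16_features(images):
    """images: array of shape (n, 32, 32, 3) with RGB values in [0, 255]."""
    activations = base.predict(preprocess_input(images.astype("float32")))
    return activations.reshape(len(images), -1)  # (n, 512) for 32x32 inputs
```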

4.2.2 Text Features

We used a Bag-of-Words (BoW) model (Harris, 1954) for extracting the features from the text samples. The first step in building the BoW model consists of pre-processing the text: removing non-letter characters, removing HTML tags from the Twitter posts, converting words to lowercase, removing stop-words and splitting the text into words. A vocabulary is built from the words that appear in the text samples. The input of the BoW model is a list of strings and the output is a sparse matrix of dimension (number of samples) x (number of words in the vocabulary), having 1 if a given word from the vocabulary is contained in that particular text sample.


We initialized the BoW model with a maximum of 5000 features. We extracted a vocabulary for each sentiment category, and the corresponding 0-1 feature vector for each text sample.
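This pipeline maps naturally onto scikit-learn's CountVectorizer; the sketch below is an assumed implementation (cleaned_tweets is a hypothetical list of pre-processed tweet strings), with binary=True producing the 0-1 vectors described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already de-tagged tweet texts.
cleaned_tweets = ["well that is adorable", "they fighting over who more useless"]

# Binary bag-of-words over a vocabulary capped at 5000 words;
# stop-word removal and lowercasing are delegated to the vectorizer.
vectorizer = CountVectorizer(max_features=5000, stop_words="english",
                             lowercase=True, binary=True)
F_Y = vectorizer.fit_transform(cleaned_tweets)  # sparse (n_samples, vocab) 0/1 matrix
```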

4.3 Experimental Results

Evaluation Measure
Each output of our model represents a very large vector of probabilities, with the dimension equal to the number of words in the dictionary (approximately 5000 components). Each component of the output vector represents the probability that the corresponding word from the vocabulary is a descriptor of that image. Given this particular form of the output, the evaluation measure was computed using the following algorithm:

1. we sorted the absolute values of the predicted output vector in descending order;

2. we created a new vector containing the first 50 words from the predicted output vector;

3. we computed the Euclidean distance between the predicted output vector values and the actual output vector.

The actual output vector is a sparse vector; a component in this vector is 1 if the corresponding word from the vocabulary is contained in that particular description of the image.

The values computed in step 3 above were averaged over the entire test data set, and the resulting average was taken as the error.
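One reading of the three steps above, expressed as a sketch (our interpretation, not the authors' code): the 50 components of the prediction with the largest absolute values are kept, all others are zeroed, and the Euclidean distance to the binary ground-truth vector is returned.

```python
import numpy as np

def prediction_error(y_pred, y_true, k=50):
    """Keep the k highest-|value| components of y_pred, zero the rest,
    and return the Euclidean distance to the 0/1 target vector."""
    top = np.argsort(-np.abs(y_pred))[:k]
    kept = np.zeros_like(y_pred)
    kept[top] = y_pred[top]
    return np.linalg.norm(kept - y_true)
```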

Experimental Protocol
We designed an experimental protocol that would help us answer the following questions:

1. Can our proposed Kernel Ridge Regression model map images to natural language descriptors?

2. What is the difference between the two types of image features that we considered? In particular, we are interested in whether the more complex deep learning features give a better performance than the simple RGB pixel-values features.

3. Is there a difference in performance based on the sentiment associated with each image-text sample?

Figure 3: The plots show mean errors and standard deviations for different sizes of the training set, comparing RGB pixel-values features with the more complex VGG16 features. The rows correspond to the sentiment categories: top row, positive; middle row, neutral; bottom row, negative.


Figure 4: Comparison of the learning performance based on the type of sentiment, using the VGG16 image features.

We designed the following experimental protocol. For each of the three sentiment categories, we randomly split the data 5 times into training and testing sets, taking 70% for training and the rest for testing. For training the model, we considered different sizes of the training set: from 50 to 7000 observations with a step size of 50. For a correct evaluation, the models built on these different training sets were evaluated on the same test set. The error was averaged over the 5 random splits of the data into training and testing.
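A sketch of this protocol, reusing the hypothetical helpers defined earlier (krr_fit, krr_predict, prediction_error) and assuming FX and FY are dense arrays with one sample per row:

```python
import numpy as np
from sklearn.model_selection import train_test_split

sizes = range(50, 7001, 50)
errors = {m: [] for m in sizes}
for split in range(5):  # 5 random 70/30 train/test splits
    X_tr, X_te, Y_tr, Y_te = train_test_split(FX, FY, test_size=0.3,
                                              random_state=split)
    for m in sizes:  # growing training sets, evaluated on the same test set
        coef = krr_fit(X_tr[:m], Y_tr[:m])
        preds = krr_predict(X_tr[:m], coef, X_te)
        errors[m].append(np.mean([prediction_error(p, t)
                                  for p, t in zip(preds, Y_te)]))
# Mean and standard deviation across splits, as plotted in Figure 3.
learning_curve = {m: (np.mean(e), np.std(e)) for m, e in errors.items()}
```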

Results
The first two questions raised above can be answered by analyzing the experimental results shown in Figure 3. The plots show the learning curves (mean errors and standard deviations) for different sizes of the training set and for different sentiment categories. Since the error decreases as the training size increases, we can say that learning takes place; thus, our proposed model can map images to natural language descriptors.

The plots in Figure 3 also show the comparison between the RGB pixel-values and VGG16 features for the three sentiment categories considered. Overall, the more complex deep learning features give a better performance than the simple RGB pixel-values features.

To answer the third question, we analyzed the experimental results shown in Figure 4. There is a significant difference in learning performance for the positive sentiment category in comparison with the other two categories, both using RGB pixel-values features and VGG16 features. The positive category is easier to learn because of the subjective nature of the images: a positive feeling is interpreted as positive by the majority of people, whereas a neutral or negative sentiment can carry a different meaning depending on the person.

Furthermore, analyzing Figure 3 again, we see that the neutral sentiment category behaves differently from the positive and negative sentiment categories with respect to the image features used. In the case of neutral sentiment, the more complex VGG16 features appear to perform better than the simpler RGB pixel-values features as the size of the data increases. For the positive and negative sentiment categories, the simpler RGB pixel-values features lead to an error that varies considerably, while with the VGG16 features the error is more stable.

5 Conclusions and Future Work

In this work, we investigated a method for image-to-text mapping in the context of sentiment analysis. The mapping from images to text was performed using a Kernel Ridge Regression model. We considered two types of image features: i) the simple RGB pixel-values features, and ii) a more complex set of image features extracted with a deep learning approach. Furthermore, in this paper we took a step forward from the image captioning task, which allows us to build dictionaries of words and select from these dictionaries the words which are most relevant to an image. We performed the experimental evaluation on a Twitter data set containing both text and images and the sentiment associated with these. We found that there is a difference in performance for the different sentiment categories; in particular, the mapping performs better for the positive sentiment category in comparison with the neutral and negative ones, for both feature extraction techniques.

We plan to further extend our approach by investigating the input-output kernel regression type of learning (Brouard et al., 2016). The output kernel would allow us to take into account the structure of the output space and benefit from the use of kernels. We also plan to integrate into our model textual captions of images obtained using a pre-trained network (Simonyan and Zisserman, 2014). The textual captions could be used as a new type of feature and could be compared and integrated with the other two types of image features considered.


References

Bai, S. and An, S. (2018). A survey on automatic image caption generation. Neurocomputing, 311.

Berlinet, A. and Thomas-Agnan, C. (2011). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media.

Brouard, C., Szafranski, M., and d'Alché-Buc, F. (2016). Input output kernel regression: supervised and semi-supervised structured output prediction with operator-valued kernels. The Journal of Machine Learning Research, 17(1):6105–6152.

Cortes, C., Mohri, M., and Weston, J. (2005). A general regression technique for learning transductions. In Proceedings of ICML 2005, pages 153–160. ACM Press.

Cortes, C., Mohri, M., and Weston, J. (2007). A general regression framework for learning string-to-string mappings. In Predicting Structured Data. MIT Press.

Harris, Z. (1954). Distributional structure. Word, 10(2-3):146–162.

Jindal, S. and Singh, S. (2015). Image sentiment analysis using deep convolutional neural networks with domain specific fine tuning. In 2015 International Conference on Information Processing (ICIP), pages 447–451. IEEE.

Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., and Berg, T. L. (2013). Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903.

Ling, H. and Fidler, S. (2017). Teaching machines to describe images via natural language feedback. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5075–5085. Curran Associates Inc.

Ma, L. and Zhang, Y. (2015). Using word2vec to process big text data. In 2015 IEEE International Conference on Big Data (Big Data), pages 2895–2897. IEEE.

O'Shea, K. and Nash, R. (2015). An introduction to convolutional neural networks. ArXiv e-prints.

P. Singam, Ashvini Bhandarkar, B. M. C. T. K. K. (2018). Automated image captioning using convnets and recurrent neural network. International Journal for Research in Applied Science and Engineering Technology.

Park, C., Kim, B., and Kim, G. (2017). Attend to you: Personalized image captioning with context sequence memory networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 895–903.

Rajaraman, S., Antani, S. K., Poostchi, M., Silamut, K., Hossain, M. A., Maude, R. J., Jaeger, S., and Thoma, G. R. (2018). Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images. PeerJ, 6:e4568.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Vadicamo, L., Carrara, F., Cimino, A., Cresci, S., Dell'Orletta, F., Falchi, F., and Tesconi, M. (2017). Cross-media learning for image sentiment analysis in the wild. In The IEEE International Conference on Computer Vision (ICCV) Workshops.

Wang, J., Fu, J., Xu, Y., and Mei, T. (2016). Beyond object recognition: Visual sentiment analysis with deep coupled adjective and noun neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 3484–3490. AAAI Press.

Xing, F. Z., Pallucchini, F., and Cambria, E. (2019). Cognitive-inspired domain adaptation of sentiment lexicons. Information Processing & Management, 56(3):554–564.

You, Q., Luo, J., Jin, H., and Yang, J. (2015). Joint visual-textual sentiment analysis with deep neural networks. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 1071–1074, New York, NY, USA. ACM.

Yu, Y., Lin, H., Meng, J., and Zhao, Z. (2016). Visual and textual sentiment analysis of a microblog using deep convolutional neural networks. Algorithms, 9(2):41.

Zhao, W., Peng, H., Eger, S., Cambria, E., and Yang, M. (2019). Towards scalable and reliable capsule networks for challenging NLP applications. arXiv preprint arXiv:1906.02829.