
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3454–3466, Brussels, Belgium, October 31 - November 4, 2018. © 2018 Association for Computational Linguistics


Contextual Inter-modal Attention for Multi-modal Sentiment Analysis

Deepanway Ghosal†, Md Shad Akhtar†, Dushyant Chauhan†, Soujanya Poria⋆, Asif Ekbal† and Pushpak Bhattacharyya†

† Department of Computer Science & Engineering, Indian Institute of Technology Patna, India
{deepanway.me14,shad.pcs15,1821CS18,asif,pb}@iitp.ac.in

⋆ School of Computer Science and Engineering, Nanyang Technological University, Singapore
[email protected]

Abstract

Multi-modal sentiment analysis offers various challenges, one being the effective combination of different input modalities, namely text, visual and acoustic. In this paper, we propose a recurrent neural network based multi-modal attention framework that leverages the contextual information for utterance-level sentiment prediction. The proposed approach applies attention on multi-modal multi-utterance representations and tries to learn the contributing features amongst them. We evaluate our proposed approach on two multi-modal sentiment analysis benchmark datasets, viz. the CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) corpus and the recently released CMU Multi-modal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) corpus. Evaluation results show the effectiveness of our proposed approach, with accuracies of 82.31% and 79.80% for the MOSI and MOSEI datasets, respectively. These are approximately 2 and 1 points of performance improvement over the state-of-the-art models for these datasets.

1 Introduction

Traditionally, sentiment analysis (Pang and Lee, 2005, 2008) has been applied to a wide variety of texts (Hu and Liu, 2004; Liu, 2012; Turney, 2002; Akhtar et al., 2016, 2017; Mohammad et al., 2013). In contrast, multi-modal sentiment analysis has recently gained attention due to the tremendous growth of many social media platforms such as YouTube, Instagram, Twitter, Facebook, etc. (Chen et al., 2017; Poria et al., 2016, 2017d,b; Zadeh et al., 2017, 2016). It depends on the information that can be obtained from more than one modality (e.g. text, visual and acoustic) for the analysis. The motivation is to leverage the varieties of (often distinct) information from multiple sources for building an efficient system. For example, it is a non-trivial task to detect the sentiment of the sarcastic sentence "My neighbours are home!! it is good to wake up at 3am in the morning." as negative considering only the textual information. However, if the system has access to some other source of information, e.g. visual, it can easily detect the unpleasant gestures of the speaker and would classify it with negative sentiment polarity. Similarly, for some instances acoustic features such as intensity, pitch, pause etc. have an important role to play in the correctness of the system. However, combining this information in an effective manner is a non-trivial task that researchers often have to face (Zadeh et al., 2017; Chen et al., 2017).

A video provides a good source for extracting multi-modal information. In addition to the visual frames, it also provides information such as the acoustic and textual representation of the spoken language. Additionally, a speaker can utter multiple utterances in a single video, and these utterances can have different sentiments. The sentiment information of an utterance often has inter-dependence on other contextual utterances. Classifying such an utterance in an independent manner poses many challenges to the underlying algorithm.

In this paper, we propose a novel method that employs a recurrent neural network based multi-modal multi-utterance attention framework for sentiment prediction. We hypothesize that applying attention to contributing neighboring utterances and/or multi-modal representations may assist the network to learn in a better way. The main challenge in multi-modal sentiment analysis lies in the proper utilization of the information extracted from multiple modalities. Although it is often argued that incorporating all the available modalities is always beneficial for enhanced performance, it must be noted that not all the modalities play an equal role.


Another concern in a multi-modal framework is that the presence of noise in one modality can affect the overall performance. To better address these concerns, we propose a novel fusion method by focusing on inter-modality relations computed between the target utterance and its context. We argue that in multi-modal sentiment classification, not only is the relation between two modalities of the same utterance important, but so is the relatedness with the modalities across its context.

Think of an utterance $U_t$ that consists of three modalities, say $A_t$ (audio), $V_t$ (visual) and $T_t$ (text). Let us also assume that $U_k$ is a member of the contextual utterances, consisting of the modalities $A_k$, $V_k$ and $T_k$. In this case, our model computes the relatedness among the modalities (e.g., $V_t$ and $T_k$) of $U_t$ and $U_k$ in order to produce a richer multi-modal representation for the final classification. The attention mechanism is then used to attend to the important contextual utterances having higher relatedness or similarity (computed using inter-modality correlations) with the target utterance.

Unlike previous approaches that simply apply attention over the contextual utterances for classification, we attend over the contextual utterances by computing correlations among the modalities of the target utterance and the context utterances. This explicitly helps us to distinguish which modalities of the relevant contextual utterances are more important for sentiment prediction of the target utterance. The model facilitates this modality selection by attending over the contextual utterances, and thus generates a better multi-modal feature representation when these modalities from the context are combined with the modalities of the target utterance. We evaluate our proposed approach on two recent benchmark datasets, i.e. CMU-MOSI (Zadeh et al., 2016) and CMU-MOSEI (Zadeh et al., 2018c), with CMU-MOSEI being the largest available dataset for multi-modal sentiment analysis (c.f. Section 4.1). Evaluation shows that the proposed attention framework attains better performance than the state-of-the-art systems for various combinations of input modalities (i.e. text, visual & acoustic).

The main contributions of our proposed work are three-fold: a) we propose a novel technique for multi-modal sentiment analysis; b) we propose an effective attention framework that leverages contributing features across multiple modalities and neighboring utterances for sentiment analysis; and c) we present state-of-the-art systems for sentiment analysis on two different benchmark datasets.

2 Related Work

A survey of the literature suggests that multi-modal sentiment prediction is a relatively new area compared to text-based sentiment prediction (Morency et al., 2011; Mihalcea, 2012; Poria et al., 2016, 2017b; Zadeh et al., 2018a). A good review covering the literature from uni-modal to multi-modal analysis is presented in (Poria et al., 2017a). An application of a multi-kernel learning based fusion technique was proposed in (Poria et al., 2016), where the authors employed deep convolutional neural networks for extracting the textual features and fused them with the other (visual & acoustic) modalities for prediction.

Zadeh et al. (2016) introduced a multi-modal dictionary to better understand the interaction between facial gestures and spoken words when expressing sentiment. The authors introduced the MOSI dataset, the first of its kind to enable studies of multi-modal sentiment intensity analysis. Zadeh et al. (2017) proposed a Tensor Fusion Network (TFN) model to learn the intra-modality and inter-modality dynamics of the three modalities (i.e. text, visual and acoustic). They reported improved accuracy using multi-modality on the CMU-MOSI dataset. An application that leverages a gated multi-modal embedded Long Short Term Memory (LSTM) with temporal attention (GME-LSTM(A)) for word-level fusion of multi-modality inputs was proposed in (Chen et al., 2017). The Gated Multi-modal Embedding (GME) alleviates the difficulties of fusion, while the LSTM with Temporal Attention (LSTM(A)) performs word-level fusion.

The works mentioned above did not take contextual information into account. Poria et al. (2017b) proposed an LSTM based framework that leverages the contextual information to capture the inter-dependencies between the utterances. In another work, Poria et al. (2017d) proposed a user-opinion-based framework to combine the three modality inputs (i.e. text, visual & acoustic) by applying a multi-kernel learning based method. Zadeh et al. (2018a) proposed multi-attention blocks (MAB) to capture information across the three modalities (text, visual & acoustic). They reported improved accuracies in the range of 2-3% over the state-of-the-art models for the different datasets.

The fundamental difference between our proposed method and the existing works is that our framework focuses on the neighboring utterances to leverage contextual information for utterance-level sentiment prediction. To the best of our knowledge, our current work is the first of its kind to employ a multi-modal attention block (exploiting neighboring utterances) for sentiment prediction. We use a multi-modal attention framework that leverages contributing features across multiple modalities and the neighboring utterances for sentiment analysis.

3 Proposed Methodology

In our proposed framework, we aim to leverage the multi-modal and contextual information for predicting the sentiment of an utterance. Utterances of a particular speaker in a video represent time-series information, and it is logical that the sentiment of a particular utterance would affect the sentiments of the other neighboring utterances. To model the relationship with the neighboring utterances and multi-modality, we propose a recurrent neural network based multi-modal attention framework. The proposed framework takes multi-modal information (i.e. text, visual & acoustic) for a sequence of utterances and feeds it into three separate bi-directional Gated Recurrent Units (GRUs) (Cho et al., 2014). This is followed by a dense (fully-connected) layer which is shared across the time-steps or utterances (one each for text, visual & acoustic). We then apply multi-modal attention on the outputs of the dense layers. The objective is to learn the joint association between the multiple modalities & utterances, and to emphasize the contributing features by putting more attention on them. In particular, we employ a bi-modal attention framework, where an attention function is applied to the representations of pairwise modalities, i.e. visual-text, text-acoustic and acoustic-visual. Finally, the outputs of the pairwise attentions along with the individual representations are concatenated and passed to the softmax layer for classification. We call our proposed architecture the Multi-Modal Multi-Utterance Bi-Modal Attention (MMMU-BA) framework. The overall architecture of the proposed MMMU-BA framework is illustrated in Figure 1; please refer to Figure 3 in the appendix for an illustration of the attention computation.
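As an illustration of the per-modality encoding step described above, the following PyTorch sketch (our own, not the authors' implementation; the layer sizes follow Section 4.3 and the tensor shapes are assumptions) passes the utterance sequence of each modality through a Bi-GRU and a dense layer shared across utterances:

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Bi-GRU over the utterance sequence of one modality, followed by a
    dense layer shared across utterances (time steps)."""
    def __init__(self, in_dim, gru_units=300, dense_units=100, dropout=0.4):
        super().__init__()
        self.bigru = nn.GRU(in_dim, gru_units, batch_first=True,
                            bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # Applied to every utterance independently (shared weights).
        self.dense = nn.Linear(2 * gru_units, dense_units)

    def forward(self, x):                 # x: (batch, u, in_dim)
        h, _ = self.bigru(x)              # (batch, u, 2 * gru_units)
        return torch.relu(self.dense(self.dropout(h)))  # (batch, u, d)

# One encoder per modality; d = 100 for all three, so their outputs can
# later be compared by the pairwise attention blocks.
text_enc, visual_enc, acoustic_enc = (ModalityEncoder(300),
                                      ModalityEncoder(35),
                                      ModalityEncoder(74))
T = text_enc(torch.randn(8, 20, 300))      # 8 videos, 20 utterances each
V = visual_enc(torch.randn(8, 20, 35))
A = acoustic_enc(torch.randn(8, 20, 74))
```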

For comparison, we also experiment with two other variants of the proposed MMMU-BA framework: a) the Multi-Modal Uni-Utterance Self Attention (MMUU-SA) framework and b) the Multi-Utterance Self Attention (MU-SA) framework. The architectures of these variants differ with respect to the attention computation module, and the naming conventions "MMMU", "MMUU" and "MU" signify the information that participates in the attention computation. For example, in MMMU-BA we compute attention over the multi-modal and multi-utterance inputs, whereas in MMUU-SA the attention is computed over the multi-modal but uni-utterance inputs. In contrast, we compute attention over only the multi-utterance inputs in MU-SA. The rest of the components remain the same for all three variants.

3.1 Multi-modal Multi-utterance - Bi-modal Attention (MMMU-BA) Framework

Assuming a particular video has $u$ utterances, the raw utterance-level multi-modal features are represented as $T_R \in \mathbb{R}^{u \times 300}$ (raw text), $V_R \in \mathbb{R}^{u \times 35}$ (raw visual) and $A_R \in \mathbb{R}^{u \times 74}$ (raw acoustic). Three separate Bi-GRU layers with forward & backward state concatenation are first applied on the raw data, followed by fully-connected dense layers, resulting in $T \in \mathbb{R}^{u \times d}$ (text), $V \in \mathbb{R}^{u \times d}$ (visual) and $A \in \mathbb{R}^{u \times d}$ (acoustic), where $d$ is the number of neurons in the dense layer. Finally, pairwise attentions are computed on the combinations of the three modalities: (V, T), (T, A) & (A, V). In particular, the attention between V and T is computed as follows:

• Bi-modal Attention: The modality representations of V & T are obtained from the Bi-GRU network, and hence contain the contextual information of the utterances for each modality. At first, we compute a pair of matching matrices $M_1, M_2 \in \mathbb{R}^{u \times u}$ over the two representations that account for the cross-modality information.

$$M_1 = V \cdot T^{\mathsf{T}} \qquad \& \qquad M_2 = T \cdot V^{\mathsf{T}}$$

• Multi-Utterance Attention: As mentioned earlier, in the proposed model we aim to leverage the contextual information of each utterance for the prediction. We compute the probability distribution scores ($N_1, N_2 \in \mathbb{R}^{u \times u}$) over each utterance of the bi-modal attention matrices $M_1$ & $M_2$ using a softmax function. This essentially computes the attention weights for the contextual utterances.

[Figure 1: Overall architecture of the proposed MMMU-BA framework. The raw text ($T_R$), visual ($V_R$) and acoustic ($A_R$) utterance sequences are passed through separate Bi-GRU and dense layers, followed by pairwise attention blocks whose outputs are concatenated and fed to a softmax layer.]

Finally, soft attention is applied over the multi-modal multi-utterance attention matrices to compute the modality-wise attentive representations (i.e. $O_1$ & $O_2$).

$$N_1(i,j) = \frac{e^{M_1(i,j)}}{\sum_{k=1}^{u} e^{M_1(i,k)}}, \qquad N_2(i,j) = \frac{e^{M_2(i,j)}}{\sum_{k=1}^{u} e^{M_2(i,k)}} \qquad \text{for } i, j = 1, \ldots, u$$

$$O_1 = N_1 \cdot T \qquad \& \qquad O_2 = N_2 \cdot V$$

• Multiplicative Gating & Concatenation: Finally, a multiplicative gating function, following (Dhingra et al., 2016), is computed between the multi-modal utterance-specific representations of each individual modality and the other modalities. This element-wise matrix multiplication assists in attending to the important components of multiple modalities and utterances.

$$A_1 = O_1 \odot V \qquad \& \qquad A_2 = O_2 \odot T$$

Attention matrices $A_1$ & $A_2$ are then concatenated to obtain MMMU-BA$_{VT} \in \mathbb{R}^{u \times 2d}$ between V and T.

$$\text{MMMU-BA}_{VT} = \text{concat}[A_1, A_2]$$

MMMU-BA$_{AV}$ & MMMU-BA$_{TA}$ computations: Similar to MMMU-BA$_{VT}$, we follow the same procedure to compute MMMU-BA$_{AV}$ & MMMU-BA$_{TA}$. For a data source comprising raw visual ($V_R$), acoustic ($A_R$) & text ($T_R$) modalities, we first compute the bi-modal attention pairs for each combination, i.e. MMMU-BA$_{VT}$, MMMU-BA$_{AV}$ & MMMU-BA$_{TA}$. Finally, motivated by the residual skip connection network (He et al., 2016), we concatenate the bi-modal attention pairs with the individual modalities (i.e. V, A & T) to boost the gradient flow to the lower layers. This concatenated feature is then used for final classification.
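To make the computation above concrete, the following NumPy sketch (our own illustration, not the authors' released code; it assumes the dense-layer outputs V, T, A of a single video, each of shape (u, d)) implements the bi-modal attention, gating and concatenation steps:

```python
import numpy as np

def row_softmax(m):
    """Softmax over each row of a (u, u) matching matrix."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mmmu_ba(x, y):
    """Bi-modal, multi-utterance attention between two modalities.
    x, y: (u, d) dense-layer outputs for one video."""
    m1, m2 = x @ y.T, y @ x.T                   # matching matrices, (u, u)
    n1, n2 = row_softmax(m1), row_softmax(m2)   # attention over utterances
    o1, o2 = n1 @ y, n2 @ x                     # attentive representations
    a1, a2 = o1 * x, o2 * y                     # multiplicative gating
    return np.concatenate([a1, a2], axis=1)     # e.g. MMMU-BA_VT, (u, 2d)

u, d = 9, 100
V, T, A = (np.random.randn(u, d) for _ in range(3))
# Bi-modal attention pairs plus the individual modalities (skip connection),
# forming the feature that is fed to the softmax classification layer.
fused = np.concatenate([mmmu_ba(V, T), mmmu_ba(A, V), mmmu_ba(T, A),
                        V, A, T], axis=1)
```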

3.2 Multi-Modal Uni-Utterance - Self Attention (MMUU-SA) Framework

The MMUU-SA framework does not account for information from the other utterances at the attention level; rather, it utilizes the multi-modal information of a single utterance for predicting the sentiment. For a video having $q$ utterances, $q$ separate attention blocks are needed, where each block computes the self-attention over the multi-modal information of a single utterance. Let $X_{u_p} \in \mathbb{R}^{3 \times d}$ be the information matrix of the $p^{th}$ utterance, where the three $d$-dimensional rows are the outputs of the dense layers for the three modalities. The attention matrix $A_{u_p} \in \mathbb{R}^{3 \times d}$ is computed separately for $p = 1, 2, \ldots, q$. Finally, for each utterance $p$, $A_{u_p}$ and $X_{u_p}$ are concatenated and passed to the output layer for classification. Please refer to the appendix for more details.

3.3 Multi-Utterance - Self Attention (MU-SA) Framework

In the MU-SA framework, we apply self attention on the utterances of each modality separately, and use these for classification. In contrast to the MMUU-SA framework, MU-SA utilizes the contextual information of the utterances at the attention level. Let $T \in \mathbb{R}^{u \times d}$ (text), $V \in \mathbb{R}^{u \times d}$ (visual) and $A \in \mathbb{R}^{u \times d}$ (acoustic) be the outputs of the dense layers. For the three modalities, three separate attention blocks are required, where each block takes the multi-utterance information of a single modality and computes the self-attention matrix. Attention matrices $A_t$, $A_v$ and $A_a$ are computed for text, visual and acoustic, respectively. Finally, $A_v$, $A_t$, $A_a$, $V$, $T$ & $A$ are concatenated and passed to the output layer for classification.

4 Datasets, Experiments and Analysis

In this section we describe the datasets used for our experiments and report the results along with the necessary analysis.

4.1 Datasets

We evaluate our proposed approach on two benchmark datasets, namely the CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) corpus (Zadeh et al., 2016) and the recently published CMU Multi-modal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset (Zadeh et al., 2018c). The CMU-MOSI dataset consists of 93 videos spanning 2199 utterances. Each utterance has a sentiment label associated with it. It has 52, 10 & 31 videos in the training, validation & test sets, accounting for 1151, 296 & 752 utterances, respectively.

CMU-MOSEI has 3229 videos with 22676 utterances from more than 1000 online YouTube speakers. The training, validation & test sets comprise 16216, 1835 & 4625 utterances, respectively. More details about these datasets are presented in the appendix.

Each utterance in the CMU-MOSI dataset has been annotated as either positive or negative, whereas in the CMU-MOSEI dataset the labels lie in the continuous range of -3 to +3. However, in this work we project the instances of CMU-MOSEI into a two-class classification setup, where values ≥ 0 signify positive sentiment and values < 0 signify negative sentiment. We adopt this strategy to be consistent with previously published work on the CMU-MOSI dataset (Poria et al., 2017b; Chen et al., 2017).
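A small sketch of this projection (our own illustration; the score values are hypothetical CMU-MOSEI annotations in [-3, +3]):

```python
import numpy as np

scores = np.array([-2.4, 0.0, 1.7, -0.3])                 # hypothetical annotations
labels = np.where(scores >= 0, "positive", "negative")    # two-class projection
# -> ['negative' 'positive' 'positive' 'negative']
```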

4.2 Feature extraction

We use the CMU Multimodal Data SDK (https://github.com/A2Zadeh/CMU-MultimodalDataSDK) (Zadeh et al., 2018a) for feature extraction. For the MOSEI dataset, word-level features were provided, where text features were extracted with GloVe embeddings, visual features with Facets (https://pair-code.github.io/facets/) & acoustic features with CovaRep (Degottex et al., 2014). Thereafter, we compute the average of the word-level features in an utterance to obtain the utterance-level features. For each word, the dimension of the feature vector is 300 (text), 35 (visual) & 74 (acoustic).
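As a small illustration of this averaging step (our own sketch; the word count is hypothetical and the dimensions follow the text):

```python
import numpy as np

def utterance_features(word_feats):
    """Average word-level vectors into a single utterance-level vector.
    word_feats: (num_words, dim) array for one utterance."""
    return word_feats.mean(axis=0)

glove_words = np.random.randn(14, 300)      # 14 words, 300-d GloVe vectors
utt_text = utterance_features(glove_words)  # (300,) utterance-level text feature
```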

In contrast, for the MOSI dataset we use the utterance-level features provided by (Poria et al., 2017b) (https://github.com/SenticNet/contextual-sentiment-analysis). These utterance-level features are the outputs of a convolutional neural network (Karpathy et al., 2014), a 3D convolutional neural network (Ji et al., 2013) & openSMILE (Eyben et al., 2010) for the text, visual & acoustic modalities, respectively. The dimensions of the utterance-level features are 100, 100 & 73 for text, visual & acoustic, respectively.

4.3 Experiments

We evaluate our proposed approach on CMU-MOSI (test data) & CMU-MOSEI (dev data; the gold annotations of the CMU-MOSEI test data were not released at the time of paper submission). The accuracy score is used as the evaluation metric.

We use bi-directional GRUs having 300 neurons, each followed by a dense layer consisting of 100 neurons. Utilizing the dense layer, we project the input features of all three modalities to the same dimension. We set dropout = 0.5 (MOSI) & 0.3 (MOSEI) as a measure of regularization. In addition, we also use dropout = 0.4 (MOSI) & 0.3 (MOSEI) for the Bi-GRU layers. We employ the ReLU activation function in the dense layers, and softmax activation in the final classification layer. For training the network we set the batch size to 32, use the Adam optimizer with the cross-entropy loss function, and train for 50 epochs. We report the average result of 5 runs for all our experiments.
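A minimal PyTorch sketch of this training setup (our own illustration of the stated hyper-parameters; the stand-in model and random data are placeholders, not the authors' network or features):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the full network of Figure 1: any module mapping fused
# utterance features to 2 sentiment logits would do here.
model = nn.Sequential(nn.Linear(600, 100), nn.ReLU(), nn.Dropout(0.5),
                      nn.Linear(100, 2))

# Dummy data in place of the real fused multi-modal features.
feats, labels = torch.randn(256, 600), torch.randint(0, 2, (256,))
train_loader = DataLoader(TensorDataset(feats, labels),
                          batch_size=32, shuffle=True)   # batch size 32

optimizer = torch.optim.Adam(model.parameters())         # Adam optimizer
criterion = nn.CrossEntropyLoss()                        # cross-entropy loss

model.train()
for epoch in range(50):                                  # 50 epochs
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```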

We experiment with all the valid combinations of uni-modal (only one modality taken at a time), bi-modal (any two modalities taken at a time) and tri-modal (all three modalities taken at a time) inputs for text, visual and acoustic. In the multi-modal attention frameworks, i.e. MMMU-BA & MMUU-SA, attention is computed over at least two modalities; hence, these two frameworks are not applicable (NA) for the uni-modal experiments in Table 1.

For the MOSEI dataset, we obtain the best uni-modal performance with text. Subsequently, we take two modalities at a time to construct bi-modal inputs and feed them to the network.



| Modality | T | V | A | CMU-MOSEI MMMU-BA | CMU-MOSEI MMUU-SA | CMU-MOSEI MU-SA | CMU-MOSI MMMU-BA | CMU-MOSI MMUU-SA | CMU-MOSI MU-SA |
|---|---|---|---|---|---|---|---|---|---|
| Uni-modal | X | - | - | NA | NA | 78.23 | NA | NA | 80.18 |
| Uni-modal | - | X | - | NA | NA | 74.84 | NA | NA | 63.70 |
| Uni-modal | - | - | X | NA | NA | 75.88 | NA | NA | 62.10 |
| Bi-modal | X | X | - | 79.40 | 79.02 | 79.26 | 81.51 | 80.85 | 80.45 |
| Bi-modal | X | - | X | 79.74 | 79.60 | 79.32 | 80.58 | 80.31 | 79.78 |
| Bi-modal | - | X | X | 76.66 | 76.46 | 76.43 | 65.16 | 64.22 | 63.22 |
| Tri-modal | X | X | X | 79.80 | 79.76 | 79.63 | 82.31 | 79.52 | 80.58 |

Table 1: Experimental results (accuracy, %) on the CMU-MOSEI and CMU-MOSI datasets. The MMMU-BA & MMUU-SA frameworks require at least two modalities to compute the attentions; hence, these two frameworks are not applicable (NA) for uni-modal inputs. All results are averages of 5 runs with different random seeds. T: Text, V: Visual, A: Acoustic.

For the text-acoustic input pair, we obtain the highest bi-modal accuracies of 79.74%, 79.60% and 79.32% with the MMMU-BA, MMUU-SA and MU-SA frameworks, respectively. The results we obtain from the bi-modal combinations suggest that the text-acoustic combination is a better choice than the others, as it improves the overall performance. Finally, we experiment with tri-modal inputs and observe improved performance of 79.80%, 79.76% and 79.63% for the MMMU-BA, MMUU-SA and MU-SA frameworks, respectively. This improvement indicates that the combination of all three modalities is a better choice. The performance improvement was also found to be statistically significant (T-test) over the bi-modal and uni-modal inputs. Further, we observe that the MMMU-BA framework reports the best accuracy of 79.80% for the MOSEI dataset, thus supporting our claim that the multi-modal attention framework (i.e. MMMU-BA) captures more information than the self-attention frameworks (i.e. MMUU-SA & MU-SA).

4.4 Analysis of Attention Mechanism

We analyze the attention values to understand the learning behavior of the proposed architecture. To illustrate, we take an example video from the CMU-MOSI test dataset. The transcript of the utterances for this particular video is presented in Table 2. The gold sentiments are positive for all the utterances except u3 & u4. We found that the proposed tri-modal MMMU-BA model predicts the labels of all nine instances correctly, whereas the other models make at least one misclassification. For the proposed tri-modal MMMU-BA model, the heatmaps of the pair-wise MMMU-BA softmax attention weights $N_1$ & $N_2$ of visual-text, acoustic-visual & text-acoustic are illustrated in Figure 2a, Figure 2b & Figure 2c, respectively. $N_1$ & $N_2$ are the softmax attention weights obtained from the pairwise matching matrices $M_1$ & $M_2$.

Elements of the rows of the $N_1$ & $N_2$ matrices signify different weights across multiple utterances. From the attention heatmaps, it is evident that by applying different weights across contextual utterances and modalities, the model is able to predict the labels of all the utterances correctly, which shows that it learns to incorporate multi-modal & multi-utterance information. For example, the heatmap of MMMU-BA$_{VT}$ (Figure 2a) shows that the elements of $N_1$ are weighted higher than those of $N_2$, and thus the model puts more attention on the textual part and relatively less on the visual part (as $N_1$ is multiplied with T & $N_2$ is multiplied with V). It can also be concluded that the textual features of the first few utterances are the most helpful compared to the rest of the textual and visual features.

The softmax attention weights of text ($N_t$), visual ($N_v$) & acoustic ($N_a$) in the tri-modal MU-SA model are illustrated in Figure 2d, Figure 2e & Figure 2f, respectively. The attention matrices are of dimension 9×9. This model wrongly predicts the label of utterance u5. On the other hand, the softmax attention weights in the tri-modal MMUU-SA model are illustrated in Figure 2g. Nine separate attention weights ($N_{u_1}, N_{u_2}, \ldots, N_{u_9}$) are computed for the nine utterances. This model wrongly predicts the labels of utterances u4 & u5.

We further analyze our proposed architecture (i.e. MMMU-BA) with and without attention. On MOSI, for tri-modal inputs, the MMMU-BA architecture reports a reduced accuracy of 80.89% without the attention framework, as compared to 82.31% with attention. We observe a similar trend on the MOSEI dataset, where we obtain 79.02% accuracy without the attention framework against 79.80% with it.


| Utterance | Transcript | Gold Label | MMMU-BA | MMUU-SA | MU-SA |
|---|---|---|---|---|---|
| u1 | well he plays that well so he's good villain | Positive | Positive | Positive | Positive |
| u2 | he also has some really cool guns so that its like a desert eagle but it has to barrels to it | Positive | Positive | Positive | Positive |
| u3 | it's pretty pretty scary looking | Negative | Negative | Negative | Negative |
| u4 | i wouldn't want that pointed at me | Negative | Negative | **Positive** | Negative |
| u5 | and i who would i mean | Positive | Positive | **Negative** | **Negative** |
| u6 | um like i said i thought the movie was great | Positive | Positive | Positive | Positive |
| u7 | the action they do have is really well done | Positive | Positive | Positive | Positive |
| u8 | um they did a good job with car | Positive | Positive | Positive | Positive |
| u9 | they did a good job with fight scenes | Positive | Positive | Positive | Positive |

Table 2: Transcript, gold labels and predicted labels of a video in the CMU-MOSI dataset having nine utterances. 7 utterances are labeled positive whereas 2 utterances are labeled negative. Predicted labels are for our different tri-modal models; predictions in bold are misclassifications.

[Figure 2 (heatmaps omitted): (a), (b) & (c) Pair-wise softmax attention weights $N_1$ & $N_2$ of visual-text, acoustic-visual & text-acoustic in the tri-modal MMMU-BA model; the solid line at the center of each heatmap marks the boundary between $N_1$ and $N_2$, and each heatmap shows the attention weights of a particular utterance with respect to the other utterances in $N_1$ & $N_2$. (d), (e) & (f) Softmax attention weights of text ($N_t$), visual ($N_v$) and acoustic ($N_a$) in the tri-modal MU-SA model, which wrongly predicts the label of utterance u5. (g) Softmax attention weights of the nine utterances ($N_{u_1}, N_{u_2}, \ldots, N_{u_9}$) in the tri-modal MMUU-SA model, which wrongly predicts the labels of utterances u4 & u5. The tri-modal MMMU-BA model predicts all nine instances correctly, whereas the other two models make at least one misclassification.]


| T | V | A | CMU-MOSEI w/ attention | CMU-MOSEI w/o attention | CMU-MOSI w/ attention | CMU-MOSI w/o attention |
|---|---|---|---|---|---|---|
| X | X | - | 79.40 | 78.27 | 81.51 | 80.71 |
| X | - | X | 79.74 | 78.12 | 80.58 | 80.18 |
| - | X | X | 76.66 | 76.32 | 65.16 | 63.69 |
| X | X | X | 79.80 | 79.02 | 82.31 | 80.89 |

Table 3: Analysis of the attention mechanism in the MMMU-BA architecture. w/ attention: with the multi-modal multi-utterance attention mechanism; w/o attention: without the attention mechanism.

A statistical T-test shows these improvements to be significant. We also observe similar trends for bi-modal inputs on both datasets. All these experiments (c.f. Table 3) suggest that the attention framework is an important component of our proposed architecture, in whose absence the network finds learning more difficult in all cases (i.e. the bi-modal & tri-modal input setups).

We successfully show that attention computation on the pairwise combination of modalities (i.e. the bi-modal attention framework) is more effective than the combination of self-attentions on single modalities. Further, for completeness of the proposed approach, we also experiment with a tri-modal attention framework (where attention is computed over all three modalities at a time). Though the results we obtain are convincing, it does not improve the performance over the bi-modal attention framework. We obtain accuracies of 79.58% & 81.25% on MOSEI and MOSI, respectively, for the tri-modal attention framework.

4.5 Comparative Analysis

For the MOSI dataset we compare the performance of our proposed approach with the following state-of-the-art systems: i) Poria et al. (2017b): an LSTM-based sequence model that captures the contextual information of the utterances; ii) Poria et al. (2017c): a tensor-level fusion technique for combining all three modalities; iii) Chen et al. (2017): a gated multi-modal embedded LSTM with temporal attention (GME-LSTM(A)) for word-level fusion of multi-modality inputs; and iv) Zadeh et al. (2018a): multiple attention blocks for capturing the information across the three modalities.

In Table 4 we present the comparative performance between our proposed model and the other state-of-the-art systems. On the MOSI dataset, Poria et al. (2017b; 2017c) reported accuracies of 80.3% & 81.3%, respectively, utilizing tri-modal inputs. Zadeh et al. (2018a) obtained an accuracy of 77.4%. Chen et al. (2017) reported accuracies of 75.7% (LSTM(A)) & 76.5% (GME-LSTM(A)) for the two variants of their model. In contrast to the state-of-the-art systems, our proposed model attains an improved accuracy of 82.31% when we utilize all three modalities, i.e. text, visual & acoustic. Our proposed system also obtains better performance compared to the state-of-the-art for bi-modal inputs.

For the MOSEI dataset, we evaluate against the following systems: i) Poria et al. (2017b), ii) Zadeh et al. (2018a), and iii) Zadeh et al. (2018b), in which the authors proposed a memory fusion network for multi-view sequential learning. We evaluate the system of Poria et al. (2017b) on the MOSEI dataset and obtain 77.64% accuracy with tri-modal inputs. The authors of (Zadeh et al., 2018a) & (Zadeh et al., 2018b) reported accuracies of 76.0% and 76.4%, respectively, with tri-modal inputs. In comparison, our proposed approach yields an accuracy of 79.80%. As reported in Table 4, the proposed approach also attains better performance for all the bi-modal and uni-modal input combinations when compared to Poria et al. (2017b).

As reported in Table 4, we observe that the performance achieved by our proposed approach is significantly better in comparison to the state-of-the-art systems, with p-value < 0.05 (obtained using a T-test). For further analysis, we also report results for a three-class classification (positive, neutral & negative) setup on the MOSEI dataset in Table 7. Note that this setup is not feasible for MOSI, as its labels are only positive or negative.

4.6 Error Analysis

We perform error analysis on the predictions of our proposed MMMU-BA model with all three input sources. Confusion matrices for both datasets are shown in Table 5. For the MOSEI dataset we observe that the precision and recall for the positive class (84% precision & 88% recall) are quite encouraging. However, they are comparatively lower for the negative class (68% precision & 58% recall). In contrast, for the MOSI dataset, which is relatively balanced, we obtain quite similar performance for both classes, i.e. positive (86% precision & 85% recall) and negative (77% precision & 75% recall). Please refer to the appendix for PR curves of the different input combinations.


| Modality | T | V | A | MOSEI: Poria et al. (2017b) | MOSEI: Zadeh et al. (2018a) | MOSEI: Zadeh et al. (2018b) | MOSEI: Proposed | MOSI: Poria et al. (2017b) | MOSI: Poria et al. (2017c) | MOSI: LSTM(A) | MOSI: GME-LSTM(A) | MOSI: Zadeh et al. (2018a) | MOSI: Proposed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Uni-modal | X | - | - | 76.75 | - | - | 78.23 | 78.1 | - | 71.3 | - | - | 80.18 |
| Uni-modal | - | X | - | 71.84 | - | - | 74.84 | 60.3 | - | 52.3 | - | - | 63.70 |
| Uni-modal | - | - | X | 70.94 | - | - | 75.88 | 55.8 | - | 55.4 | - | - | 62.10 |
| Bi-modal | X | X | - | 77.03 | - | - | 79.40 | 79.3 | 79.9 | 74.3 | - | - | 81.51 |
| Bi-modal | X | - | X | 76.89 | - | - | 79.74 | 80.2 | 80.1 | 73.5 | - | - | 80.58 |
| Bi-modal | - | X | X | 72.74 | - | - | 76.66 | 62.1 | 62.9 | - | - | - | 65.16 |
| Tri-modal | X | X | X | 77.64 | 76.0 | 76.4 | 79.80 | 80.3 | 81.3 | 75.7 | 76.5 | 77.4 | 82.31 |
| T-test (p-values) | | | | - | - | - | 0.0025 | - | - | - | - | - | 0.0006 |

Table 4: Comparative analysis of the proposed approach with recent state-of-the-art systems. LSTM(A) and GME-LSTM(A) are the two variants of Chen et al. (2017). Significance T-test p-values < 0.05.

CMU-MOSEI (columns: Positive, Negative):

| 102 | 234 |
| 1230 | 269 |

CMU-MOSI (columns: Positive, Negative):

| 63 | 215 |
| 404 | 70 |

Table 5: Confusion matrices of the tri-modal MMMU-BA model.

| Dataset | Text | Actual | Predicted | Possible Reason |
|---|---|---|---|---|
| MOSI | At first I thought the movie would appeal more to younger audience. | negative | positive | Implicit sentiment |
| MOSI | Its really non-stop from beginning to end. | negative | positive | Implicit sentiment |
| MOSI | But its action isn't particularly memorable. | negative | positive | Negation & strong word |
| MOSI | I mean I don't regret seeing it. | positive | negative | Negation & strong word |
| MOSI | Um I was really looking forward to it. | negative | positive | Sarcastic sentence |
| MOSEI | And when I was going to school it was really difficult for me to find avenues and resources to be able to reach higher education. | negative | positive | Implicit sentiment |
| MOSEI | We could have a decision from the court on the stay any day now. | positive | negative | Implicit sentiment |
| MOSEI | Holidays never really happen in online courses I guess. | negative | positive | Negation & strong word |
| MOSEI | Young people dropping out of the labour market are actually not counted anymore as unemployed as they are inactive. | positive | negative | Negation & strong word |
| MOSEI | Thank you for your efforts and consideration. | negative | positive | Sarcastic sentence |

Table 6: Error analysis: frequent error cases and the possible reasons for failure of the tri-modal MMMU-BA framework.

| Metric | Poria et al. (2017b) | MMMU-BA (Tri-modal) |
|---|---|---|
| Accuracy | 61.89 | 63.30 |
| F1 Score | 61.60 | 63.07 |

Table 7: Three-class (positive, negative, neutral) classification results on the CMU-MOSEI dataset.

We further analyze our outputs qualitatively and list a few frequently occurring error categories with examples in Table 6.

5 Conclusion

In this paper, we have proposed a recurrent neural network based multi-modal attention framework that leverages contextual information for utterance-level sentiment prediction. The network learns on top of three modalities, viz. text, visual and acoustic, considering the sequence of utterances in a video. Through evaluation on two benchmark datasets (one being the popular & commonly used MOSI, and the other being the most recent & largest MOSEI dataset for multi-modal sentiment analysis), we successfully showed that the proposed attention based framework performs better than various state-of-the-art systems.

In future, we would like to investigate new techniques and explore ways to handle implicit sentiment and sarcasm. Future directions of work also include adding more dimensions, e.g. emotion analysis & intensity prediction.

6 Acknowledgment

The research reported here is partially supported by SkyMap Global India Private Limited. Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by the Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).


References

Md Shad Akhtar, Deepak Gupta, Asif Ekbal, and Pushpak Bhattacharyya. 2017. Feature selection and ensemble construction: A two-step method for aspect based sentiment analysis. Knowledge-Based Systems, 125:116–135.

Md Shad Akhtar, Ayush Kumar, Asif Ekbal, and Pushpak Bhattacharyya. 2016. A Hybrid Deep Learning Architecture for Sentiment Analysis. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016): Technical Papers, December 11-16, 2016, pages 482–493, Osaka, Japan.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrusaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171. ACM.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer. 2014. COVAREP - a collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.

Florian Eyben, Martin Wollmer, and Bjorn Schuller. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1459–1462. ACM.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th KDD, pages 168–177, Seattle, WA.

Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231.

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732.

Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Rada Mihalcea. 2012. Multimodal sentiment analysis. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, WASSA@ACL 2012, July 12, 2012, Jeju Island, Republic of Korea, page 1.

Saif Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 321–327, Atlanta, Georgia, USA. Association for Computational Linguistics.

Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interaction, ICMI 2011, Alicante, Spain, November 14-18, 2011, pages 169–176.

Bo Pang and Lillian Lee. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 115–124, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. 2017a. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37:98–125.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017b. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 873–883.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. 2017c. Multi-level multiple attentions for contextual multimodal sentiment analysis. In Data Mining (ICDM), 2017 IEEE International Conference on, pages 1033–1038. IEEE.

Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 439–448. IEEE.

Soujanya Poria, Haiyun Peng, Amir Hussain, Newton Howard, and Erik Cambria. 2017d. Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis. Neurocomputing, 261:217–230.

P. D. Turney. 2002. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Association for Computational Linguistics (ACL), pages 417–424, Philadelphia, USA.

A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L. P. Morency. 2018a. Multi-attention recurrent network for human communication comprehension. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-2018), pages 5642–5649, New Orleans, USA.

A. Zadeh, R. Zellers, E. Pincus, and L. P. Morency. 2016. Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages. IEEE Intelligent Systems, 31(6):82–88.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, Copenhagen, Denmark. Association for Computational Linguistics.

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018b. Memory Fusion Network for Multi-view Sequential Learning. arXiv preprint arXiv:1802.00927.

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018c. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246. Association for Computational Linguistics.

7 Appendix

7.1 Multi-Modal Uni-Utterance - Self Attention (MMUU-SA) Framework

Let $X_{u_p} \in \mathbb{R}^{3 \times d}$ be the information matrix of the $p^{th}$ utterance, where the three $d$-dimensional rows are the outputs of the time-distributed dense layer for the three modalities. Computation in the $p^{th}$ attention block proceeds as follows:

$$M_{u_p} = X_{u_p} \cdot X_{u_p}^{\mathsf{T}}$$

$$N_{u_p}(i,j) = \frac{e^{M_{u_p}(i,j)}}{\sum_{k=1}^{3} e^{M_{u_p}(i,k)}} \qquad \text{for } i, j = 1, 2, 3$$

$$O_{u_p} = N_{u_p} \cdot X_{u_p}$$

$$A_{u_p} = O_{u_p} \odot X_{u_p}$$

The attention matrix $A_{u_p} \in \mathbb{R}^{3 \times d}$ is computed separately for $p = 1, 2, \ldots, q$. Finally, for each utterance $p$, $A_{u_p}$ and $X_{u_p}$ are concatenated and passed to the output layer for classification.
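A minimal NumPy sketch of one such attention block (our own illustration; x_p stands for the 3 x d matrix of a single utterance):

```python
import numpy as np

def mmuu_sa_block(x_p):
    """Self-attention over the 3 modality rows of a single utterance.
    x_p: (3, d) matrix of the dense-layer outputs of text, visual, acoustic."""
    m = x_p @ x_p.T                             # (3, 3) matching matrix
    e = np.exp(m - m.max(axis=1, keepdims=True))
    n = e / e.sum(axis=1, keepdims=True)        # row-wise softmax
    o = n @ x_p                                 # (3, d) attentive representation
    a = o * x_p                                 # multiplicative gating
    return np.concatenate([a, x_p], axis=1)     # passed on to the output layer

x_p = np.random.randn(3, 100)
block_out = mmuu_sa_block(x_p)                  # (3, 2d)
```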

7.2 Multi-Utterance - Self Attention (MU-SA) Framework

In the MU-SA framework, we apply self attention on the utterances of each modality separately, and use these for classification. In contrast to the MMUU-SA framework, MU-SA utilizes the contextual information of the utterances at the attention level. Let $T \in \mathbb{R}^{u \times d}$ (text), $V \in \mathbb{R}^{u \times d}$ (visual) and $A \in \mathbb{R}^{u \times d}$ (acoustic) be the outputs of the dense layers. For the three modalities, three separate attention blocks are required, where each block takes the multi-utterance information of a single modality and computes the self-attention matrix. Specifically, the MU-SA attention ($A_v$) on V (visual) is computed as follows:

$$M_v = V \cdot V^{\mathsf{T}}$$

$$N_v(i,j) = \frac{e^{M_v(i,j)}}{\sum_{k=1}^{u} e^{M_v(i,k)}} \qquad \text{for } i, j = 1, \ldots, u$$

$$O_v = N_v \cdot V$$

$$A_v = O_v \odot V$$

Attention matrices $A_t$ and $A_a$ are computed analogously for the text and acoustic modalities. Finally, $A_v$, $A_t$, $A_a$, $V$, $T$ & $A$ are concatenated and passed to the output layer with softmax activation for classification.
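The corresponding MU-SA block, sketched for the visual modality (again our own NumPy illustration under the same assumptions):

```python
import numpy as np

def mu_sa(x):
    """Self-attention across the u utterances of a single modality.
    x: (u, d) dense-layer outputs of one modality for one video."""
    m = x @ x.T                                 # (u, u) matching matrix
    e = np.exp(m - m.max(axis=1, keepdims=True))
    n = e / e.sum(axis=1, keepdims=True)        # softmax over utterances
    return (n @ x) * x                          # attentive rep. with gating, e.g. A_v

u, d = 9, 100
V, T, A = (np.random.randn(u, d) for _ in range(3))
features = np.concatenate([mu_sa(V), mu_sa(T), mu_sa(A), V, T, A], axis=1)
```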

7.3 Dataset Statistics

Dataset statistics are presented in Table 8.

7.4 Attention Computation

The MMMU-BA$_{VT}$ attention computation is illustrated in Figure 3.


| Statistics | MOSI Tr | MOSI Dv | MOSI Ts | MOSEI Tr | MOSEI Dv | MOSEI Ts |
|---|---|---|---|---|---|---|
| #Videos | 52 | 10 | 31 | 2250 | 300 | 679 |
| #Utterances | 1151 | 296 | 752 | 16216 | 1835 | 4625 |
| #Utterances/Video - Min | 9 | 9 | 10 | 1 | 1 | 1 |
| #Utterances/Video - Max | 63 | 34 | 43 | 98 | 37 | 52 |
| #Utterances/Video - Avg | 24.692 | 22.9 | 22.129 | 7.207 | 6.116 | 6.821 |
| #Positive | 556 | 153 | 467 | 11498 | 1332 | 3281 |
| #Negative | 595 | 143 | 285 | 4718 | 503 | 1344 |
| #Words/Utterance - Min | 1 | 1 | 1 | 1 | 1 | 1 |
| #Words/Utterance - Max | 99 | 44 | 108 | 515 | 224 | 549 |
| #Words/Utterance - Avg | 11.533 | 10.786 | 13.176 | 18.227 | 18.498 | 18.658 |
| Utterance length - Min | 0.219s | 0.648s | 0.229s | 0.089s | 0.22s | 0.15s |
| Utterance length - Max | 38.233s | 13.599s | 31.957s | 208.27s | 90.42s | 188.22s |
| Utterance length - Avg | 3.635s | 3.538s | 4.536s | 6.896s | 6.960s | 7.158s |
| #Speakers (total) | 89 | | | 1000 | | |

Table 8: Dataset statistics for MOSI (Zadeh et al., 2016) and MOSEI (Zadeh et al., 2018c). Tr: Train set; Dv: Development set; Ts: Test set.

[Figure 3: MMMU-BA$_{VT}$ attention computation, shown as matrix multiplications, row-wise softmax operations and element-wise products; diagram omitted.]

7.5 Precision-Recall (PR) curve

We illustrate the precision, recall & F-measure for different input combinations in Figure 4 & Figure 5.


[Figure 4: Precision, Recall & F-measure for different input combinations (AV, TA, VT, AVT) in the MMMU-BA architecture on the MOSI dataset; bar chart omitted.]

[Figure 5: Precision, Recall & F-measure for different input combinations (AV, TA, VT, AVT) in the MMMU-BA architecture on the MOSEI dataset; bar chart omitted.]