
HAL Id: hal-02371140
https://hal.archives-ouvertes.fr/hal-02371140

Submitted on 19 Nov 2019


From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining

Alexandre Garcia, Pierre Colombo, Slim Essid, Florence d'Alché-Buc, Chloe Clavel

To cite this version:
Alexandre Garcia, Pierre Colombo, Slim Essid, Florence d'Alché-Buc, Chloe Clavel. From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining. 2019 Conference on Empirical Methods in Natural Language Processing, Nov 2019, Hong-Kong, China. ⟨hal-02371140⟩


From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining

Alexandre Garcia∗, Pierre Colombo∗†, Slim Essid, Florence d'Alché-Buc, Chloé Clavel

Télécom ParisTech, Université Paris Saclay
† IBM GBS France

{garcia,sessid,fdalche,cclavel}@[email protected]

Abstract

The task of predicting fine grained user opinion based on spontaneous spoken language is a key problem arising in the development of Computational Agents, as well as in the development of social network based opinion miners. Unfortunately, gathering reliable data on which a model can be trained is notoriously difficult, and existing works rely only on coarsely labeled opinions. In this work we aim at bridging the gap separating fine grained opinion models already developed for written language and coarse grained models developed for spontaneous multimodal opinion mining. We take advantage of the implicit hierarchical structure of opinions to build a joint fine and coarse grained opinion model that exploits different views of the opinion expression. The resulting model shares some properties with attention-based models and is shown to provide competitive results on a recently released multimodal fine grained annotated corpus.

1 Introduction

Recent years have witnessed the increasing popularity of social networks and video streaming platforms. People heavily rely on these channels to express their opinions through video-based discussions or reviews. Whereas such opinionated data has been widely studied in the context of written customer reviews (Liu, 2012) crawled on websites such as Amazon (Hu and Liu, 2004) and IMDB (Maas et al., 2011), only a few studies have been proposed in the case of video-based reviews. Such multimodal data has been shown to provide a means to disambiguate some hard to understand opinion expressions such as irony and sarcasm (Attardo et al., 2003), and contains crucial information indicating the level of engagement and the persuasiveness of the speaker (Clavel and Callejas, 2016; Ben Youssef et al., 2019; Nojavanasghari et al., 2016).

∗ equal contribution

A key problem in this context is the lack of availability of fine grained opinion annotations, i.e. annotations performed at the token or short span level and highlighting the components of the structure of opinions. Indeed, whereas such resources have been gathered in the case of textual data and can be used to deeply understand the expression of opinions (Wiebe et al., 2005; Pontiki et al., 2016), the different attempts at annotating multimodal reviews have shown that reaching good annotator agreement is nearly impossible at a fine grained level. This results from the disfluent aspect of spontaneous spoken language, which makes it difficult to choose the annotation boundaries of opinions (Garcia et al., 2019; Langlet and Clavel, 2015b). Thus the price to pay to gather reliable data is the definition of an annotation scheme focusing on coarse grained information, such as long segment categorization as done by Zadeh et al. (2016a) or review level annotation (Park et al., 2014). Building models able to predict fine grained opinion information in a multimodal setting is in fact of high importance in the context of designing human–robot interfaces (Langlet and Clavel, 2016). Indeed, the knowledge of opinions, decomposed over a set of polarities associated to some targets, is a building block of automatic human understanding pipelines (Langlet and Clavel, 2015a). The present work is motivated by the following observations:

• Despite the lack of reliability of fine grained labels collected for multimodal data, the redundancy of the opinion information contained at different granularities can be leveraged to reduce the inherent noise of the labelling process and to build improved opinion predictors. We build a model that takes advantage of this property and jointly models the different components of an opinion.

• Hierarchical multi-task language models have recently been shown to improve upon single task models (Sanh et al., 2018). A careful choice of the tasks and of the order in which they are sequentially presented to the model has been proved to be the key to building competitive predictors. It is not clear whether this type of hierarchical model can be adapted to handle multimodal data with state of the art neural architectures (Zadeh et al., 2018a,b). We discuss in the experimental section the strategies and models that are adapted to the multimodal opinion mining context.

• In the case where no fine grained supervision is available, the attention mechanism (Vaswani et al., 2017) provides a compelling alternative for building models that generate interpretable decisions with token-level explanations (Hemamou et al., 2018). In practice, such models are notoriously hard to train and require very large datasets. On the other hand, the injection of fine grained polarity information has been shown to be a key ingredient for building competitive sentiment predictors by Socher et al. (2013). Our hierarchical approach can be interpreted under the lens of attention-based learning, where some supervision is provided at training time to counterbalance the difficulty of learning meaningful patterns with spoken language data. We experimentally show that providing this supervision is necessary here to build competitive predictors, due to the limited amount of data and the difficulty of extracting meaningful patterns from it.

2 Background on fine grained opinion mining

The computational models of opinion are grounded in a linguistic framework defining how these objects can be structured over a set of interdependent functional parts. In this work we focus on the model of Martin and White (2013), which defines the expression of opinions as an evaluation towards an object. The expression of such evaluations can be summarized by the combination of three components: a source (mainly the speaker) expressing a statement on a target identifying the entity evaluated, and a polarized expression making the attitude of the source explicit. In the literature, the task of finding the words indicating these components and categorizing them using a set of predefined possible targets and polarities has been studied under the name of Aspect Based Sentiment Analysis (ABSA) and popularized by the SEMEVAL campaigns (Pontiki et al., 2016), which defined a set of tasks including sentence-level prediction. Aspect Category Detection consists in finding the target of an opinion from a set of possible entities; Opinion Target Expression is a sequence tagging problem where the goal is to find the words indicating this entity; and Sentiment Polarity recognition is a classification task where the predictor has to determine whether the underlying opinion is positive, negative or neutral. Such problems have also been extended to the text level (text-level ABSA), where the participants were asked to predict a set of tuples (Entity category, Polarity level) summarizing the opinions contained in a review. In this work we adapt these tasks to a recently released fine grained multimodal opinion mining corpus and study a category of hierarchical neural architectures able to jointly perform token-level, sentence-level and review-level predictions. In the next sections, we present the data available and the definition of the different tasks.

3 Data description and model

This work relies on a set of fine and coarse grained opinion annotations gathered for the Persuasive Opinion Multimedia (POM) corpus presented in Garcia et al. (2019). The dataset is composed of 1000 videos carrying a strong opinion content: in each video, a single speaker in frontal view makes a critique of a movie that he/she has watched. The corpus contains 372 unique speakers and 600 unique movie titles. The opinion of each speaker has been annotated at 3 levels of granularity, as shown in Figure 1.

At the finest (Token) level, the annotators indicated for each token whether it is responsible for the understanding of the polarity of the sentence and whether it describes the target of an opinion. On top of this, a span-level annotation contains a categorization of both the target and the polarity of the underlying opinion within a set of predefined possible target entities and polarity valences. At the review level (or text level, since the annotations are aligned with the tokens of the transcript), an overall score describes the attitude of the reviewer about the movie.

Figure 1: Structure of an annotated opinion

As Garcia et al. (2019) have shown that the boundaries of span-level annotations are unreliable, we relax the corresponding boundaries to the sentence level. This sentence granularity is, in our data, the intermediate level of annotation between the token and the text. In practice, these intermediate level labels can be modeled by tuples such as the ones provided in the text-level ABSA SEMEVAL task, which are given for each sentence in the dataset. In what follows, we will refer to the problem of predicting such information as the sentence-level prediction problem. Details concerning the determination of the sentence boundaries and the associated pre-processing of the data are given in the supplemental material.

The representation described above can be naturally converted into a mathematical one: a review $x^{(i)}$, $i \in \{1, \dots, N\}$, is made of $S_i$ sentences, each containing $W_{S_i}$ words. Thus the canonical feature representation of a review is

$$x^{(i)} = \big\{\{x^{(i)}_{1,1}, \dots, x^{(i)}_{1,W_{S_1}}\}, \dots, \{x^{(i)}_{S_i,1}, \dots, x^{(i)}_{S_i,W_{S_i}}\}\big\},$$

where each $x$ is the feature representation of a spoken word, corresponding to the concatenation of a textual, an audio and a video feature representation. It has been shown in (Zadeh et al., 2018a, 2016a,b) that whereas the textual modality carries the most information, taking the video and audio modalities into account is mandatory to obtain state of the art results on sentiment analysis problems. Based on this input description, the learning task consists in finding a parameterized function $g_\theta : \mathcal{X} \to \mathcal{Y}$ that predicts the various components of an opinion $y \in \mathcal{Y}$ based on an input review $x \in \mathcal{X}$. The parameters of such a function are obtained by minimizing an empirical risk:

$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \sum_{i=1}^{N} l\big(g_\theta(x^{(i)}), y^{(i)}\big), \qquad (1)$$

where $l$ is a non-negative loss function penalizing wrong predictions. In general the loss $l$ is chosen as a surrogate of the evaluation metric, whose purpose is to measure the similarity between the predictions and the true labels. In the case of complex objects such as opinions, there is no natural metric for measuring such proximity, and we rely instead on distances defined on substructures of the opinion model. To introduce these distances, we first decompose the label structures following the model previously described:

• Token-level labels are represented by a sequence of 2-dimensional binary label vectors
$$y^{(i),\mathrm{Tok}}_{j,k} = \begin{pmatrix} y^{(i),\mathrm{Pol}}_{j,k} \\ y^{(i),\mathrm{Tar}}_{j,k} \end{pmatrix},$$
where $y^{(i),\mathrm{Pol}}_{j,k}$ and $y^{(i),\mathrm{Tar}}_{j,k}$ are binary variables indicating respectively whether the $k$-th word of sentence $j$ in review $i$ is a word indicating the polarity of an opinion, and the target of an opinion.

• Sentence-level labels carry 2 pieces of information: (1) the categorization of the target entities mentioned in an expressed opinion is represented by an $E$-dimensional binary vector $y^{(i),\mathrm{Ent}}_j$, where each component encodes the presence of an entity among $E$ possible values; and (2) the polarity of the opinions contained in the sentence is represented by a 4-dimensional one-hot vector $y^{(i),\mathrm{Val}}_j$ encoding the possible valences: Positive, Negative, Neutral/Mixed and None. Thus the sentence-level label $y^{(i),\mathrm{Sent}}_j$ is the concatenation of the two representations presented above:
$$y^{(i),\mathrm{Sent}}_j = \begin{pmatrix} y^{(i),\mathrm{Ent}}_j \\ y^{(i),\mathrm{Val}}_j \end{pmatrix}.$$

• Text-level labels are composed of a single continuous score $y^{(i),\mathrm{Tex}}$ obtained for each review, summarizing the overall rating given by the reviewer to the movie described.
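To fix ideas, this label decomposition can be laid out as plain tensors. The sketch below is purely illustrative: the sizes S, W and E are toy values, not the released corpus format.

```python
import torch

# Toy sizes for one review: S sentences, W tokens per sentence,
# E candidate target entities (illustrative values only).
S, W, E = 3, 10, 11

# Token level: one (polarity-word, target-word) binary pair per token.
y_tok = torch.zeros(S, W, 2)

# Sentence level: E-dimensional multi-hot entity vector concatenated
# with a 4-dimensional one-hot valence
# (Positive, Negative, Neutral/Mixed, None).
y_sent = torch.zeros(S, E + 4)

# Text level: a single continuous score for the whole review.
y_tex = torch.tensor(0.7)
```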

Based on these representations, we define a set of losses $l^{(\mathrm{Tok})}, l^{(\mathrm{Sent})}, l^{(\mathrm{Tex})}$ dedicated to measuring the similarity of each substructure prediction $\hat{y}^{(\mathrm{Tok})}, \hat{y}^{(\mathrm{Sent})}, \hat{y}^{(\mathrm{Tex})}$ with the ground truth. In the case of binary variables, and in the absence of a prior preference between targets and polarities, we use the negative log-likelihood for each variable. Each task loss is then defined as the average of the negative log-likelihoods computed on the variables that compose it. For continuous variables, we use the mean squared error as the task loss. Consequently, the losses to minimize can be expressed as:

$$l^{(\mathrm{Tok})}(y^{\mathrm{Tok}}, \hat{y}^{\mathrm{Tok}}) = -\frac{1}{2} \sum_i \big( y^{\mathrm{Pol}}_i \log(\hat{y}^{\mathrm{Pol}}_i) + y^{\mathrm{Tar}}_i \log(\hat{y}^{\mathrm{Tar}}_i) \big),$$

$$l^{(\mathrm{Sent})}(y^{\mathrm{Sent}}, \hat{y}^{\mathrm{Sent}}) = -\frac{1}{2} \sum_i \big( y^{\mathrm{Ent}}_i \log(\hat{y}^{\mathrm{Ent}}_i) + y^{\mathrm{Val}}_i \log(\hat{y}^{\mathrm{Val}}_i) \big),$$

$$l^{(\mathrm{Tex})}(y^{\mathrm{Tex}}, \hat{y}^{\mathrm{Tex}}) = (y^{\mathrm{Tex}} - \hat{y}^{\mathrm{Tex}})^2.$$
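A minimal PyTorch transcription of these three task losses could look as follows. It keeps exactly the positive-class log-likelihood terms written above (a full binary cross-entropy would add the $(1-y)\log(1-\hat{y})$ terms) and is a sketch, not the authors' code.

```python
import torch

def token_loss(y, y_hat, eps=1e-8):
    # y, y_hat: (num_tokens, 2), columns = (polarity, target) indicators.
    return -0.5 * (y * torch.log(y_hat + eps)).sum(dim=1).mean()

def sentence_loss(y, y_hat, eps=1e-8):
    # y, y_hat: (num_sentences, E + 4), entity block then valence block.
    return -0.5 * (y * torch.log(y_hat + eps)).sum(dim=1).mean()

def text_loss(y, y_hat):
    # Mean squared error on the continuous review-level score.
    return ((y - y_hat) ** 2).mean()
```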

Following previous works on multi-task learning (Argyriou et al., 2007; Ruder, 2017), we argue that optimizing simultaneously the risks derived from these losses should improve the results compared to the case where they are treated separately, due to the knowledge transferred across tasks. In the multi-task setting, the loss $l$ derived from a set of task losses $l^{(t)}$ is a convex combination of these different task losses. Here the tasks correspond to the granularity levels, $t \in \mathrm{Tasks} = \{\mathrm{Tok}, \mathrm{Sent}, \mathrm{Tex}\}$, weighted according to a set of task weights $\lambda_t$:

$$l(y, \hat{y}) = \frac{\sum_{t \in \mathrm{Tasks}} \lambda_t\, l^{(t)}(y^t, \hat{y}^t)}{\sum_{t \in \mathrm{Tasks}} \lambda_t}, \qquad \forall \lambda_t \geq 0. \qquad (2)$$
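Equation 2 itself is a one-liner; the sketch below combines the task losses defined above, with the weight dictionary as a hypothetical container.

```python
def multitask_loss(task_losses, task_weights):
    # Convex combination of Equation 2:
    # l = sum_t lambda_t * l_t / sum_t lambda_t.
    total = sum(task_weights[t] for t in task_losses)
    return sum(task_weights[t] * task_losses[t] for t in task_losses) / total

# e.g. multitask_loss({"Tok": l_tok, "Sent": l_sent, "Tex": l_tex},
#                     {"Tok": 0.05, "Sent": 0.5, "Tex": 1.0})
```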

Optimizing this type of objective in the case of hierarchical deep net predictors requires building a strategy for training the different parts of the model, the low level parts as well as the abstract ones. We discuss this issue in the next section.

4 Learning strategies for multitask objectives

The main concern when optimizing objectives of the form of Equation 2 comes from the variable difficulty of optimizing the different objectives $l^{(t)}$. Previous works (Sanh et al., 2018) have shown that a careful choice of the order in which they are introduced is a key ingredient to correctly train deep hierarchical models. In the case of hierarchical labels, a natural hierarchy in the prediction complexity is given by the problem: in the task at hand, coarse grained labels are predicted by taking advantage of the information coming from predicting fine grained ones. The model processes the text by recursively merging and selecting information in order to build an abstract representation of the review. In Experiment 1 we show that incorporating these fine grained labels into the learning process is necessary to obtain competitive predictors. In order to gradually guide the model from easy tasks to harder ones, we parameterize each $\lambda_t$ as a function of the number of epochs, of the form

$$\lambda_t^{(n_{\mathrm{epoch}})} = \lambda^{\max}_t\, \frac{\exp\big((n_{\mathrm{epoch}} - N_{s_t})/\sigma\big)}{1 + \exp\big((n_{\mathrm{epoch}} - N_{s_t})/\sigma\big)},$$

where $N_{s_t}$ is a parameter devoted to task $t$, controlling the number of epochs after which the weight switches to $\lambda^{\max}_t$, and $\sigma$ is a parameter controlling the slope of the transition. We construct 4 strategies relying on smooth transitions from a low state $\lambda_t^{(0)} = 0$ to a high state $\lambda_t^{(N_{s_t})} = \lambda^{\max}_t$ of each task weight, varying with the number of epochs. The different strategies described below are illustrated in the supplemental material.

• Strategy 1 (S1) consists in optimizing the different objectives one at a time, from the easiest to the hardest: the weight vector $(\lambda_{\mathrm{Token}}, \lambda_{\mathrm{Sentence}}, \lambda_{\mathrm{Text}})^T$ first moves from $(1, 0, 0)^T$ to $(0, 1, 0)^T$, and then finally to $(0, 0, 1)^T$. The underlying idea is that the low level labels are only useful as an initialization point for the higher level ones.

• Strategy 2 (S2) consists in sequentially adding the different objectives to one another, from the easiest to the hardest. It starts from a word-only loss, $(\lambda_{\mathrm{Token}}, \lambda_{\mathrm{Sentence}}, \lambda_{\mathrm{Text}})^T = (\lambda^{(N)}_{\mathrm{Token}}, 0, 0)^T$, and then adds the intermediate objectives by setting $\lambda_{\mathrm{Sentence}}$ to $\lambda^{(N)}_{\mathrm{Sentence}}$ and then $\lambda_{\mathrm{Text}}$ to $\lambda^{(N)}_{\mathrm{Text}}$. This strategy relies on the idea that keeping a supervision on low level labels has a regularizing effect on high level ones. Note that this strategy and the two following ones require a choice of the stationary weight values $\lambda^{(N)}_{\mathrm{Token}}, \lambda^{(N)}_{\mathrm{Sentence}}, \lambda^{(N)}_{\mathrm{Text}}$.

• Strategy 3 (S3) is similar to (S2), except that the sentence and text weights are increased simultaneously. This strategy and the following one are introduced to test whether the order in which the tasks are introduced has some importance for the final scores.


• Strategy 4 (S4) is also similar to (S2), except that the text-level supervision is introduced before the sentence-level one. This strategy uses the intermediate level labels as a way to regularize the video level model that would have been learned directly after the token-level supervision.

These strategies can be implemented in any stochastic gradient training procedure for objectives of the form of Equation 2, since they only require modifying the values of the weights at the end of each epoch. In the next section, we design a neural architecture that jointly predicts opinions at the three different levels, i.e. the token, sentence and text levels, and discuss how to optimize multitask objectives built on top of opinion-based output representations.
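As an illustration, the four weight paths can be generated by a single helper. Strategies 1–3 follow the equations given in the supplemental material; the S4 branch swaps the two switch epochs and is only our reading of its description above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def strategy_weights(n_epoch, strategy, ns_tok, ns_sent, sigma,
                     lam=(0.05, 0.5, 1.0)):
    # Returns (lambda_Token, lambda_Sentence, lambda_Text) at a given epoch;
    # `lam` holds the stationary values retained in Experiment 1.
    s_tok = sigmoid((n_epoch - ns_tok) / sigma)    # early switch
    s_sent = sigmoid((n_epoch - ns_sent) / sigma)  # late switch
    if strategy == "S1":   # one objective at a time
        return (1 - s_tok, s_tok - s_sent, s_sent)
    if strategy == "S2":   # objectives added sequentially
        return (lam[0], lam[1] * s_tok, lam[2] * s_sent)
    if strategy == "S3":   # sentence and text switched on together
        return (lam[0], lam[1] * s_tok, lam[2] * s_tok)
    if strategy == "S4":   # text introduced before sentence (assumed)
        return (lam[0], lam[1] * s_sent, lam[2] * s_tok)
    raise ValueError(strategy)
```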

5 Architecture

Before digging into the model description, we introduce the set of hidden variables $h^{(i),\mathrm{Tex}}, h^{(i),\mathrm{Sent}}_j, h^{(i),\mathrm{Tok}}_{j,k}$ corresponding to the unconstrained scores used to predict the outputs:

$$\hat{y}^{(i),\mathrm{Tex}} = \sigma^{\mathrm{Tex}}(W^{\mathrm{Tex}} h^{(i),\mathrm{Tex}} + b^{\mathrm{Tex}}),$$
$$\hat{y}^{(i),\mathrm{Sent}}_j = \sigma^{\mathrm{Sent}}(W^{\mathrm{Sent}} h^{(i),\mathrm{Sent}}_j + b^{\mathrm{Sent}}),$$
$$\hat{y}^{(i),\mathrm{Tok}}_{j,k} = \sigma^{\mathrm{Tok}}(W^{\mathrm{Tok}} h^{(i),\mathrm{Tok}}_{j,k} + b^{\mathrm{Tok}}),$$

where the $W$ and $b$ are parameters learned from data and the $\sigma$ are fixed, almost-everywhere differentiable functions ensuring that the outputs "match" the inputs of the loss function. In the case of binary variables, for example, it is chosen as the sigmoid function $\sigma(x) = \exp(x)/(1 + \exp(x))$. From a general perspective, a hierarchical opinion predictor is composed of 3 functions $g^{\mathrm{Tex}}, g^{\mathrm{Sent}}, g^{\mathrm{Tok}}$ encoding the dependency across the levels:

$$h^{(i),\mathrm{Tok}}_{j,k} = g^{\mathrm{Tok}}_{\theta^{\mathrm{Tok}}}\big(x^{(i),\mathrm{Tok}}_{j,:}\big),$$
$$h^{(i),\mathrm{Sent}}_j = g^{\mathrm{Sent}}_{\theta^{\mathrm{Sent}}}\big(h^{(i),\mathrm{Tok}}_{j,:}\big),$$
$$h^{(i),\mathrm{Tex}} = g^{\mathrm{Tex}}_{\theta^{\mathrm{Tex}}}\big(h^{(i),\mathrm{Sent}}_{:}\big).$$
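A skeletal PyTorch version of this three-level factorization, with plain GRUs standing in for the $g$ functions and last-state pooling between levels, could look as follows. The hidden sizes, the sigmoid heads and the single-review batching are illustrative assumptions, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class HierarchicalOpinionNet(nn.Module):
    # Token -> sentence -> text encoders sharing hidden states upward,
    # with one prediction head per granularity level.
    def __init__(self, d_in, d_tok=64, d_sent=64, n_ent=11):
        super().__init__()
        self.g_tok = nn.GRU(d_in, d_tok, batch_first=True, bidirectional=True)
        self.g_sent = nn.GRU(2 * d_tok, d_sent, batch_first=True)
        self.head_tok = nn.Linear(2 * d_tok, 2)      # polarity/target words
        self.head_sent = nn.Linear(d_sent, n_ent + 4)  # entities + valence
        self.head_tex = nn.Linear(d_sent, 1)         # review score

    def forward(self, x):
        # x: (n_sentences, n_tokens, d_in), one review at a time.
        h_tok, _ = self.g_tok(x)
        y_tok = torch.sigmoid(self.head_tok(h_tok))
        sent_repr = h_tok[:, -1, :]                  # last-state pooling
        h_sent, h_last = self.g_sent(sent_repr.unsqueeze(0))
        y_sent = torch.sigmoid(self.head_sent(h_sent.squeeze(0)))
        y_tex = torch.sigmoid(self.head_tex(h_last[-1, 0]))
        return y_tok, y_sent, y_tex
```

In the multi-task setting, the three outputs feed the three task losses sketched in Section 3.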

In this setting, low level hidden representations are shared with higher level ones. A large body of work has focused on the design of the $g$ functions in the case of multimodal inputs. In this work we exploit state of the art sequence encoders, detailed below, to build our hidden representations. The mathematical expression of the models and a more in-depth description are provided in the supplemental material.

• Bidirectional Gated Recurrent Units (GRU) (Cho et al., 2014), especially when coupled with a self-attention mechanism, have been shown to provide state of the art results on tasks implying the encoding or decoding of a sentence in or from a fixed size representation. Such a problem is encountered in automatic machine translation (Luong et al., 2015), automatic summarization (Nallapati et al., 2017), and image captioning and visual question answering (Anderson et al., 2018). We experiment both with a model mixing the 3 concatenated input feature modalities (BiGRU model in Experiment 1) and with a model carrying 3 independent BiGRUs with one hidden state per modality (Ind BiGRU models).

• The Multi-attention Recurrent Network (MARN) proposed in (Zadeh et al., 2018a) extends the traditional Long Short Term Memory (LSTM) sequential model both by storing a view-specific dynamic (similar to the LSTM one) and by taking into account cross-view dynamics computed from the signals of the other modalities. In the original paper, this cross-view dynamic is computed using a multi-attention block containing a set of weights for each modality, used to mix them into a joint hidden representation. Such a network can model complex dynamics but does not embed a mechanism dedicated to encoding very long-range dependencies.

• Memory Fusion Networks (MFN) are a second family of multi-view sequential models, built upon one LSTM per modality feeding a joint delta memory. This architecture has been designed to carry information in the memory even over very long sequences, thanks to a complex retain/forget mechanism.

The 3 models described previously build a hidden representation of the data contained in each sequence. The transfer from one level of the hierarchy to the next coarser one requires building a fixed length representation summarizing the sequence. Note that in the case of the MARN and the MFN, the model directly creates such a representation. We present below the strategies that we deployed to pool these representations in the case of the BiGRU sequential layer.

• Last state representation: Sequential models build their inner state based on observations from the past. One can thus naturally use the hidden state computed at the last observation of a sequence to represent the entire sequence. In our experiments, this is the representation chosen for the BiGRU and Ind BiGRU models.

• Attention based sequence summarization: Another technique consists in computing a weighted sum of the hidden states of the sequence. The attention weights can be learned from data in order to focus only on the important parts of the sequence and avoid building overly complex inner representations. An example of such a technique, successfully applied to the task of text classification based on 3 levels of representation, can be found in (Yang et al., 2016). In our experiments, we implemented the attention model for predicting only the sentence-level labels (Ind BiGRU + att Sent model) and for predicting the sentence and text-level labels by sharing a common representation (Ind BiGRU + att model).

All the resulting architectures extend the existing hierarchical models by enabling the fusion of multimodal information at different granularity levels, while maintaining the ability to introduce some supervision at any level.

6 Experiments

In this section we propose 3 sets of experiments that show the superiority of our model over existing approaches with respect to the difficulties highlighted in the introduction, and explore the question of the best way to train hierarchical models on multimodal opinion data.

All the results presented below have been obtained on the recently released fine grained annotated POM dataset (Garcia et al., 2019). The input features are computed using the CMU-Multimodal SDK: we represent each word by the concatenation of the 3 feature modalities. The textual features are the 300-dimensional pre-trained GloVe embeddings (Pennington et al., 2014) (not updated during training). The acoustic and visual features have been obtained by averaging the descriptors computed following (Park et al., 2014) over the time of pronunciation of each spoken word. These features include MFCC and pitch descriptors for the audio signal. For the video descriptors, posture, head and gaze movements are taken into account. As far as the output representations are concerned, we merely re-scale the text-level polarity labels to the [0, 1] range.

The results are reported in terms of mean average error (MAE) for the continuous labels and micro F1 score (µF1) for the binary labels. We used the provided train, validation and test sets, and describe for each experiment the training procedure and the displayed values below. More details concerning the preprocessing and architectures can be found in the supplemental material.

6.1 Experiment 1: Which architecture provides the best results on the task of fine grained opinion polarity prediction?

In this first section, we describe our protocol for selecting an architecture devoted to fine grained multimodal opinion prediction. In order to focus on a restricted set of possible models, we only treat the polarity prediction problem in this section, and selected the architectures that provided the best review-level scores (i.e. the lowest mean average prediction error). Taking the entity categories into account would only bring an additional level of complexity that is not necessary in this first model selection phase. Building upon previous work (Zadeh et al., 2018b), we use the MFN model as our sentence-level sequential model, since it has been shown to provide state of the art results on text-level prediction problems on the POM dataset. For the token-level model, we test different state of the art models able to take advantage of the multimodal information. Our architecture is built upon the token-level encoders presented in Section 5: the MFN, MARN and independent BiGRUs. Our baseline is computed similarly to Zadeh et al. (2018a): we represent each sentence by the average of the feature representations of the tokens composing it. The best results, obtained after a random search on the parameters, are presented in Table 1.

λTok = λSent = 0 (no fine grained supervision):

| Metric | BiGRU | Ind BiGRU | Ind BiGRU + att Sent | Ind BiGRU + att | MARN | MFN | Av Emb |
|---|---|---|---|---|---|---|---|
| MAE Text | 0.35 | 0.40 | 0.40 | 0.38 | 0.29 | 0.32 | 0.17 |

Supervision at the token, sentence and review levels:

| Metric | BiGRU | Ind BiGRU | Ind BiGRU + att Sent | Ind BiGRU + att | MARN | MFN | Av Emb |
|---|---|---|---|---|---|---|---|
| µF1 Tokens | 0.90 | 0.93 | 0.93 | 0.93 | 0.90 | 0.89 | X |
| µF1 Sentence | 0.68 | 0.72 | 0.75 | 0.75 | 0.52 | 0.47 | X |
| MAE Text | 0.16 | 0.15 | 0.15 | 0.14 | 0.35 | 0.37 | X |

Table 1: Scores on the sentiment labels

In the top part, we report the results obtained when only the text-level labels are used to train the entire network. The baseline, consisting in representing each sentence by the average of its token representations, strongly outperforms all the other models. This is due to the moderate size of the training set (600 videos), which is not enough to learn meaningful fine grained representations. In the second part, we introduce supervision at all levels and found that the choice of λTok = 0.05, λSent = 0.5, λTex = 1 (respectively the token, sentence and text weights) provides the best text-level results. This combination reflects the fact that the main objective (text-level) should receive the highest weight, but the low level ones also add some useful side supervision. Despite the ability of MARN and MFN to learn complex representations, the simpler BiGRU-based token encoder obtains the best results at all levels and provides more than 12% relative improvement over the Average Embedding based model at the video level. This behavior reveals that the high complexity of MARN and MFN makes them hard to train in the context of hierarchical models with limited data, leading to suboptimal performance compared to simpler encoders such as the BiGRU. We fix the best architecture obtained in this experiment, displayed in Figure 2, and reuse it in the subsequent experiments.

6.2 Experiment 2: What is the best strategy to take into account multiple levels of opinion information?

Figure 3: Path of the weight vector in the simplex triangle for the different tested strategies

Motivated by the issues concerning the training of multitask losses raised in Section 4, we implemented the 4 strategies described there and chose the final stationary values as the best ones obtained in Experiment 1: $(\lambda^{(N)}_{\mathrm{Token}}, \lambda^{(N)}_{\mathrm{Sentence}}, \lambda^{(N)}_{\mathrm{Text}}) = (0.05, 0.5, 1)$. Note that each strategy corresponds to a path of the vector $(\lambda_{\mathrm{Tok}}, \lambda_{\mathrm{Sent}}, \lambda_{\mathrm{Tex}})^T / \sum_t \lambda_t$ in the 3-dimensional simplex. We represent the strategies tested in Figure 3, corresponding to the projection of the weight vector onto the hyperplane containing the simplex.

The best paths for optimizing the text-level objective are the ones that smoothly move from a combination of sentence and token-level objectives to a text oriented one. The path in the simplex seems to be more important than the nature of the strategy, since S1 and S2 reach the same text-level MAE score while working differently. It also appears that an objective with low σ values (the slope parameter described in Section 4), corresponding to harder transitions, tends to obtain lower scores than smooth transition based strategies. All the strategies are displayed as a function of the number of epochs in the supplemental material. In the last section, we deal with the issue of the joint prediction of entities and polarities.

6.3 Experiment 3: Is it better to jointly predict opinions and entities?

Figure 2: Best architecture selected during Experiment 1

In this section, we introduce the problem of predicting the entities of the movie on which the opinions are expressed, as well as the tokens that mention them. This task is harder than the previously studied polarity prediction task due to (1) the label imbalance appearing in the label distribution reported in Table 3 and (2) the diversity of the vocabulary incurred when dealing with many entities. However, since the presence of a polarity implies the presence of at least one entity, we expect a joint prediction to perform better than an entity-only predictor. Table 2 contains the results obtained with the architecture described in Figure 2 on the task of joint polarity and entity prediction, as well as the results obtained when dealing with these tasks independently.

Using either the joint or the independent models provides the same results on the polarity prediction problems at the token and sentence levels. The reason is that the polarity prediction problem is easier, and relying on the entity predictions would only introduce noise into the prediction.

| Metric | Polarity labels | Entity labels | Polarity + entities |
|---|---|---|---|
| F1 polarity tokens | 0.93 | X | 0.93 |
| F1 polarity valence | 0.75 | X | 0.75 |
| F1 entity tokens | X | 0.97 | 0.97 |
| F1 entity categories | X | Table 3 | Table 3 |
| MAE score review level | 0.14 | 0.38 | 0.14 |

Table 2: Joint and independent prediction of entities and polarities

We detail the case of entities in Table 3, where we present the results obtained for the most common entity categories (among 11). As expected, the entity prediction task benefits from the polarity information for most of the categories, except Vision and special effects. A 5% relative improvement can be noted on the two most present entities: Overall and Screenplay.

| Entity category | Entity | Entity + Polarity | Value count |
|---|---|---|---|
| Overall | 0.71 | 0.73 | 1985 |
| Actors | 0.65 | 0.65 | 493 |
| Screenplay | 0.60 | 0.63 | 246 |
| Atmosphere and mood | 0.62 | 0.64 | 151 |
| Vision and special effects | 0.62 | 0.58 | 154 |

Table 3: F1 score per label for the top entity categories annotated at the sentence level (mean score averaged over 7 runs); value counts are given on the test set.

7 Conclusion

The proposed framework enables the joint prediction of the different components of an opinion based on a hierarchical neural network. The resulting models can be fully or partially supervised and take advantage of the information provided by different views of the opinions. We have experimentally shown that a good learning strategy should first rely on the easy tasks (i.e. those for which the labels do not require a complex transformation of the inputs) and then move to more abstract tasks, benefiting from the low level knowledge. Future work will explore the use of structured output learning methods dedicated to the opinion structure.


8 Acknowledgements

We would like to thank Thibaud Besson and the whole French Cognitive Systems team of IBM for supporting our research with the IBM Power AC922 server.

References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. 2007. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41–48.

Salvatore Attardo, Jodi Eisterhold, Jennifer Hay, and Isabella Poggi. 2003. Multimodal markers of irony and sarcasm. Humor, 16(2):243–260.

A. Ben Youssef, C. Clavel, and S. Essid. 2019. Early detection of user engagement breakdown in spontaneous human-humanoid interaction. IEEE Transactions on Affective Computing, pages 1–1.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).

Chloe Clavel and Zoraida Callejas. 2016. Sentiment analysis: from opinion mining to human-agent interaction. IEEE Transactions on Affective Computing, 7(1):74–93.

Alexandre Garcia, Slim Essid, Florence d'Alché-Buc, and Chloé Clavel. 2019. A multimodal movie review corpus for fine-grained opinion mining. arXiv preprint arXiv:1902.10102.

Léo Hemamou, Ghazi Felhi, Vincent Vandenbussche, Jean-Claude Martin, and Chloé Clavel. 2018. HireNet: a hierarchical attention model for the automatic analysis of asynchronous video job interviews. In AAAI 2019. ACM.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Caroline Langlet and Chloé Clavel. 2015a. Adapting sentiment analysis to face-to-face human-agent interactions: from the detection to the evaluation issues. In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on, pages 14–20. IEEE.

Caroline Langlet and Chloé Clavel. 2015b. Improving social relationships in face-to-face human-agent interactions: when the agent wants to know user's likes and dislikes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1064–1073.

Caroline Langlet and Chloé Clavel. 2016. Grounding the detection of the user's likes and dislikes on the topic structure of human-agent interactions. Knowledge-Based Systems, 106:116–124.

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics.

James R. Martin and Peter R. White. 2013. The Language of Evaluation, volume 2. Springer.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.

Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. 2016. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 284–288. ACM.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 50–57. ACM.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, AL-Smadi Mohammad, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2018. A hierarchical multi-task approach for learning embeddings from semantic tasks. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019).

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L. P. Morency. 2018a. Multi-attention recurrent network for human communication comprehension. In AAAI.

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018b. Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016a. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016b. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 31(6):82–88.


A Supplemental Material for the paper: From the Token to the Review: A joint Hierarchical Multimodal approach to Opinion Mining

A.1 Preprocessing details

• Matching features and annotations: In all our experiments we reused the descriptors originally presented in (Park et al., 2014) and made available in the CMU-Multimodal SDK. The annotation campaign performed in (Garcia et al., 2019) had been run on the unprocessed transcripts of the spoken reviews. In order to match the setting described in previous work, we transposed the fine grained annotations from the unprocessed dataset to the processed one in the following way: we first computed the Levenshtein distance (the minimum number of insertions/deletions/replacements needed to transform a sequence of items into another) between the token sequences of the processed and unprocessed transcripts. We then applied the sequence of transformations minimizing this distance to the sequence of annotation tags, in order to build the equivalent sequence of annotations on the processed dataset. A sketch of this tag-transfer step is given after this list.

• Long sentence treatment: We first removed the punctuation (denoted by the 'sp' token in the provided featurized dataset) in order to limit the maximal sentence length in the dataset. For the remaining sentences exceeding 50 tokens, we also applied the following treatment: we ran the sentence splitter from the spaCy library, and the resulting subsentences were kept whenever they were composed of more than 4 tokens (otherwise the groups of 4 tokens or fewer were merged with the next subsentence).

• Input feature clipping: The provided feature alignment code produced some infinite values and impossible assignments. We clipped the values to the range [-30, 30] and replaced impossible assignments by 0.

• Training, validation and test folds: We used the original standard folds available at: https://github.com/A2Zadeh/CMU-MultimodalSDK/blob/master/mmsdk/mmdatasdk/dataset/standard_datasets/POM/pom_std_folds.py
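The alignment step above can be sketched with difflib, whose opcodes realize the same insertion/deletion/replacement alignment; this is an illustrative reimplementation, not the authors' script.

```python
import difflib

def transfer_tags(src_tokens, src_tags, dst_tokens, default="O"):
    # Map per-token annotation tags from the unprocessed transcript
    # (src) onto the processed one (dst) via an edit-based alignment.
    dst_tags = [default] * len(dst_tokens)
    sm = difflib.SequenceMatcher(a=src_tokens, b=dst_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            dst_tags[j1:j2] = src_tags[i1:i2]
        elif op == "replace":
            for k in range(j2 - j1):  # copy tags positionally
                dst_tags[j1 + k] = src_tags[min(i1 + k, i2 - 1)]
    return dst_tags
```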

A.2 Architecture details

In this section we detail the structure of the different architectures tested in Experiment 1. Following the notations of the paper, we detail in particular how the hidden representations $h^{(i),\mathrm{Tex}}, h^{(i),\mathrm{Sent}}_j, h^{(i),\mathrm{Tok}}_{j,k}$ are computed in practice.

A.2.1 Token-level model

• BiGRU based models (for the sake of simplicity we consider only one direction in the equations): The hidden state of a gated recurrent unit at time $t$, $h^j_t$, is computed from the previous state $h^j_{t-1}$ and a new candidate state $\tilde{h}^j_t$:

$$h^j_t = (1 - z^j_t)\, h^j_{t-1} + z^j_t\, \tilde{h}^j_t,$$

where $z^j_t$ is an update vector controlling how much the state is updated:

$$z^j_t = \sigma(W_z x_t + U_z h_{t-1})^j.$$

The candidate state is computed by a simple recurrent unit with an additional reset gate $r_t$:

$$\tilde{h}^j_t = \big(\tanh(W x_t + U (r_t \odot h_{t-1}))\big)^j,$$

where $\odot$ is the element-wise product and $r_t$ is defined by:

$$r^j_t = \sigma(W_r x_t + U_r h_{t-1})^j.$$
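For concreteness, these update equations translate line by line into the following single-step function (numpy, one direction, biases omitted as in the equations above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand       # new hidden state
```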

• In the case of the BiGRU model of Table 1, the input objects $x_t$ are the concatenation of the 3 feature representations: $x_t = x^{\mathrm{textual}}_t \oplus x^{\mathrm{audio}}_t \oplus x^{\mathrm{visual}}_t$.

• In the case of the Ind BiGRU model, 3 BiGRU recurrent models are trained independently on each input modality, and the hidden representation shared with the next parts of the network is the concatenation of the 3 hidden states: $h_t = h^{\mathrm{textual}}_t \oplus h^{\mathrm{audio}}_t \oplus h^{\mathrm{visual}}_t$.

For these models, an entire sentence is encoded by the state computed at the last token of the sentence. In the case of the Ind BiGRU + att model, an additional attention model is used: it first computes a score per token, $u^j_t$, indicating its relative contribution:

$$u^j_t = \tanh(W_w h^j_t + b_w).$$

These scores are then rescaled as a probability distribution with a softmax over the entire sequence:

$$\alpha_t = \frac{h_t^T u_t}{\sum_{t_j} h_{t_j}^T u_{t_j}}.$$

These weights are then used to pool the hidden state representations of the sequence into a fixed length vector:

$$h^{\mathrm{Pool}} = \sum_t \alpha_t h_t.$$
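Written as code, this pooling is a few lines; the normalization below follows the formula as printed (an exponential softmax over the scores is the more numerically robust variant):

```python
import torch

def attention_pool(h, W_w, b_w):
    # h: (T, d) hidden states of one sentence; W_w: (d, d); b_w: (d,).
    u = torch.tanh(h @ W_w.T + b_w)             # per-token score vectors u_t
    scores = (h * u).sum(dim=1)                 # h_t^T u_t
    alpha = scores / scores.sum()               # rescaled over the sequence
    return (alpha.unsqueeze(1) * h).sum(dim=0)  # pooled representation
```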

This last representation is then used to feed the sentence-level recurrent model (here a Memory Fusion Network). Note that the attention model does not erase the information about the modality nature of each component of $h^{\mathrm{Pool}}$, so that it can be used with a model taking this nature into account.

• MARN model: The Multi-Attention Recurrent Network from Zadeh et al. (2018a) relies on 2 components:

  • The Long Short Term Hybrid Memory is an LSTM model whose hidden state is the concatenation of a local hidden state (computed using the classical LSTM layer) and an external hidden state computed using a Multi-Attention Block (MAB).

  • The MAB computes the external hidden state by computing 3 weighted sums of the input hidden states (one set of attention weights per modality), which are then passed through 3 feedforward networks. The outputs of these networks are concatenated and passed through a second network to produce the final joint hidden representation. The detailed equations can be found in the original paper.

Similarly to the original paper, we use the last hidden state computed from an entire sequence to represent it.

• MFN model: The Memory Fusion Network (Zadeh et al., 2018b) is made of 3 blocks:

  • Each modality-based sequence of features is represented by the hidden states of an LSTM model. These hidden states are fed to the next part of the model.

  • A delta attention memory takes the concatenation of two consecutive input vectors (taken from the sequence of hidden representations of the LSTMs), which is fed to a feedforward model to compute an attention score for each component of these inputs. The name "delta memory" simply indicates the fact that the inputs are taken in pairs.

  • The output of the attention layer is then sent to a Multi-view Gated Memory, which generalizes the GRU layer to multi-view data by taking into account a modality-specific and a cross-modality hidden representation.

The MFN model is our common model at the sentence level.

A.3 Hyperparameters

All the hyper-parameters have been optimized on the validation set using the MAE score at the text level. Architecture optimization has been done using a random search with 15 trials. We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.01, updated using a scheduler with a patience of 20 epochs and a decrease rate of 0.5 (one scheduler per classifier and per encoder). The gradient norm is clipped to 5.0, the weight decay is set to 1e-5, and dropout (Srivastava et al., 2014) is set to 0.2. The models have been implemented in PyTorch and trained on a single IBM Power AC922. The best performing MFN has 4 attentions; the cell size is set to 48 for the video, 32 for the audio and 64 for the text. The memory dimension is set to 32, the window dimension to 2, the hidden size of the first attention to 32 and the hidden size of the second attention to 16; γ1 is set to 64 and γ2 to 32.²
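This optimization setup maps onto standard PyTorch components roughly as follows (a sketch; the stand-in model is arbitrary):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for one encoder/classifier
optimizer = torch.optim.Adam(
    model.parameters(), lr=0.01, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=20)

# Per step: clip the gradient norm before optimizer.step().
# torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
# Per epoch: step the scheduler on the validation MAE.
# scheduler.step(val_mae)
```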

A.4 Experiment 2: Strategies displayed

In this section we report the detailed expressions of the $\lambda^{(n_{\mathrm{epoch}})}$ schedules displayed in Figure 3.

A.4.1 Strategy 1

In strategy 1, the task losses are activated one at a time, following the equations:

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Token}} = 1 - \frac{\exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)}{1 + \exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)},$$

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Sentence}} = \frac{\exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)}{1 + \exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)} - \frac{\exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Sentence}}})/\sigma)}{1 + \exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Sentence}}})/\sigma)},$$

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Text}} = \frac{\exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Sentence}}})/\sigma)}{1 + \exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Sentence}}})/\sigma)}.$$

We report the graphs of the corresponding schedules as a function of the number of epochs in Figures 4, 5 and 6.

² For the exact meaning of each parameter, please refer to the official implementation, which can be found at https://github.com/pliang279/MFN, and to the work of Zadeh et al. (2018b).

Figure 4: Strategy 1, σ = 1

Figure 5: Strategy 1, σ = 5

A.4.2 Strategy 2

In strategy 2, the task losses are sequentially activated and then maintained, following the equations:

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Token}} = 0.05,$$

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Sentence}} = 0.5\, \frac{\exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)}{1 + \exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)},$$

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Text}} = \frac{\exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Sentence}}})/\sigma)}{1 + \exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Sentence}}})/\sigma)}.$$

We report the graphs of the corresponding schedules as a function of the number of epochs in Figures 7, 8 and 9.

A.4.3 Strategy 3

In strategy 3, the Sentence and Text losses are activated at the same time:

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Token}} = 0.05,$$

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Sentence}} = 0.5\, \frac{\exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)}{1 + \exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)},$$

$$\lambda^{n_{\mathrm{epoch}}}_{\mathrm{Text}} = \frac{\exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)}{1 + \exp((n_{\mathrm{epoch}} - N_{s_{\mathrm{Token}}})/\sigma)}.$$

We report the graphs of the corresponding schedules as a function of the number of epochs in Figures 10, 11 and 12.

Figure 6: Strategy 1, σ = 10

Figure 7: Strategy 2, σ = 1

Figure 8: Strategy 2, σ = 5

Figure 9: Strategy 2, σ = 10

Figure 10: Strategy 3, σ = 1

Figure 11: Strategy 3, σ = 5

Figure 12: Strategy 3, σ = 10