



Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2046–2056, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics


UR-FUNNY: A Multimodal Language Dataset for Understanding Humor

Md Kamrul Hasan1*, Wasifur Rahman1*, Amir Zadeh2, Jianyuan Zhong1, Md Iftekhar Tanveer1†, Louis-Philippe Morency2, Mohammed (Ehsan) Hoque1

1 - Department of Computer Science, University of Rochester, USA
2 - Language Technologies Institute, SCS, CMU, USA

{mhasan8,echowdh2}@cs.rochester.edu, [email protected], [email protected], [email protected], [email protected], [email protected]

[Figure 1 content: an example segment with aligned Video, Audio and Text tracks (timestamps from t=0:41 to t=1:20), divided into Context and Punchline, with the binary label "Humor? Yes/No". Transcript: "I can put a bizarre new idea in your mind right now. I can say, imagine a jellyfish waltzing in a library while thinking about quantum mechanics. Now, if everything has gone relatively well in your life, you probably haven’t had that thought before. So because of this ability, we humans are able to transmit knowledge across time. Your brain takes those vibrations from your eardrums and transforms them into thoughts."]

Figure 1: An example of the UR-FUNNY dataset. UR-FUNNY presents a framework to study the dynamics of humor in multimodal language. Machine learning models are given a sequence of sentences with the accompanying modalities of visual and acoustic. Their goal is to detect whether or not the sequence will trigger immediate laughter after the punchline.

Abstract

Humor is a unique and creative communicative behavior often displayed during social interactions. It is produced in a multimodal manner, through the usage of words (text), gestures (visual) and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, in a multimodal context it has been understudied. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding the multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.

1 Introduction

Humor is a unique communication skill that removes barriers in conversations. Research shows that effective use of humor allows a speaker to establish rapport (Stauffer, 1999), grab attention (Wanzer et al., 2010), introduce a difficult concept without confusing the audience (Garner, 2005) and even build trust (Vartabedian and Vartabedian, 1993). During face-to-face communication, humor involves multimodal communicative channels, including effective use of words (text), accompanying gestures (visual) and prosodic cues (acoustic). Being able to mix and align those modalities appropriately is often unique to individuals, giving rise to many different styles. Styles include gradually building up to a punchline using words, gestures and prosodic cues, a sudden twist to the story with an unexpected

* - Equal contribution; † - Currently affiliated with Comcast Applied AI Research, Washington DC, USA.


punchline (Ramachandran, 1998), creating a discrepancy between modalities (e.g., something funny being said without any emotion), or simply laughing along with the speech to stimulate the audience to mirror the laughter (Provine, 1992).

Modeling humor using a computational framework is inherently challenging due to factors such as: 1) Idiosyncrasy: often the most humorous people are also the most creative ones (Hauck and Thomas, 1972). This creativity in turn adds to the dynamic complexity of how humor is expressed in a multimodal manner. The use of words, gestures and prosodic cues, and their (mis)alignments, are choices that a creative speaker often experiments with. 2) Contextual Dependencies: humor often develops through time as speakers plan for a punchline in advance. There is a gradual build up in the story with a sudden twist delivered by a punchline (Ramachandran, 1998). Some punchlines, when viewed in isolation (as illustrated in Figure 1), may not appear funny. The humor stems from the prior build up, cross-referencing multiple sources, and its delivery. Therefore, a full understanding of humor requires analyzing the context of the punchline.

Understanding the unique dependencies across modalities and their impact on humor requires knowledge of multimodal language, a recent research trend in the field of natural language processing (Zadeh et al., 2018b). Studies in this area aim to explain natural language from the three modalities of text, visual and acoustic. In this paper, alongside computational descriptors for text, gestures such as smiles and vocal properties such as loudness are measured and put together in a multimodal framework to define humor recognition as a multimodal task.

The main contribution of this paper to the NLP community is introducing the first multimodal language dataset (including text, visual and acoustic modalities) for humor detection, named "UR-FUNNY". This dataset opens the door to understanding and modeling humor in a multimodal framework. The studies in this paper present performance baselines for this task and demonstrate the impact of using all three modalities together for humor modeling.

2 Background

The dataset and experiments in this paper are connected to the following areas:
Humor Analysis: Humor analysis has been

Dataset            #Pos    #Neg    Mod       Type       #Spk
16000 One-Liners   16000   16000   {t}       joke       -
Pun of the Day     2423    2423    {t}       pun        -
PTT Jokes          1425    2551    {t}       political  -
Ted Laughter       4726    4726    {t}       speech     1192
Big Bang Theory    18691   24981   {t,a}     tv show    <50
UR-FUNNY           8257    8257    {t,a,v}   speech     1741

Table 1: Comparison between UR-FUNNY and notable humor detection datasets in the NLP community. Here, '#', 'pos', 'neg', 'mod' and 'spk' denote number, positive, negative, modalities and speakers respectively.

among the active areas of research in both natural language processing and affective computing. Notable datasets in this area include "16000 One-Liners" (Mihalcea and Strapparava, 2005), "Pun of the Day" (Yang et al., 2015), "PTT Jokes" (Chen and Soo, 2018), "Ted Laughter" (Chen and Lee, 2017), and "Big Bang Theory" (Bertero et al., 2016). These datasets have studied humor from different perspectives. For example, "16000 One-Liners" and "Pun of the Day" focus on joke detection (a binary task), while "Ted Laughter" focuses on punchline detection (whether or not a punchline triggers laughter). Similar to "Ted Laughter", UR-FUNNY focuses on punchline detection. Furthermore, each punchline is accompanied by context sentences to properly model the build up of humor. Unlike previous datasets where negative samples were drawn from a different domain, UR-FUNNY uses a challenging negative sampling scheme where samples are drawn from the same videos. Furthermore, UR-FUNNY is the only humor detection dataset which incorporates all three modalities of text, visual and acoustic. Table 1 shows a comparison between previously proposed datasets and the UR-FUNNY dataset.

From a modeling perspective, humor detection has been approached with hand-crafted features and non-neural models (Yang et al., 2015), and with neural RNN and CNN models on Yelp reviews (de Oliveira et al., 2017) and TED talks data (Chen and Lee, 2017). Newer approaches have used highway networks (Chen and Soo, 2018) on the "16000 One-Liners" and "Pun of the Day" datasets. There have been very few attempts at using extra modalities alongside language for detecting humor, mostly limited to adding simple audio features (Rakov and Rosenberg, 2013; Bertero et al., 2016) or the description of cartoon images (Shahaf et al., 2015). Furthermore, these attempts have been restricted to certain topics and domains (such as the "Big Bang Theory" TV show


(Bertero et al., 2016)).
Multimodal Language Analysis: Studying natural language from the modalities of text, visual and acoustic is a recent research trend in natural language processing (Zadeh et al., 2018b). Notable works in this area present novel multimodal neural architectures (Wang et al., 2019; Pham et al., 2019; Hazarika et al., 2018; Poria et al., 2017; Zadeh et al., 2017), multimodal fusion approaches (Liang et al., 2018; Tsai et al., 2018; Liu et al., 2018; Zadeh et al., 2018a; Barezi et al., 2018) as well as resources (Poria et al., 2018a; Zadeh et al., 2018c, 2016; Park et al., 2014; Rosas et al., 2013; Wollmer et al., 2013). Multimodal language datasets mostly target multimodal sentiment analysis (Poria et al., 2018b), emotion recognition (Zadeh et al., 2018c; Busso et al., 2008), and personality trait recognition (Park et al., 2014). The UR-FUNNY dataset is similar to the above datasets in diversity (speakers and topics) and size, with the main task of humor detection. Beyond the scope of multimodal language analysis, the dataset and studies in this paper have similarities to other applications in multimodal machine learning such as language and vision studies, robotics, image captioning, and media description (Baltrusaitis et al., 2019).

3 UR-FUNNY Dataset

In this section we present the UR-FUNNY dataset1. We first discuss the data acquisition process, and subsequently present statistics of the dataset as well as multimodal feature extraction and validation.

3.1 Data Acquisition

A suitable dataset for the task of multimodal humor detection should be diverse in a) speakers: modeling the idiosyncratic expressions of humor may require a dataset with a large number of speakers, and b) topics: different topics exhibit different styles of humor, as the context and punchline can be entirely different from one topic to another.

TED talks2 are among the most diverse idea-sharing channels, in both speakers and topics. Speakers from various backgrounds, ethnic groups and cultures present their thoughts through a

1 UR-FUNNY Dataset: https://github.com/ROC-HCI/UR-FUNNY

2 Videos on www.ted.com are publicly available for download.

widely popular channel3. The topics of these presentations are diverse, from scientific discoveries to everyday ordinary events. As a result of this diversity in speakers and topics, TED talks span a broad spectrum of humor. Therefore, this platform presents a unique resource for studying the dynamics of humor in a multimodal setup.

TED videos include manual transcripts and audience markers. Transcriptions are highly reliable, which in turn allows for aligning the text and audio. This property makes TED talks a unique resource for the newest continuous fusion trends (Chen et al., 2017). Transcriptions also include reliably annotated markers for audience behavior. Specifically, the "laughter" marker has been used in NLP studies as an indicator of humor (Chen and Lee, 2017). Previous studies have identified the importance of both punchline and context in understanding and modeling humor. In a humorous scenario, the context is the gradual build up of a story and the punchline is a sudden twist to the story which causes laughter (Ramachandran, 1998). Using the provided laughter marker, the sentence immediately before the marker is considered the punchline, and the sentences prior to the punchline (but after the previous laughter marker) are considered the context.

We collect 1866 videos as well as their transcripts in English from the TED portal. These 1866 videos are chosen from 1741 different speakers and across 417 topics. The laughter markup is used to filter out 8257 humorous punchlines from the transcripts (Chen and Lee, 2017). The context is extracted from the sentences prior to the punchline (until the previous humor instance or the beginning of the video is reached). Using a similar approach, 8257 negative samples are chosen at random intervals where the last sentence is not immediately followed by a laughter marker. The last sentence is assumed to be a punchline and, similar to the positive instances, the context is chosen. This negative sampling uses sentences from the same distribution, as opposed to datasets which use sentences from other distributions or domains as negative samples (Yang et al., 2015; Mihalcea and Strapparava, 2005). After this negative sampling, there is a homogeneous 50% split in the dataset between positive and negative examples.
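To make the sampling procedure above concrete, the following is a minimal sketch (not the released preprocessing code) of how humorous and non-humorous samples could be drawn from a transcript whose sentences carry laughter-marker flags; the function and field names are illustrative assumptions.

```python
import random

def extract_samples(sentences, laughter_flags, num_negatives=None):
    """sentences: transcript sentences in order of utterance.
    laughter_flags[i]: True if a laughter marker immediately follows sentences[i]."""
    # previous_boundary[i] = index of the first sentence after the most recent
    # laughter marker before sentence i (0 at the start of the talk)
    previous_boundary, boundary = [], 0
    for i, flag in enumerate(laughter_flags):
        previous_boundary.append(boundary)
        if flag:
            boundary = i + 1

    # Positive samples: sentences followed by a laughter marker (punchlines),
    # with the preceding sentences since the last marker as context.
    positives = [
        {"context": sentences[previous_boundary[i]:i], "punchline": sentences[i], "label": 1}
        for i, flag in enumerate(laughter_flags) if flag
    ]

    # Negative samples: sentences NOT followed by laughter, drawn from the same
    # talk, with the same context-construction rule as the positives.
    candidates = [i for i, flag in enumerate(laughter_flags) if not flag]
    k = num_negatives or min(len(positives), len(candidates))
    negatives = [
        {"context": sentences[previous_boundary[i]:i], "punchline": sentences[i], "label": 0}
        for i in random.sample(candidates, k)
    ]
    return positives, negatives
```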

Using forced alignment, we mark the beginningand end of each sentence in the video as well as

3 More than 12 million subscribers on YouTube: https://www.youtube.com/user/TEDtalksDirector



Figure 2: Overview of UR-FUNNY dataset statistics. (a) the distribution of punchline sentence length for humor and non-humor cases. (b) the distribution of context sentence length for humor and non-humor cases. (c) the distribution of the number of sentences in the context. (d) the distribution of the duration (in seconds) of punchline and context sentences. (e) topics of the videos in the UR-FUNNY dataset. Best viewed zoomed in and in color.

the words and phonemes in the sentences (Yuan and Liberman, 2008). Therefore, an alignment is established between text, visual and acoustic. Utilizing this alignment, the timing of the punchline as well as the context is extracted for all instances in the dataset.

3.2 Dataset Statistics

The high-level statistics of the UR-FUNNY dataset are presented in Table 2. The total duration of the entire dataset is 90.23 hours. There are a total of 1741 distinct speakers and 417 distinct topics in the UR-FUNNY dataset. Figure 2.e shows a word cloud of the topics based on the log-frequency of each topic. The five most frequent topics are technology, science, culture, global issues and design4. There are in total 16514 video segments of humor and non-humor instances (equal splits of 8257). The average duration of each data instance is 19.67 seconds, with the context averaging 14.7 seconds and the punchline averaging 4.97 seconds. The average number of words in a punchline is 16.14 and the average number of words in a context sentence is 14.80.

Figure 2 shows an overview of some of the important statistics of the UR-FUNNY dataset. Figure 2.a demonstrates the distribution of punchline length for humor and non-humor cases based on the number

4 Metadata collected from www.ted.com

General
  total #videos                                       1866
  total duration (hours)                              90.23
  total #distinct speakers                            1741
  total #distinct topics                              417
  total #humor instances                              8257
  total #non-humor instances                          8257
  total #words                                        965573
  total #unique words                                 32995
  total #sentences                                    63727
  avg length of sentences in words                    15.15
  avg duration of sentences (s)                       4.64
Punchline
  #sentences in punchline                             1
  avg #words in punchline                             16.14
  avg #words in humorous punchline                    15.17
  avg #words in non-humorous punchline                17.10
  avg duration of punchline (s)                       4.97
  avg duration of humorous punchline (s)              4.58
  avg duration of non-humorous punchline (s)          5.36
Context
  avg total #words in context                         42.33
  avg #words in context sentences                     14.80
  avg #sentences in context                           2.86
  avg #sentences in humorous context                  2.82
  avg #sentences in non-humorous context              2.90
  avg duration of context (s)                         14.7
  avg duration of humorous context (s)                13.79
  avg duration of non-humorous context (s)            15.62
  avg duration of context sentences (s)               4.25
  avg duration of humorous context sentences (s)      4.79
  avg duration of non-humorous context sentences (s)  4.52

Table 2: Summary of the UR-FUNNY dataset statistics. Here, '#' denotes number, 'avg' denotes average and 's' denotes seconds.


                               Train   Val     Test
#humor instances               5306    1313    1638
#not humor instances           5292    1313    1652
#videos used                   1166    300     400
#speakers                      1059    294     388
avg #words in punchline        15.81   16.94   16.55
avg #words in context          41.69   42.86   43.94
avg #sentences in context      2.84    2.81    2.95
punchline avg duration (s)     4.85    5.25    5.15
context avg duration (s)       14.39   14.91   15.54

Table 3: Statistics of the train, validation and test folds of the UR-FUNNY dataset. Here, 'avg' denotes average and '#' denotes number.

of words. There is no clear distinction between humor and non-humor punchlines, as both follow a similar distribution. Similarly, Figure 2.b shows the distribution of the number of words per context sentence. Both humor and non-humor context sentences follow the same distribution. The majority (≥ 90%) of punchlines have a length of less than 32 words. Figure 2.d shows the distribution of punchline and context sentence length in seconds. Figure 2.c demonstrates the distribution of the number of context sentences per humor and non-humor data instance. The number of context sentences per humor and non-humor case is also roughly the same. The statistics in Figure 2 show that there are no trivial or degenerate distinctions between humor and non-humor cases. Therefore, classification of humor versus non-humor cases cannot be done based on simple measures (such as the number of words); it requires understanding the content of the sentences.

Table 3 shows the standard train, validation and test folds of the UR-FUNNY dataset. These folds share no speakers with each other; hence, the standard folds are speaker independent (Zadeh et al., 2016). This minimizes the chance of overfitting to the identity of the speakers or their communication patterns.

3.3 Extracted Features

For each modality, the extracted features are as follows:

Language (text): GloVe word embeddings (Pennington et al., 2014) are used as pre-trained word vectors for the text features. The P2FA forced alignment model (Yuan and Liberman, 2008) is used to align the text and audio at the phoneme level. From the forced alignment, we extract the timing annotations of the context and punchline at the

word level. Then, the acoustic and visual cues are aligned at the word level by interpolation (Chen et al., 2017).
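As an illustration of this word-level alignment step, the sketch below assumes that the "expected" acoustic or visual descriptor of a word is the average of all 30 fps frames falling between the word's start and end times from forced alignment; the exact interpolation used by the authors may differ.

```python
import numpy as np

def align_frames_to_words(frames, frame_rate, word_times):
    """frames: (num_frames, feat_dim) array of COVAREP or OpenFace features.
    frame_rate: frames per second (30 for both modalities here).
    word_times: list of (start_sec, end_sec) tuples, one per word.
    Returns a (num_words, feat_dim) array with one descriptor per word."""
    aligned = []
    for start, end in word_times:
        lo = int(np.floor(start * frame_rate))
        hi = max(lo + 1, int(np.ceil(end * frame_rate)))
        segment = frames[lo:min(hi, len(frames))]
        if len(segment) == 0:                    # word falls outside the feature track
            segment = np.zeros((1, frames.shape[1]))
        aligned.append(segment.mean(axis=0))     # expected descriptor for this word
    return np.stack(aligned)
```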

Acoustic: The COVAREP software (Degottex et al., 2014) is used to extract acoustic features at a rate of 30 frames/sec. We extract the following 81 features: fundamental frequency (F0), voiced/unvoiced segmenting features (VUV) (Drugman and Alwan, 2011), normalized amplitude quotient (NAQ), quasi open quotient (QOQ) (Kane and Gobl, 2013), glottal source parameters (H1H2, Rd, Rd conf) (Drugman et al., 2012; Alku et al., 2002, 1997), parabolic spectral parameter (PSP), maxima dispersion quotient (MDQ), spectral tilt/slope of wavelet responses (peak/slope), Mel cepstral coefficients (MCEP 0-24), harmonic model and phase distortion means (HMPDM 0-24) and deviations (HMPDD 0-12), and the first 3 formants. These acoustic features are related to emotions and tone of speech.

Visual: The OpenFace facial behavior analysis tool (Baltrusaitis et al., 2016) is used to extract facial expression features at a rate of 30 frames/sec. We extract all facial Action Unit (AU) features based on the Facial Action Coding System (FACS) (Ekman, 1997). Rigid and non-rigid facial shape parameters are also extracted (Baltrusaitis et al., 2016). We observed that the camera angle and position change frequently during TED presentations. However, for the majority of the time, the camera stays focused on the presenter. Due to the volatile camera work, the only consistently available source of visual information was the speaker's face.

The UR-FUNNY dataset is publicly available for download alongside all the extracted features.

4 Multimodal Humor Detection

In this section, we first outline the problem formulation for performing binary multimodal humor detection on the UR-FUNNY dataset. We then proceed to study the UR-FUNNY dataset through the lens of a contextualized extension of the Memory Fusion Network (MFN) (Zadeh et al., 2018a), a state-of-the-art model in multimodal language.


!",$,%

!",$,$

...

!",$,&'(

!),$,%

!),$,$

...

!),$,&'(

!*,$,%

!*,$,$

...

!*,$,&'(

+ = 1 + = 2

...

...

...

+ = /0

+ = 2Context

LSTMt LSTMv LSTMa

ℎ*,$ ℎ",$ ℎ),$

2

Figure 3: The structure of the Unimodal Context Network as outlined in Section 4.2.1. For demonstration purposes, we show the case for n = 2 (the second context sentence). After n = N_C, the output H (outlined in blue) is complete. Best viewed in color.

4.1 Problem Formulation

The UR-FUNNY dataset is a multimodal dataset with the three modalities of text, visual and acoustic. We denote the set of these modalities as M = {t, v, a}. Each of the modalities comes in a sequential form. We assume word-level alignment between modalities (Yuan and Liberman, 2008). Since the sampling frequency of the text modality is lower than that of visual and acoustic (i.e. visual and acoustic have a higher sampling rate), we use expected visual and acoustic descriptors for each word (Chen et al., 2017). After this process, each modality has the same sequence length (each word is accompanied by a single visual and a single acoustic vector).

Each data sample in UR-FUNNY can be described as a triplet (l, P, C), with l being a binary label for humor or non-humor, P the punchline and C the context. Both the punchline and the context have multiple modalities: P = {P_m; m ∈ M} and C = {C_m; m ∈ M}. If there are N_C context sentences accompanying the punchline, then C_m = [C_{m,1}, C_{m,2}, ..., C_{m,N_C}] are the context sentences in modality m, ordered from the first to the last (N_C-th) sentence. K_P is the number of words in the punchline and K_{C_n} (for n = 1, ..., N_C) is the number of words in the n-th context sentence. Using this notation, P_{m,k} refers to the k-th entry in modality m of the punchline. Similarly, C_{m,n,k} refers to the k-th entry in modality m of the n-th context sentence.

Models developed on the UR-FUNNY dataset are trained on triplets (l, P, C). During testing, only the tuple (P, C) is given and the model must predict l. Here l is the label for laughter, specifically whether or not the inputs P and C are likely to trigger laughter.
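A minimal sketch of how one such (l, P, C) triplet could be represented in code is shown below; the container and field names are assumptions made for illustration, not the released file format.

```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

MODALITIES = ("t", "v", "a")                  # M = {t, v, a}

@dataclass
class Instance:
    label: int                                # l: 1 if the punchline triggers laughter, else 0
    punchline: Dict[str, np.ndarray]          # P_m, shape (K_P, d_m) for each modality m
    context: Dict[str, List[np.ndarray]]      # C_m: list of N_C arrays, the n-th of shape (K_{C_n}, d_m)

    @property
    def num_context_sentences(self) -> int:   # N_C
        return len(self.context["t"])
```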

4.2 Contextual Memory Fusion Baseline

The Memory Fusion Network (MFN) is among the state-of-the-art models for several multimodal datasets (Zadeh et al., 2018a). We devise an extension of the MFN model, named the Contextual Memory Fusion Network (C-MFN), as a baseline for humor detection on the UR-FUNNY dataset. This is done by introducing two components that allow the involvement of context in the MFN model: 1) a Unimodal Context Network, where information from each modality is encoded using M Long Short-Term Memory networks (LSTMs), and 2) a Multimodal Context Network, where the unimodal context representations are fused (using self-attention) to extract multimodal context information. We discuss the components of the C-MFN model in the following subsections.

4.2.1 Unimodal Context Network
To model the context, we first model each modality within the context. The Unimodal Context Network (Figure 3) consists of M LSTMs, one for each modality m ∈ M, denoted LSTM_m. For each context sentence n of each modality m ∈ M, LSTM_m is used to encode the information into a single vector h_{m,n}. This single vector is the last output of LSTM_m given C_{m,n} as input. The recurrence step for each LSTM is the utterance of each word (due to word-level alignment, the visual and acoustic modalities also follow this time-step). As sketched below, the output of the Unimodal Context Network is the set H = {h_{m,n}; m ∈ M, 1 ≤ n ≤ N_C}.
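A minimal PyTorch sketch of this component follows: one LSTM per modality encodes each (padded) context sentence into a single vector h_{m,n}, taken as the LSTM's last hidden state. The feature dimensions and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnimodalContextNet(nn.Module):
    def __init__(self, input_dims=None, hidden=64):
        super().__init__()
        # Assumed feature sizes: GloVe (300), OpenFace (75), COVAREP (81)
        input_dims = input_dims or {"t": 300, "v": 75, "a": 81}
        self.lstms = nn.ModuleDict(
            {m: nn.LSTM(d, hidden, batch_first=True) for m, d in input_dims.items()})

    def forward(self, context):
        """context[m]: (batch, N_C, K, d_m) word-aligned features, padded to a
        common sentence length K. Returns H[m] of shape (batch, N_C, hidden)."""
        H = {}
        for m, lstm in self.lstms.items():
            b, n_c, k, d = context[m].shape
            flat = context[m].reshape(b * n_c, k, d)     # encode every sentence separately
            _, (h_last, _) = lstm(flat)                  # h_last: (1, b * N_C, hidden)
            H[m] = h_last.squeeze(0).reshape(b, n_c, -1)
        return H
```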

4.2.2 Multimodal Context Network
The Multimodal Context Network (Figure 4) learns a multimodal representation of the context based on the output H of the Unimodal Context Network. Sentences and modalities in the context can form complex asynchronous spatio-temporal relations. For example, during the gradual buildup of the



Figure 4: The structure of the Multimodal Context Network as outlined in Section 4.2.2. The output H of the Unimodal Context Network is connected to an encoder module to obtain the multimodal output H̄. For the details of the components outlined in orange, please refer to the original paper (Vaswani et al., 2017). Best viewed in color.

context, the speaker's facial expression may be impacted by an arbitrary previously uttered sentence. Transformers (Vaswani et al., 2017) are a family of neural models that specialize in finding various temporal relations between their inputs through self-attention. By concatenating the representations h_{m,n} over all M modalities of the n-th context sentence, a self-attention model can be applied to find asynchronous spatio-temporal relations in the context. We use an encoder with 6 intermediate layers to derive a multimodal representation H̄ conditioned on H. H̄ is also spatio-temporal (as the outputs of encoders in a Transformer are). The output of the Multimodal Context Network is the encoder output H̄.
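The sketch below illustrates this component under the stated design: the per-sentence unimodal vectors are concatenated across modalities, linearly embedded, and passed through a standard 6-layer Transformer encoder. The model dimension and number of attention heads are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalContextNet(nn.Module):
    def __init__(self, hidden=64, d_model=128, n_layers=6, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(3 * hidden, d_model)      # "Embedding" step of Figure 4
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, H):
        """H[m]: (batch, N_C, hidden) from the Unimodal Context Network.
        Returns H_bar: (batch, N_C, d_model), a multimodal context representation
        in which self-attention relates any pair of context sentences."""
        x = torch.cat([H["t"], H["v"], H["a"]], dim=-1)  # (batch, N_C, 3 * hidden)
        return self.encoder(self.embed(x))
```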

4.2.3 Memory Fusion Network (MFN)
After learning unimodal (H) and multimodal (H̄) representations of the context, we use a Memory Fusion Network (MFN) (Zadeh et al., 2018a) to model the punchline (Figure 5). The MFN contains 2 types of memories: a System of LSTMs with M unimodal memories to model each modality in the punchline, and a Multi-view Gated Memory which


Figure 5: The initialization and recurrence process of the Memory Fusion Network (MFN). The outputs of the Unimodal and Multimodal Context Networks (H and H̄) are used to initialize the MFN neural components. For the details of the components outlined in orange, please refer to the original paper (Zadeh et al., 2018a). Best viewed in color.

stores multimodal information. We use a simple trick to combine the Context Networks (Unimodal and Multimodal) with the MFN: we initialize the memories in the MFN using the outputs H (unimodal representation) and H̄ (multimodal representation). For the System of LSTMs, this is done by initializing the LSTM cell state of modality m with D_m(h_{m,1}, ..., h_{m,N_C}), where D_m is a fully connected neural network that maps the information from the m-th modality of the context to the cell state of the m-th LSTM in the System of LSTMs. The Multi-view Gated Memory is initialized with a non-linear projection D(H̄), where D is a fully connected neural network. Similar to the context, where modalities are aligned at the word level, the punchline is aligned the same way. Therefore, a word-level implementation of the MFN is used, where a word and the accompanying visual and acoustic descriptors of the punchline are used as input to the System of LSTMs at each time-step. The Multi-view Gated Memory is updated iteratively at every recurrence of the System of LSTMs using a Delta-memory Attention Network.

The final prediction of humor is conditioned on the last state of the System of LSTMs and the Multi-view Gated Memory using an affine mapping with a sigmoid activation.
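The following sketch shows only the initialization trick and the final classifier, assuming a word-level MFN implementation exists separately; the mean-pooling over context sentences and all layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

hidden, d_model, mem_dim = 64, 128, 128

# D_m: one fully connected layer per modality, mapping context information to
# the initial LSTM cell state of that modality in the System of LSTMs.
D_m = nn.ModuleDict({m: nn.Linear(hidden, hidden) for m in "tva"})
# D: fully connected layer mapping the multimodal context representation to the
# initial Multi-view Gated Memory.
D = nn.Linear(d_model, mem_dim)

def init_mfn_states(H, H_bar):
    """H[m]: (batch, N_C, hidden); H_bar: (batch, N_C, d_model).
    Mean-pooling over the N_C context sentences is an assumption of this sketch."""
    c0 = {m: D_m[m](H[m].mean(dim=1)) for m in "tva"}   # initial cell states
    u0 = D(H_bar.mean(dim=1))                           # initial Multi-view Gated Memory
    return c0, u0

# Final humor prediction: an affine map with a sigmoid over the concatenation of
# the MFN's last LSTM states (one per modality) and its final memory.
classifier = nn.Sequential(nn.Linear(3 * hidden + mem_dim, 1), nn.Sigmoid())
```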


5 Experiments

In the experiments of this paper, our goal is to establish a performance baseline for the UR-FUNNY dataset. Furthermore, we aim to understand the role of context and punchline, as well as the role of individual modalities, in the task of humor detection. For all experiments, we use the proposed contextual extension of the Memory Fusion Network (MFN), called C-MFN (Section 4.2). Aside from the proposed C-MFN model, the following variants are also studied:

C-MFN (P): This variant of the C-MFN uses only the punchline with no contextual information. Essentially, this is equivalent to an MFN model, since the initialization trick is not used.

C-MFN (C): This variant of the C-MFN uses only contextual information without the punchline. Essentially, this is equivalent to removing the MFN and directly conditioning the humor prediction on the Unimodal and Multimodal Context Network outputs (a sigmoid-activated neuron after applying D_m; m ∈ M on H and D on H̄).

The above variants of the C-MFN allow for studying the importance of punchline and context in modeling humor. Furthermore, we compare the performance of the C-MFN variants in the following scenarios: (T) only the text modality is used, without visual and acoustic; (T+V) text and visual modalities are used, without acoustic; (T+A) text and acoustic modalities are used, without visual; (A+V) only visual and acoustic modalities are used; (T+A+V) all modalities are used together.

We compare the performance of the C-MFN variants across the above scenarios. This allows for understanding the role of context and punchline in humor detection, as well as the importance of the different modalities. All the models in our experiments are trained using categorical cross-entropy, calculated between the output of the model and the ground-truth labels. We also compare the performance of C-MFN to a Random Forest classifier as another strong non-neural baseline. We use summary features of the punchline and context for the Random Forest classifier.
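One simple way to realize the modality scenarios above is to zero out the features of the dropped modalities before they enter the model; this is an illustrative assumption rather than necessarily the authors' exact setup.

```python
import torch

def mask_modalities(batch, keep=("t", "a", "v")):
    """batch[m]: tensor of word-aligned features for modality m.
    Modalities not listed in `keep` are replaced by zero tensors of the same shape."""
    return {m: x if m in keep else torch.zeros_like(x) for m, x in batch.items()}

# e.g. the T+A scenario for a punchline batch (hypothetical variable name):
# punchline_ta = mask_modalities(punchline_batch, keep=("t", "a"))
```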

Modality     T       A+V     T+A     T+V     T+A+V
C-MFN (P)    62.85   53.3    63.28   63.22   64.47
C-MFN (C)    57.96   50.23   57.78   57.99   58.45
C-MFN        64.44   57.99   64.47   64.22   65.23

Table 4: Binary accuracy for the different variants of C-MFN and the training scenarios outlined in Section 5. The best performance is achieved using all three modalities of text (T), visual (V) and acoustic (A).

6 Results and Discussion

The results of our experiments are presented in Table 4. The results demonstrate that both context and punchline information are important, as C-MFN outperforms the C-MFN (P) and C-MFN (C) models. The punchline is the most important component for detecting humor, as the performance of C-MFN (P) is significantly higher than that of C-MFN (C).

Models that use all modalities (T+A+V) outperform models that use only one or two modalities (T, T+A, T+V, A+V). Between text (T) and nonverbal behaviors (A+V), text is the most important modality. In most cases, both the visual and acoustic modalities improve the performance of text alone (T+V, T+A).

Based on the above observations, each neural component of the C-MFN model is useful in improving the prediction of humor. The results also indicate that modeling humor from a multimodal perspective yields successful results. Furthermore, both context and punchline are important in understanding humor.

The highest accuracy achieved by the Random Forest baseline (after hyper-parameter tuning and using the same folds as C-MFN) was 57.78%, which is higher than the random baseline but lower than C-MFN (65.23%). In addition, C-MFN achieves higher accuracy than comparable notable unimodal previous work (58.9%), where only the punchline and textual information were used (Chen and Lee, 2017). The human performance5 on the UR-FUNNY dataset is 82.5%.

The results in Table 4 demonstrate that while a state-of-the-art model can achieve a reasonable level of success in modeling humor, there is still a large gap between human-level performance and the state of the art. Therefore, the UR-FUNNY

5 This is calculated by averaging the performance of two annotators over a shuffled set of 100 humor and 100 non-humor cases. The annotators were given the same input as the machine learning models (the same context and punchline). The annotators agree 84% of the time.


dataset presents new challenges to the field of NLP, specifically to the research areas of humor detection and multimodal language analysis.

7 Conclusion

In this paper, we presented a new multimodal dataset for humor detection called UR-FUNNY. This dataset is the first of its kind in the NLP community. Humor detection is framed as predicting laughter, similar to (Chen and Lee, 2017). UR-FUNNY is diverse in both speakers and topics. It contains the three modalities of text, visual and acoustic. We study this dataset through the lens of a Contextual Memory Fusion Network (C-MFN). The results of our experiments indicate that humor can be better modeled if all three modalities are used together. Furthermore, both context and punchline are important in understanding humor. The dataset and the accompanying experiments are publicly available.

Acknowledgment

This research was supported in part by grants W911NF-15-1-0542 and W911NF-19-1-0029 from the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO).

This material is based upon work partially supported by the National Science Foundation (Awards #1734868 and #1722822) and the National Institutes of Health. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the National Institutes of Health, and no official endorsement should be inferred.

References

Paavo Alku, Tom Backstrom, and Erkki Vilkman. 2002. Normalized amplitude quotient for parametrization of the glottal flow. The Journal of the Acoustical Society of America, 112(2):701–710.

Paavo Alku, Helmer Strik, and Erkki Vilkman. 1997. Parabolic spectral parameter - a new method for quantification of the glottal flow. Speech Communication, 22(1):67–79.

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443.

Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: an open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–10. IEEE.

Elham J Barezi, Peyman Momeni, Pascale Fung, et al. 2018. Modality-based factorization for multimodal fusion. arXiv preprint arXiv:1811.12624.

Dario Bertero, Pascale Fung, X Li, L Wu, Z Liu, B Hussain, Wc Chong, Km Lau, Pc Yue, W Zhang, et al. 2016. Deep learning of audio and language features for humor prediction. In LREC.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335.

Lei Chen and Chong Min Lee. 2017. Predicting audience's laughter during presentations using convolutional neural network. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 86–90.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrusaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171. ACM.

Peng-Yu Chen and Von-Wun Soo. 2018. Humor recognition using deep learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 113–117.

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP - a collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964. IEEE.

Thomas Drugman and Abeer Alwan. 2011. Joint robust voicing detection and pitch estimation based on residual harmonics. In Twelfth Annual Conference of the International Speech Communication Association.

Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor, and Thierry Dutoit. 2012. Detection of glottal closure instants from speech signals: A quantitative review. IEEE Transactions on Audio, Speech, and Language Processing, 20(3):994–1006.

Rosenberg Ekman. 1997. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, USA.


Randy Garner. 2005. Humor, analogy, and metaphor: Ham it up in teaching. Radical Pedagogy, 6(2):1.

William E Hauck and John W Thomas. 1972. The relationship of humor to intelligence, creativity, and intentional and incidental learning. The Journal of Experimental Education, 40(4):52–55.

Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018. Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2122–2132.

John Kane and Christer Gobl. 2013. Wavelet maxima dispersion for breathy to tense voice discrimination. IEEE Transactions on Audio, Speech, and Language Processing, 21(6):1170–1179.

Paul Pu Liang, Ziyin Liu, Amir Zadeh, and Louis-Philippe Morency. 2018. Multimodal language analysis with recurrent multistage fusion. arXiv preprint arXiv:1808.03920.

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.

Rada Mihalcea and Carlo Strapparava. 2005. Making computers laugh: Investigations in automatic humor recognition. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 531–538. Association for Computational Linguistics.

Luke de Oliveira, ICME Stanford, and Alfredo Lainez Rodrigo. 2017. Humor detection in Yelp reviews.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 50–57. ACM.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabas Poczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. arXiv preprint arXiv:1812.07809.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Multi-level multiple attentions for contextual multimodal sentiment analysis. In 2017 IEEE International Conference on Data Mining (ICDM), pages 1033–1038. IEEE.

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018a. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508.

Soujanya Poria, Amir Hussain, and Erik Cambria. 2018b. Multimodal Sentiment Analysis, volume 8. Springer.

Robert R Provine. 1992. Contagious laughter: Laughter is a sufficient stimulus for laughs and smiles. Bulletin of the Psychonomic Society, 30(1):1–4.

Rachel Rakov and Andrew Rosenberg. 2013. "Sure, I did the right thing": A system for sarcasm detection in speech. In Interspeech, pages 842–846.

Vilayanur S Ramachandran. 1998. The neurology and evolution of humor, laughter, and smiling: the false alarm theory. Medical Hypotheses, 51(4):351–354.

Veronica Perez Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Multimodal sentiment analysis of Spanish online videos. IEEE Intelligent Systems, 28(3):38–45.

Dafna Shahaf, Eric Horvitz, and Robert Mankoff. 2015. Inside jokes: Identifying humorous cartoon captions. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1065–1074. ACM.

David Stauffer. 1999. Let the good times roll: Building a fun culture. Harvard Management Update, 4(10):4–6.

Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2018. Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176.

Robert A Vartabedian and Laurel Klinger Vartabedian. 1993. Humor in the workplace: A communication challenge.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2019. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. arXiv preprint arXiv:1811.09362.


Melissa B Wanzer, Ann B Frymier, and Jeffrey Irwin. 2010. An explanation of the relationship between instructor humor and student learning: Instructional humor processing theory. Communication Education, 59(1):1–18.

Martin Wollmer, Felix Weninger, Tobias Knaup, Bjorn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53.

Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. 2015. Humor recognition and humor anchor extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2367–2376.

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5):3878.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

Amir Zadeh, Paul Pu Liang, Louis-Philippe Morency, Soujanya Poria, Erik Cambria, and Stefan Scherer. 2018b. Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML).

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018c. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.